Closed: dhruvluci closed this issue 6 years ago
The error says that the local file system scheme is not supported when running on a TPU. Your config, vocab, and init_checkpoint paths should also point to your Google Cloud Storage bucket.
For example:
python bert/run_squad.py \
  --vocab_file=gs://{your bucket name}/vocab.txt \
  --bert_config_file=gs://{your bucket name}/bert_config.json \
  --init_checkpoint=gs://{your bucket name}/bert_model.ckpt \
  --do_train=False \
  --train_file=train-v1.1.json \
  --do_predict=True \
  --predict_file=dev-v1.1.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://{your bucket name}/squad_large/ \
  --use_tpu=True \
  --tpu_name=grpc://{tpu_name}
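A minimal sketch of a pre-flight check along these lines, assuming the flag names used in the command above. The helper `find_local_paths` and the bucket name `my-bucket` are hypothetical and not part of the BERT repository; the check only reports values that do not start with `gs://`.

```python
# Hypothetical pre-flight check; not part of the BERT repository.
# These are the flags the reply above says should point at a bucket on TPU.
GCS_FLAGS = ("vocab_file", "bert_config_file", "init_checkpoint", "output_dir")


def find_local_paths(flags):
    """Return the names of GCS_FLAGS whose value does not start with gs://."""
    return [
        name
        for name in GCS_FLAGS
        if name in flags and not flags[name].startswith("gs://")
    ]


if __name__ == "__main__":
    example = {
        "vocab_file": "gs://my-bucket/vocab.txt",  # placeholder bucket name
        "bert_config_file": "gs://my-bucket/bert_config.json",
        "init_checkpoint": "./largebert/bert_model.ckpt",  # local path: reported
        "output_dir": "gs://my-bucket/squad_large/",
    }
    print(find_local_paths(example))
```

Because the check is a plain prefix test, a value with a single slash such as `gs:/my-bucket/bert_config.json` is reported as well.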
When I ran the command below in a VM instance with a TPU:
python /home/schen/bert/run_squad.py \
  --vocab_file=gs://{bucket_name}/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=gs:/{bucket_name}/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=gs://{bucket_name}/uncased_L-12_H-768_A-12/bert_model.ckpt \
  --do_train=True \
  --do_predict=True \
  --train_file=/home/schen/squad/train-v1.1.json \
  --predict_file=/home/schen/squad/dev-v1.1.json \
  --train_batch_size=32 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://{bucket_name}/squad_base/ \
  --use_tpu=True \
  --tpu_name=ai
or replaced the last flag with either
--tpu_name=grpc://ai
or
--tpu_name=grpc://{tpu_ip}:8470
I got the following error:
INFO:tensorflow:Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://object.propel.ai/bert/uncased_L-12_H-768_A-12/bert_model.ckpt: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{ "error": { "errors": [ { "domain": "global", "reason": "forbidden", "message": "my_account_email does not have storage.objects.list access to object.propel.ai." } ], "code": 403, "message": "my_account_email does not have storage.objects.list access to object.propel.ai." } } ' when reading gs://object.propel.ai/bert/uncased_L-12_H-768_A-12 [[{{node checkpoint_initializer_139}} = RestoreV2[dtypes=[DT_FLOAT], _device="/job:worker/replica:0/task:0/device:CPU:0"](checkpoint_initializer/prefix, checkpoint_initializer_139/tensor_names, checkpoint_initializer/shape_and_slices)]]
Caused by op u'checkpoint_initializer_139', defined at:
File "/home/schen/bert/run_squad.py", line 1283, in
InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://object.propel.ai/bert/uncased_L-12_H-768_A-12/bert_model.ckpt: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{ "error": { "errors": [ { "domain": "global", "reason": "forbidden", "message": "my_account_email does not have storage.objects.list access to object.propel.ai." } ], "code": 403, "message": "my_account_email does not have storage.objects.list access to object.propel.ai." } } ' when reading gs://object.propel.ai/bert/uncased_L-12_H-768_A-12 [[{{node checkpoint_initializer_139}} = RestoreV2[dtypes=[DT_FLOAT], _device="/job:worker/replica:0/task:0/device:CPU:0"](checkpoint_initializer/prefix, checkpoint_initializer_139/tensor_names, checkpoint_initializer/shape_and_slices)]]
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "/home/schen/bert/run_squad.py", line 1283, in
Caused by op u'checkpoint_initializer_139', defined at:
File "/home/schen/bert/run_squad.py", line 1283, in
InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://object.propel.ai/bert/uncased_L-12_H-768_A-12/bert_model.ckpt: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{ "error": { "errors": [ { "domain": "global", "reason": "forbidden", "message": "my_account_email does not have storage.objects.list access to object.propel.ai." } ], "code": 403, "message": "my_account_email does not have storage.objects.list access to object.propel.ai." } } ' when reading gs://object.propel.ai/bert/uncased_L-12_H-768_A-12 [[{{node checkpoint_initializer_139}} = RestoreV2[dtypes=[DT_FLOAT], _device="/job:worker/replica:0/task:0/device:CPU:0"](checkpoint_initializer/prefix, checkpoint_initializer_139/tensor_names, checkpoint_initializer/shape_and_slices)]]
System information
What is the top-level directory of the model you are using: google-research/bert
Here is the link: https://github.com/google-research/bert
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): my laptop is macOS High Sierra (version 10.13.6). The VM instance is Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_6
TensorFlow installed from (source or binary): python -m pip install tensorflow==1.11
TensorFlow version (use command below): 1.11.0 after running python -c "import tensorflow as tf; print(tf.__version__)"
If I remove the last two flags and do not run on the TPU, it works properly. However, I really want to use the TPU to speed up the computation. I have been stuck on this TPU issue for a long time. When I ran another demo script, bert/run_classifier.py, I got the same error. It's really frustrating. Any help would be appreciated!
Getting the admin to authorize permissions for both my VM account and TPU account solved the issue.
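The 403 bodies quoted in this thread already name the account, the missing permission, and the bucket, which is what a bucket admin needs to know. A small sketch that pulls those three fields out of the error text follows; the function name `parse_denied` is made up for illustration, and the pattern is only matched against the message format seen in this thread. Note that the message names only the account that failed, while the fix above involved both the VM account and the TPU account.

```python
import re

# Pattern for the "message" text in the 403 bodies quoted in this thread.
# The bucket name is followed by a sentence-ending period, which is excluded.
_DENIED = re.compile(
    r"(?P<account>[\w.@-]+) does not have (?P<permission>[\w.]+) "
    r"access to (?P<bucket>[\w.-]+?)\.(?![\w-])"
)


def parse_denied(error_text):
    """Return account, permission and bucket from a 403 message, or None."""
    match = _DENIED.search(error_text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    text = (
        '"message": "519163749326-compute@developer.gserviceaccount.com '
        'does not have storage.objects.list access to bert_models."'
    )
    print(parse_denied(text))
```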
@webstruck So I cannot reference the BERT model with gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12?
The error I am having is
tensorflow.python.framework.errors_impl.InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12/bert_model.ckpt: Permission denied: Error executing an HTTP request: HTTP response code 403 with body '{
"error": {
"errors": [
{
"domain": "global",
"reason": "forbidden",
"message": "519163749326-compute@developer.gserviceaccount.com does not have storage.objects.list access to bert_models."
}
],
"code": 403,
"message": "519163749326-compute@developer.gserviceaccount.com does not have storage.objects.list access to bert_models."
}
}
'
when reading gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12
which is generated by
export BERT_BASE_DIR=gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12
export SQUAD_11_EN_DIR=gs://<my_bucket>/squad1.1
export TPU_NAME=<my_tpu>
python run_squad.py \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--do_train=True \
--train_file=$SQUAD_11_EN_DIR/train-v1.1.json \
--do_predict=True \
--predict_file=$SQUAD_11_EN_DIR/dev-v1.1.json \
--train_batch_size=8 \
--learning_rate=3e-5 \
--num_train_epochs=2.0 \
--max_seq_length=384 \
--doc_stride=128 \
--output_dir=gs://bert_deep_finder/output/ \
--use_tpu=True \
--tpu_name=$TPU_NAME
So, should I upload the model into my bucket? I cannot use the one in gs://bert_models/2018_10_18/uncased_L-12_H-768_A-12?
Thanks
Yes. Download the model, upload it to your Google Cloud Storage bucket, and then either set the path as an environment variable or use the absolute path directly.
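As a rough sketch of that workflow, the snippet below only builds the upload command and the resulting `BERT_BASE_DIR` value; it does not run anything. It assumes `gsutil` is installed on the VM and that the model was unpacked into a local directory; `plan_upload` and `my-bucket` are hypothetical names, not part of the BERT repository.

```python
import posixpath


def plan_upload(local_model_dir, bucket):
    """Return (gsutil argv, BERT_BASE_DIR value) for uploading a model directory.

    Assumes `gsutil cp -r DIR gs://BUCKET` places the files under
    gs://BUCKET/DIR/, which is the documented gsutil behaviour.
    """
    src = local_model_dir.rstrip("/")
    name = posixpath.basename(src)
    argv = ["gsutil", "-m", "cp", "-r", src, "gs://" + bucket]
    return argv, "gs://{}/{}".format(bucket, name)


if __name__ == "__main__":
    # "my-bucket" is a placeholder; use a bucket your VM and TPU can read.
    argv, bert_base_dir = plan_upload("./uncased_L-12_H-768_A-12", "my-bucket")
    print(" ".join(argv))
    print("export BERT_BASE_DIR=" + bert_base_dir)
```

The printed `export` line can then replace the `BERT_BASE_DIR` setting in the command quoted earlier in the thread.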
Sorry for asking, but then how can I use the models that are already online in the bert_models/ storage bucket? I suppose there must be a way, since it's mentioned in the Fine-tuning with Cloud TPUs section of the repo.
Edit:
Could it be that my Cloud TPU is not in the same region as the bert_models/ bucket?
After running the following for about 5 minutes on a cloud-based TPU, I get an error:
Unsuccessful TensorSliceReader constructor: Failed to get matching files
The command is as follows:
python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=24 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=gs://data_for_squad1/Squad1/ \
  --use_tpu=True \
  --tpu_name=$TPU_NAME
The BERT_BASE_DIR (./largebert) has the following files:
bert_config.json bert_model.ckpt.data-00000-of-00001 bert_model.ckpt.index bert_model.ckpt.meta vocab.txt
Here is the detailed Traceback:
    self._traceback = tf_stack.extract_stack()
    _train_on_tpu_system(ctx, model_fn_wrapper, dequeue_fn))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2830, in _train_on_tpu_system
    scaffold = _get_scaffold(captured_scaffold_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2953, in _get_scaffold
    scaffold = scaffold_fn()
  File "run_squad.py", line 584, in tpu_scaffold
    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 187, in init_from_checkpoint
    _init_from_checkpoint, ckpt_dir_or_file, assignment_map)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/distribute.py", line 1053, in merge_call
    return self._merge_call(merge_fn, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/distribute.py", line 1061, in _merge_call
    return merge_fn(self._distribution_strategy, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 231, in _init_from_checkpoint
    _set_variable_or_list_initializer(var, ckpt_file, tensor_name_in_ckpt)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 355, in _set_variable_or_list_initializer
    _set_checkpoint_initializer(variable_or_list, ckpt_file, tensor_name, "")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/checkpoint_utils.py", line 309, in _set_checkpoint_initializer
    ckpt_file, [tensor_name], [slice_spec], [base_type], name=name)[0]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to get matching files on ./largebert/bert_model.ckpt: Unimplemented: File system scheme '[local]' not implemented (file: './largebert/bert_model.ckpt')
  [[node checkpoint_initializer_370 (defined at run_squad.py:584) = RestoreV2[dtypes=[DT_FLOAT], _device="/job:worker/replica:0/task:0/device:CPU:0"](checkpoint_initializer/prefix, checkpoint_initializer_370/tensor_names, checkpoint_initializer/shape_and_slices)]]
Been trying to troubleshoot for a while, not sure where the problem lies. Any help would be appreciated.
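The final line of that traceback names both the unsupported scheme and the path that triggered it. A short sketch for extracting the two from the error text is below; `unsupported_path` is a hypothetical helper, and the pattern is taken only from the message wording shown above. A `[local]` scheme here suggests the restore op, which the traceback places on the `/job:worker` device, was handed a path on the VM's own disk.

```python
import re

# Pattern for the "File system scheme ... not implemented" part of the error.
_SCHEME = re.compile(
    r"File system scheme '([^']*)' not implemented \(file: '([^']*)'\)"
)


def unsupported_path(error_text):
    """Return (scheme, path) from the error text, or None if it does not match."""
    match = _SCHEME.search(error_text)
    return (match.group(1), match.group(2)) if match else None


if __name__ == "__main__":
    text = (
        "Unimplemented: File system scheme '[local]' not implemented "
        "(file: './largebert/bert_model.ckpt')"
    )
    print(unsupported_path(text))
```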