googlecolab / colabtools

Python libraries for Google Colaboratory
Apache License 2.0

TPU Runtime can't access Google Cloud Storage #1056

Open jonyvp opened 4 years ago

jonyvp commented 4 years ago
import os
import tensorflow as tf

train_filenames = tf.io.gfile.glob(os.path.join("gs://", config.bucket_name, config.tfrecord_path, 'train/*'))
train_ds = tf.data.TFRecordDataset(train_filenames, compression_type='GZIP')

amount_of_data = 0
for _ in train_ds:  # datasets are directly iterable in eager mode
    amount_of_data += 1

This raises a PermissionDeniedError:

AttributeError                            Traceback (most recent call last)
/tensorflow-2.1.0/python3.6/tensorflow_core/python/data/ops/iterator_ops.py in _next_internal(self)
    662         # Fast path for the case `self._structure` is not a nested structure.
--> 663         return self._element_spec._from_compatible_tensor_list(ret)  # pylint: disable=protected-access
    664       except AttributeError:

AttributeError: 'tuple' object has no attribute '_from_compatible_tensor_list'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
14 frames
RuntimeError: Error while creating shape

During handling of the above exception, another exception occurred:

PermissionDeniedError                     Traceback (most recent call last)
/tensorflow-2.1.0/python3.6/tensorflow_core/python/eager/executor.py in wait(self)
     65   def wait(self):
     66     """Waits for ops dispatched in this executor to finish."""
---> 67     pywrap_tensorflow.TFE_ExecutorWaitForAllPendingNodes(self._handle)
     68 
     69   def clear_error(self):

PermissionDeniedError: Error executing an HTTP request: HTTP response code 403 with body '{
  "error": {
    "code": 403,
    "message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.get access to bucketname/tfrecords/segment_tfrecords/train/20200304.",
    "errors": [
      {
        "message": "service-495559152420@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.get access to bucketname/tfrecords/segment_tfrecords/train/20200304.",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}
'
     when reading metadata of gs://bucketname/tfrecords/segment_tfrecords/train/20200304
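For what it's worth, the counting pattern itself seems fine. A minimal local sketch (no GCS or TPU involved; the temp file path is made up for illustration) writes a GZIP-compressed TFRecord file and counts its records the same way, which suggests the problem is bucket permissions rather than the tf.data code:

```python
import os
import tempfile

import tensorflow as tf

# Write a small GZIP-compressed TFRecord file locally.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "train.tfrecord.gz")
options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter(path, options) as writer:
    for i in range(3):
        writer.write(b"record-%d" % i)

# Same pattern as in the report: build the dataset, then count records.
train_ds = tf.data.TFRecordDataset([path], compression_type="GZIP")
amount_of_data = sum(1 for _ in train_ds)
print(amount_of_data)
```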
jonyvp commented 4 years ago

It works, though, when I define my model (and thus the TPU strategy) after loading the TFRecordDataset, i.e. first loading the dataset, then defining:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)

with strategy.scope():
    ...
ashsny commented 4 years ago

I think we need a few more pieces of info here - is this a private bucket? are you using tensorflow-gcs-config? can you provide a self-contained repro notebook?
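For reference, the usual way to give the Colab TPU credentials for a private bucket is through the tensorflow-gcs-config package ashsny mentions. A hedged sketch (Colab-only; assumes the package is available in the runtime):

```python
def configure_tpu_gcs_access():
    """Propagate the Colab user's GCS credentials to the TPU (Colab-only sketch)."""
    # Both imports only resolve inside a Colab runtime.
    from google.colab import auth
    import tensorflow_gcs_config

    auth.authenticate_user()  # interactive OAuth flow in the notebook
    tensorflow_gcs_config.configure_gcs_from_colab_auth()  # hand credentials to the TPU
```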

kechan commented 2 years ago

I ran into this on Colab with TF 2.8.0. I am also trying to instantiate a tf.data dataset from a TFRecord stored on GCS.

Any idea what the root cause may be? This used to work for me 2-3 months ago. I also tried what you did, defining the TPU strategy AFTER loading the dataset, and it seems to work.

I think this is a TPU-related bug? I will try a GPU next and see if the same error happens.

Update: Although I can sanity-test by iterating manually over the dataset, I got a further error when I tried model.fit(...).

ahyunsoo3 commented 1 year ago

This is a security measure: GCS prevents anonymous access to the bucket, so you must grant the TPU the right to read the data. Open the Permissions tab of your GCS bucket, add a new principal named service-495559152420@cloud-tpu.iam.gserviceaccount.com, and grant it a role that includes storage.objects.get (e.g. Storage Object Viewer).
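The Permissions-tab steps above can also be scripted. A sketch using the google-cloud-storage client (the Storage Object Viewer role is an assumption that covers the storage.objects.get permission named in the error; running it requires credentials allowed to administer the bucket):

```python
# Service agent name taken from the error message in this issue.
TPU_SERVICE_AGENT = (
    "serviceAccount:service-495559152420@cloud-tpu.iam.gserviceaccount.com"
)

def tpu_reader_binding(member=TPU_SERVICE_AGENT):
    """IAM binding granting read access (includes storage.objects.get)."""
    return {"role": "roles/storage.objectViewer", "members": {member}}

def grant_tpu_access(bucket_name):
    # Deferred import: needs the google-cloud-storage package and
    # credentials with permission to change the bucket's IAM policy.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(tpu_reader_binding())
    bucket.set_iam_policy(policy)
```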