tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Document how to use TFDS on Colab with TPU #486

Open danieljanes opened 5 years ago

danieljanes commented 5 years ago

What I need help with / What I was wondering

When trying to use TFDS on Google Colab with TPU acceleration, there's the following exception:

UnimplementedError: File system scheme '[local]' not implemented

What I've tried so far

From e.g. https://cloud.google.com/tpu/docs/quickstart one can see that TPUs expect data to be stored on GCS.

However, there are examples using Keras+TPU on Colab which load data via tf.keras.datasets, such as: https://colab.research.google.com/gist/ceshine/f196d6b030adb1ec3a8d0b50642709dc/keras-fashion-mnist-tpu.ipynb

It would be nice if...

...there was documentation on how to use TFDS with Keras on a TPU in Colab.
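For context, a minimal sketch of the setup that triggers this error (this uses the later TF2-style TPU API, which postdates this report; the dataset choice is arbitrary):

import tensorflow as tf
import tensorflow_datasets as tfds

# Connect to the Colab TPU (tpu='' auto-discovers the Colab TPU address).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)  # requires TF >= 2.3

# With no data_dir, TFDS prepares the data under the local
# ~/tensorflow_datasets, which the remote TPU workers cannot read;
# iterating this dataset inside the TPU input pipeline raises
# "File system scheme '[local]' not implemented".
ds = tfds.load('cifar10', split='train')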

rsepassi commented 5 years ago

Good idea. Did you try setting data_dir to a GCS bucket?
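For illustration, a minimal sketch of that suggestion (the bucket name is a placeholder):

import tensorflow_datasets as tfds

# Prepare and read the dataset directly on GCS so the TPU workers can reach it.
ds = tfds.load('cifar10', split='train',
               data_dir='gs://your-bucket/tensorflow_datasets')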


danieljanes commented 5 years ago

Thanks @rsepassi, I created a GCS bucket and I'm passing the bucket identifier gs://... to TFDS using data_dir. It's returning a 401 though:

tensorflow.python.framework.errors_impl.PermissionDeniedError: Error executing an HTTP request: HTTP response code 401 with body '{
 "error": {
  "errors": [
   {
    "domain": "global",
    "reason": "required",
    "message": "Anonymous caller does not have storage.objects.get access to [...]/cifar10.",
    "locationType": "header",
    "location": "Authorization"
   }
  ],
  "code": 401,
  "message": "Anonymous caller does not have storage.objects.get access to [...]/cifar10."
 }
}
'
         when reading metadata of gs://ox-dnn-tpu/cifar10

I'd guess it's related to authentication and permissions on the GCS bucket, but I'm not quite sure how to set these up in a way that TFDS can use them behind the scenes.

rsepassi commented 5 years ago

Thanks for this. The issue seems to be that the machine you're using doesn't have permission to access the GCS bucket you created. Could you try following this guide (https://cloud.google.com/tpu/docs/storage-buckets) and see if it works?


danieljanes commented 5 years ago

@rsepassi this doc references a so-called project number:

https://cloud.google.com/tpu/docs/storage-buckets#locate_the_service_account

How can we see the project number for Colab? Also, is the project number stable across different runs?

puneetjindal commented 4 years ago

I am running this code from my TF 2.1.0 Docker container on Ubuntu 16.04, with a host machine directory mounted inside the container. Is there any way I can use the host volume mounted inside Docker instead of a GCS bucket? @rsepassi, any help in this regard would be appreciated. There are hundreds of thousands of deep learning beginners in Indian universities who don't have access to GCS and need something they can use easily, beyond Google Colab's limits. If this issue can be solved, it will help me publish content for them.

ValleyZw commented 4 years ago

Hi @danieljanes: I ran into these problems too, and here's the solution for my case:

1. When you hit the "Anonymous caller" problem, you need to get your account authenticated:

   from google.colab import auth
   auth.authenticate_user()

2. Once your account is authenticated, test whether you can read data from GCS:

   !gsutil ls gs://[BUCKET_NAME]

3. Once you can read data from GCS, enable TPU access:

   gsutil acl ch -u [SERVICE_ACCOUNT]:READER gs://[BUCKET_NAME]
   gsutil acl ch -u [SERVICE_ACCOUNT]:WRITER gs://[BUCKET_NAME]

   where SERVICE_ACCOUNT is service-[PROJECT_NUMBER]@cloud-tpu.iam.gserviceaccount.com.

4. The PROJECT_NUMBER for Colab is contained in the error message you get when the TPU is not yet authorized, which should read:

   service-[PROJECT_NUMBER]@cloud-tpu.iam.gserviceaccount.com does not have storage.objects.get access to ...

   So first run your TPU job without TPU auth, wait for that error message, then enable TPU auth as above.

5. In my case, I also needed to grant access to every file in the subfolders:

   gsutil acl ch -u [SERVICE_ACCOUNT]:READER gs://[BUCKET_NAME]/subfolder/*.tfrec

Sincerely hope this helps.
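Putting the steps above together, a combined sketch for a Colab cell (the bucket name, project number, and dataset are placeholders):

from google.colab import auth
import tensorflow_datasets as tfds

# 1. Authenticate the notebook user so GCS reads are no longer anonymous.
auth.authenticate_user()

# 2. One-time setup, once PROJECT_NUMBER is known from the TPU error message:
#    gsutil acl ch -u service-[PROJECT_NUMBER]@cloud-tpu.iam.gserviceaccount.com:READER gs://[BUCKET_NAME]

# 3. Point TFDS at the bucket so the TPU workers can read the prepared data.
ds = tfds.load('cifar10', split='train',
               data_dir='gs://[BUCKET_NAME]/tensorflow_datasets')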

danieljanes commented 4 years ago

Hi @ValleyZw, thanks for getting back to me about this! I'll give it a shot next time I work on Colab.

sayakpaul commented 4 years ago

I was able to execute on cloud v3 TPUs using local files. An example here: https://github.com/sayakpaul/Generating-categories-from-arXiv-paper-titles/blob/master/TPU_Experimentation.ipynb.

aigonna commented 2 years ago

Hi @ValleyZw: I can use gsutil, e.g. !gsutil ls gs://reu/data/ returns:

gs://reu/data/
gs://reu/data/corpus.0.tfrecord
gs://reu/data/corpus.1.tfrecord

But I don't understand why I can't read the file with plain Python:

with open("gs://reu/data/corpus.0.tfrecord", 'r') as f:
    print(f)

and using the paths in my code fails:

corpus_paths = [
    f'gs://reu/data/corpus.{i}.tfrecord' for i in range(10)
]

Please help me! Thanks! I also find that the path appears not to exist: os.path.exists('gs://reu/data/') is False.

Conchylicultor commented 2 years ago

You can use the TFDS pathlib-like API, which works with GCS paths:

path = tfds.core.as_path('gs://reu/data/corpus.0.tfrecord')

# Open like a regular file object.
with path.open('rb') as f:
  pass

# Or read the whole file at once.
content = path.read_bytes()

assert path.exists()
assert path.name == 'corpus.0.tfrecord'

See https://docs.python.org/3/library/pathlib.html to learn more about pathlib.
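For what it's worth, plain TensorFlow file I/O also understands the gs:// scheme, which is why the built-in open and os.path above fail (they only handle local paths). A minimal sketch:

import tensorflow as tf

# tf.io.gfile handles gs:// paths, unlike Python's open / os.path.
assert tf.io.gfile.exists('gs://reu/data/corpus.0.tfrecord')
with tf.io.gfile.GFile('gs://reu/data/corpus.0.tfrecord', 'rb') as f:
    data = f.read()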

aigonna commented 2 years ago

@Conchylicultor Thanks!

TylerADavis commented 1 year ago

Thanks @ValleyZw, your solution worked for me. However, I found that I had to restart the runtime after adding the authentication in order to get everything working properly.

Some sort of documentation or improved error messages would definitely be helpful, since it took a few Google searches to end up here and find a solution. I did try https://cloud.google.com/tpu/docs/storage-buckets, but it didn't resolve the 401 error for me. Maybe it just needed a restart though.