tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Cannot build hugging face datasets #5394

Closed: ppham27 closed this issue 3 weeks ago

ppham27 commented 5 months ago

Short description

$  tfds build huggingface:mnist/mnist

FileNotFoundError: Request failed for https://raw.githubusercontent.com/huggingface/datasets/master/datasets/mnist/dataset_infos.json
 Error: 404
 Reason: b'404: Not Found'

It seems the index (https://github.com/tensorflow/datasets/blob/751053fdb0f39cfc0d30797d3119b81306b91d5a/tensorflow_datasets/core/community/cache.py#L22) is out of date: judging by the failing URL, it still points at the huggingface/datasets GitHub repository and hasn't been updated to use the Hub (see https://github.com/huggingface/datasets/pull/4059).
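To check whether the stale path is really the problem, the request can be reproduced outside of tfds. The following is only a minimal sketch: the URL layout is copied from the 404 in the error message, and `legacy_github_url` and `url_exists` are hypothetical helper names, not part of TFDS.

```python
import urllib.error
import urllib.request

# Assumption: this base path is taken verbatim from the failing URL in the
# error message; it is not read from the TFDS index itself.
LEGACY_BASE = "https://raw.githubusercontent.com/huggingface/datasets/master/datasets"


def legacy_github_url(dataset: str, filename: str = "dataset_infos.json") -> str:
    """Build the GitHub URL that the error message shows being requested."""
    return f"{LEGACY_BASE}/{dataset}/{filename}"


def url_exists(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request succeeds, False on an HTTP error (e.g. 404)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return False
```

Calling `url_exists(legacy_github_url("mnist"))` needs network access; if the path is gone, as the traceback suggests, it should return False.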

Environment information

Reproduction instructions

 tfds build huggingface:mnist/mnist

Link to logs

INFO[config.py]: Loading namespace config from /usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/community-datasets.toml
Traceback (most recent call last):
  File "/usr/local/google/home/phillypham/venv/grain/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
             ^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/main.py", line 105, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/main.py", line 100, in main
    args.subparser_fn(args)
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/build.py", line 302, in _build_datasets
    builders_cls_and_kwargs = [
                              ^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/build.py", line 303, in <listcomp>
    _get_builder_cls_and_kwargs(dataset, has_imports=bool(args.imports))
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/build.py", line 420, in _get_builder_cls_and_kwargs
    builder_cls = tfds.builder_cls(str(name))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 114, in builder_cls
    return community.community_register().builder_cls(ds_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/registry.py", line 259, in builder_cls
    return registers[0].builder_cls(name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/register_package.py", line 249, in builder_cls
    installed_dataset = _download_or_reuse_cache(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/register_package.py", line 402, in _download_or_reuse_cache
    installed_package = _download_and_cache(package)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/register_package.py", line 449, in _download_and_cache
    dataset_sources_lib.download_from_source(
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/dataset_sources.py", line 80, in download_from_source
    path.copy(dst / path.name)
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/github_api/github_path.py", line 338, in copy
    dst.write_bytes(self.read_bytes())
                    ^^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/github_api/github_path.py", line 311, in read_bytes
    return get_content(url)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/github_api/github_path.py", line 44, in get_content
    raise FileNotFoundError(
FileNotFoundError: Request failed for https://raw.githubusercontent.com/huggingface/datasets/master/datasets/mnist/dataset_infos.json
 Error: 404
 Reason: b'404: Not Found'

Expected behavior

The command should succeed and call download_and_prepare.

Additional context

python -c "import tensorflow_datasets as tfds; tfds.builder('huggingface:mnist/mnist')"

works.
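Since `tfds.builder` resolves the dataset from Python, one possible workaround is to drive the build from Python instead of the CLI. This is an untested sketch: `build_from_python` is a hypothetical helper, and whether `download_and_prepare()` then succeeds for this dataset is an assumption, not something confirmed in this thread.

```python
from typing import Any, Callable, Optional


def build_from_python(
    name: str,
    builder_fn: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Resolve a builder by name and prepare the dataset.

    `builder_fn` defaults to `tfds.builder`; it is a parameter only so the
    helper can be exercised without TensorFlow Datasets installed.
    """
    if builder_fn is None:
        # Imported lazily so that importing this module stays cheap.
        import tensorflow_datasets as tfds

        builder_fn = tfds.builder

    builder = builder_fn(name)
    # Assumption: this performs the same preparation step that `tfds build`
    # is expected to reach.
    builder.download_and_prepare()
    return builder
```

For example, `build_from_python("huggingface:mnist/mnist")` would use the same name that works with `tfds.builder` above.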

lbo462 commented 3 months ago

Have you tried replacing / with __ in the dataset name?

If you only need MNIST, you can pull it directly from the TensorFlow Datasets catalog (https://www.tensorflow.org/datasets/catalog/overview):

python -c "import tensorflow_datasets as tfds; tfds.builder('mnist')"

works as well.

If you do need to pull a dataset from Hugging Face, consider using tfds.load() and replacing / with __ in the dataset name.
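The renaming suggested here can be sketched as follows. `to_tfds_hf_name` and `load_from_hf` are hypothetical helpers that only apply the / to __ substitution described above; whether the resulting name resolves for a given dataset is not verified.

```python
def to_tfds_hf_name(hf_name: str) -> str:
    """Turn a Hugging Face style name into a `huggingface:` TFDS name.

    Only the substitution suggested in this thread is applied: every "/"
    becomes "__". Any further normalization TFDS may do is not reproduced.
    """
    return "huggingface:" + hf_name.replace("/", "__")


def load_from_hf(hf_name: str, split: str = "train"):
    """Load a split through tfds.load() using the converted name."""
    # Imported lazily so the name helper works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(to_tfds_hf_name(hf_name), split=split)
```

With this helper, `to_tfds_hf_name("mnist/mnist")` gives `huggingface:mnist__mnist`, which is the name `load_from_hf` would pass to `tfds.load()`.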

Hope this helps.