tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

NotImplementedError: While importing/Loading tfds plant_leaves dataset #5416

Open Coolcoder45 opened 4 months ago

Coolcoder45 commented 4 months ago


Short description: tfds plant_leaves does not load successfully; it throws NotImplementedError. Tried on May 16, 2024.

Environment information

Reproduction instructions

import tensorflow_datasets as tfds
plant_leaves = tfds.load('plant_leaves', split='train', shuffle_files=True)

Gives:

Downloading and preparing dataset 6.56 GiB (download: 6.56 GiB, generated: 6.81 GiB, total: 13.37 GiB) to /root/tensorflow_datasets/plant_leaves/0.1.1...
Dl Completed...: 100%
 1/1 [10:04<00:00, 604.39s/ url]
Dl Size...: 100%
 6718/6718 [10:04<00:00, 11.25 MiB/s]
Dataset plant_leaves downloaded and prepared to /root/tensorflow_datasets/plant_leaves/0.1.1. Subsequent calls will reuse this data.
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
[<ipython-input-3-d88d46497437>](https://localhost:8080/#) in <cell line: 2>()
      1 import tensorflow_datasets as tfds
----> 2 plant_leaves = tfds.load('plant_leaves', split='train', shuffle_files=True)

33 frames
[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/file_adapters.py](https://localhost:8080/#) in make_tf_data(cls, filename, buffer_size)
    206   ) -> tf.data.Dataset:
    207     """Returns TensorFlow Dataset comprising given array record file."""
--> 208     raise NotImplementedError(
    209         '`.as_dataset()` not implemented for ArrayRecord files. Please, use'
    210         ' `.as_data_source()`.'

NotImplementedError: `.as_dataset()` not implemented for ArrayRecord files. Please, use `.as_data_source()`.

Expected behavior: The dataset loads successfully.
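
The error message above points at `.as_data_source()`. A minimal sketch of that route follows, assuming `tfds.data_source` is available in the installed TFDS version. The helper names `load_with_fallback` and `load_plant_leaves` are hypothetical, and this is one possible fallback pattern, not a fix for the underlying bug.

```python
from typing import Any, Callable, Tuple


def load_with_fallback(
    load_dataset: Callable[[], Any],
    load_data_source: Callable[[], Any],
) -> Tuple[str, Any]:
    """Try the tf.data path first; fall back to a data source on failure.

    Hypothetical helper: returns a tag naming the path that worked,
    together with the loaded object.
    """
    try:
        return "dataset", load_dataset()
    except NotImplementedError:
        # The traceback shows TFDS raising this for ArrayRecord files.
        return "data_source", load_data_source()


def load_plant_leaves() -> Tuple[str, Any]:
    # Imported lazily so the helper above can be used without TFDS installed.
    import tensorflow_datasets as tfds

    return load_with_fallback(
        lambda: tfds.load("plant_leaves", split="train", shuffle_files=True),
        # Assumption: tfds.data_source exists in the installed version.
        lambda: tfds.data_source("plant_leaves", split="train"),
    )
```

Note that the two paths return different kinds of objects (a `tf.data.Dataset` versus a random-access data source), so calling code needs to check the returned tag before iterating.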

pierrot0 commented 4 months ago

Hi, thank you for reporting! This is definitely a bug.

Workaround: add the following arg to your tfds.load call:

tfds.load(..., download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})

We'll look into how to update the code and post an update on this bug.

Coolcoder45 commented 4 months ago

It still gives an error.

import tensorflow_datasets as tfds
plant_leaves_data, plant_leaves_info = tfds.load('plant_leaves', split='train', shuffle_files=True, download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})

Gives

Downloading and preparing dataset 6.56 GiB (download: 6.56 GiB, generated: 6.81 GiB, total: 13.37 GiB) to /root/tensorflow_datasets/plant_leaves/0.1.1...
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-3-608b46b22c6c>](https://localhost:8080/#) in <cell line: 4>()
      2 #plant_leaves = tfds.load('plant_leaves', split='train', shuffle_files=True)
      3 #plant_leaves_data, plant_leaves_info = tfds.load('plant_leaves', split='train', shuffle_files=True, as_data_source=True)
----> 4 plant_leaves_data, plant_leaves_info = tfds.load('plant_leaves', split='train', shuffle_files=True, download_and_prepare_kwargs={'file_format': tfds.core.FileFormat.ARRAY_RECORD})

5 frames
[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py](https://localhost:8080/#) in __call__(self, function, instance, args, kwargs)
    167     metadata = self._start_call()
    168     try:
--> 169       return function(*args, **kwargs)
    170     except Exception:
    171       metadata.mark_error()

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py](https://localhost:8080/#) in load(name, split, data_dir, batch_size, shuffle_files, download, as_supervised, decoders, read_config, with_info, builder_kwargs, download_and_prepare_kwargs, as_dataset_kwargs, try_gcs)
    645       try_gcs,
    646   )
--> 647   _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
    648 
    649   if as_dataset_kwargs is None:

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/load.py](https://localhost:8080/#) in _download_and_prepare_builder(dbuilder, download, download_and_prepare_kwargs)
    504   if download:
    505     download_and_prepare_kwargs = download_and_prepare_kwargs or {}
--> 506     dbuilder.download_and_prepare(**download_and_prepare_kwargs)
    507 
    508 

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/logging/__init__.py](https://localhost:8080/#) in __call__(self, function, instance, args, kwargs)
    167     metadata = self._start_call()
    168     try:
--> 169       return function(*args, **kwargs)
    170     except Exception:
    171       metadata.mark_error()

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_builder.py](https://localhost:8080/#) in download_and_prepare(self, download_dir, download_config, file_format)
    679     # to generate the files.
    680     if file_format:
--> 681       self.info.set_file_format(file_format, override=True)
    682 
    683     # Create a tmp dir and rename to self.data_dir on successful exit.

[/usr/local/lib/python3.10/dist-packages/tensorflow_datasets/core/dataset_info.py](https://localhost:8080/#) in set_file_format(self, file_format, override)
    470       )
    471     if override and self._fully_initialized:
--> 472       raise RuntimeError(
    473           "Cannot override the file format "
    474           "when the DatasetInfo is already fully initialized!"

RuntimeError: Cannot override the file format when the DatasetInfo is already fully initialized!
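
To make the failure above easier to follow, here is a toy model of the guard visible in the last traceback frame. `ToyDatasetInfo` is a hypothetical stand-in, not the real `DatasetInfo`; it reproduces only the `override and self._fully_initialized` check and ignores everything else the real method does.

```python
class ToyDatasetInfo:
    """Hypothetical stand-in that models only the file-format guard."""

    def __init__(self, file_format: str, fully_initialized: bool) -> None:
        self.file_format = file_format
        self._fully_initialized = fully_initialized

    def set_file_format(self, file_format: str, override: bool = False) -> None:
        # Same condition as the check shown in the traceback.
        if override and self._fully_initialized:
            raise RuntimeError(
                "Cannot override the file format "
                "when the DatasetInfo is already fully initialized!"
            )
        self.file_format = file_format
```

In this model, passing `file_format` through `download_and_prepare_kwargs` corresponds to calling `set_file_format(..., override=True)`, which raises as soon as the info object is already fully initialized.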

dddraxxx commented 2 months ago

Same error on the ref_coco dataset. NotImplementedError: `.as_dataset()` not implemented for ArrayRecord files. Please, use `.as_data_source()`.

dddraxxx commented 2 months ago

Anyway, one thing I do to work around this is to add the following lines:

import tensorflow_datasets as tfds

# Switch the file format away from ArrayRecord before preparing the data.
builder = tfds.builder('ref_coco/refcocog_umd')
builder.info.set_file_format(tfds.core.FileFormat.PARQUET, override=True, override_if_initialized=True)
builder.download_and_prepare()
ref_ds = tfds.load('ref_coco/refcocog_umd', split='validation')