tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0
4.23k stars 1.52k forks source link

'huggingface:mnist' dataset not generating #3148

Open NikhilBartwal opened 3 years ago

NikhilBartwal commented 3 years ago

Short description Loading the huggingface:mnist gives ArributeError

Environment information

Reproduction instructions

>>> nik = tfds.load('huggingface:mnist')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "f:\tfds\datasets\tensorflow_datasets\core\load.py", line 325, in load
    dbuilder = builder(name, data_dir=data_dir, try_gcs=try_gcs, **builder_kwargs)
  File "f:\tfds\datasets\tensorflow_datasets\core\load.py", line 172, in builder
    return cls(**builder_kwargs)  # pytype: disable=not-instantiable
  File "f:\tfds\datasets\tensorflow_datasets\core\dataset_builder.py", line 947, in __init__
    super().__init__(**kwargs)
  File "f:\tfds\datasets\tensorflow_datasets\core\dataset_builder.py", line 182, in __init__
    self.info.initialize_from_bucket()
  File "f:\tfds\datasets\tensorflow_datasets\core\utils\py_utils.py", line 149, in __get__
    cached = self.fget(obj)  # pytype: disable=attribute-error
  File "f:\tfds\datasets\tensorflow_datasets\core\community\huggingface_wrapper.py", line 124, in info
    info = self._info()
  File "C:\Users\nikhil\.cache\tensorflow_datasets\modules\tfds_community\huggingface\mnist\cf39692eaead13bb56fce5bcd2a1b230769d6801ce84feca3620e4201fdb8706\mnist.py", line 68, in _info
    "image": datasets.Array2D(shape=(28, 28), dtype="uint8"),
AttributeError: Failed to construct dataset huggingface:mnist: module 'datasets' has no attribute 'Array2D'

Expected behavior Since, the datasets library has a well defined Array2D method, so there souldn't have been any error and the dataset should have been generated seamlessly.

Additional context This issue might be part of a larger problem with interfacing huggingface datasets with TFDS as there seems to be unusual errors with other HF datasets in TFDS as well like huggingface:xnli etc.

NikhilBartwal commented 3 years ago

@Conchylicultor I would like to work further upon this. It would be great if you could have a look! Thanks.

Conchylicultor commented 3 years ago

It seems that beside mnist and fashion mnist, none of the hugging faces datasets are using this https://github.com/huggingface/datasets/search?p=1&q=array2d As both datasets are already present in TFDS, we didn't think it was worth supporting this.

Hugging face compatibility layer is implemented in https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/community/huggingface_wrapper.py

Maybe we could raise a more explicit NotImplementedError ? Or add it if it's trivial to add.

harsh020 commented 3 years ago

As both datasets are already present in TFDS, we didn't think it was worth supporting this.

@Conchylicultor , Hi, this sound reasonable. But it got me wondering, that there can be several instances where a user is trying to load a CommunityDataset, but the same dataset is already present in TFDS. Wouldn't it be reasonable to set up hierarchy with higher preference to datasets in TFDS (if not present then fallback to CDS, or even the reverse[CDS > TFDS], users may even forcefully override the hierarchy)?

Thanks.

NikhilBartwal commented 3 years ago

Maybe we could raise a more explicit NotImplementedError ? Or add it if it's trivial to add.

@Conchylicultor I think we can do what @harsh020 mentioned i.e. giving a higher priority to the TFDS dataset if it is present in TFDS and HF ( or Community Datsets in general) both. That way, even if a user is not aware of the same dataset being in TFDS as well, tfds.load will automatically load the TFDS one and fall back to Community datasets if it's not present in TFDS.

We can add somethng like override_tfds argument for tfds.load for specifically loading the Community Datasets variant.

Also, we might need to add more cleaner missing dependency errors when importing community datasets, as outlined here: https://github.com/tensorflow/datasets/blob/ae21fae259ce6e5aa8beb10f18b6f4cff93de288/tensorflow_datasets/core/community/load.py#L48

Conchylicultor commented 3 years ago

If the user pass tfds.load('huggingface:my_dataset') and TFDS returns tfds.load('my_dataset'), it would be quite confusing for the user I believe. I expect that users will want to add and use their own variant of existing tfds datasets, so we should not update this. At best, we could eventually log a message to indicate the dataset also exists in TFDS but I'm not even sure.

NikhilBartwal commented 3 years ago

@Conchylicultor I think what you are saying makes sense. So what do you think about mnist and fashion_mnist from HF? Since the Array2D is not implemented, we can use the prioritization for these two datasets specifically. So that tfds.load('huggingface:mnist') gives the TFDS mnist dataset (along with a message maybe)

harsh020 commented 3 years ago

I expect that users will want to add and use their own variant of existing tfds datasets, so we should not update this.

I think that would avoid confusion and is better, at least for now, as we are still on experimentation for community ds.

@NikhilBartwal I don't mean to meddle here but I think @Conchylicultor is right, currently we just have one implementation of community dataset so stating something that would act generic even for the future datasets would be difficult as of now. I think we should simply log the availability of ds in TFDS (we can't even be sure that intercepting the error would be a good idea right now). Let me know what you think.

Thank you.

NikhilBartwal commented 3 years ago

@harsh020 I agree with you that the user should be able to generate the dataset variant that they need. As far as mnist and fashion_mnist datasets are concerned, we can either implement the HF's Array2D feature or make tfds.load use the TFDS dataset whichever is more efficient.

Seeing that the Array2D is used only for these two datasets right now, but might be used in future HF datasets, we might need to implement it in TFDS at some point i guess.

@Conchylicultor Please let me know your thoughts on how should we proceed with this. Thanks!

lhoestq commented 2 years ago

Hi ! huggingface:mnist and huggingface:fashion_mnist don't use the Array2D feature anymore, but the Image feature now