Open NikhilBartwal opened 3 years ago
@Conchylicultor I would like to work further upon this. It would be great if you could have a look! Thanks.
It seems that beside mnist and fashion mnist, none of the hugging faces datasets are using this https://github.com/huggingface/datasets/search?p=1&q=array2d As both datasets are already present in TFDS, we didn't think it was worth supporting this.
Hugging face compatibility layer is implemented in https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/community/huggingface_wrapper.py
Maybe we could raise a more explicit NotImplementedError ? Or add it if it's trivial to add.
As both datasets are already present in TFDS, we didn't think it was worth supporting this.
@Conchylicultor , Hi, this sound reasonable. But it got me wondering, that there can be several instances where a user is trying to load a CommunityDataset
, but the same dataset is already present in TFDS. Wouldn't it be reasonable to set up hierarchy with higher preference to datasets in TFDS (if not present then fallback to CDS, or even the reverse[CDS > TFDS], users may even forcefully override the hierarchy)?
Thanks.
Maybe we could raise a more explicit NotImplementedError ? Or add it if it's trivial to add.
@Conchylicultor I think we can do what @harsh020 mentioned i.e. giving a higher priority to the TFDS dataset if it is present in TFDS and HF ( or Community Datsets in general) both.
That way, even if a user is not aware of the same dataset being in TFDS as well, tfds.load
will automatically load the TFDS one and fall back to Community datasets if it's not present in TFDS.
We can add somethng like override_tfds
argument for tfds.load
for specifically loading the Community Datasets variant.
Also, we might need to add more cleaner missing dependency errors when importing community datasets, as outlined here: https://github.com/tensorflow/datasets/blob/ae21fae259ce6e5aa8beb10f18b6f4cff93de288/tensorflow_datasets/core/community/load.py#L48
If the user pass tfds.load('huggingface:my_dataset')
and TFDS returns tfds.load('my_dataset')
, it would be quite confusing for the user I believe. I expect that users will want to add and use their own variant of existing tfds datasets, so we should not update this.
At best, we could eventually log a message to indicate the dataset also exists in TFDS but I'm not even sure.
@Conchylicultor I think what you are saying makes sense.
So what do you think about mnist
and fashion_mnist
from HF? Since the Array2D
is not implemented, we can use the prioritization for these two datasets specifically. So that tfds.load('huggingface:mnist')
gives the TFDS mnist
dataset (along with a message maybe)
I expect that users will want to add and use their own variant of existing tfds datasets, so we should not update this.
I think that would avoid confusion and is better, at least for now, as we are still on experimentation for community ds.
@NikhilBartwal I don't mean to meddle here but I think @Conchylicultor is right, currently we just have one implementation of community dataset so stating something that would act generic even for the future datasets would be difficult as of now. I think we should simply log the availability of ds in TFDS (we can't even be sure that intercepting the error would be a good idea right now). Let me know what you think.
Thank you.
@harsh020 I agree with you that the user should be able to generate the dataset variant that they need.
As far as mnist
and fashion_mnist
datasets are concerned, we can either implement the HF's Array2D
feature or make tfds.load
use the TFDS dataset whichever is more efficient.
Seeing that the Array2D
is used only for these two datasets right now, but might be used in future HF datasets, we might need to implement it in TFDS at some point i guess.
@Conchylicultor Please let me know your thoughts on how should we proceed with this. Thanks!
Hi ! huggingface:mnist and huggingface:fashion_mnist don't use the Array2D feature anymore, but the Image feature now
Short description Loading the
huggingface:mnist
gives ArributeErrorEnvironment information
Operating System: Windows 10
Python version: 3.8
tensorflow-datasets
/tfds-nightly
version: 4.2.0+nightlytensorflow
/tf-nightly
version: 2.4.0Does the issue still exists with the last
tfds-nightly
package (pip install --upgrade tfds-nightly
) ? YesReproduction instructions
Expected behavior Since, the datasets library has a well defined
Array2D
method, so there souldn't have been any error and the dataset should have been generated seamlessly.Additional context This issue might be part of a larger problem with interfacing
huggingface
datasets with TFDS as there seems to be unusual errors with other HF datasets in TFDS as well likehuggingface:xnli
etc.