huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Question: Shouldn't .info be a part of DatasetDict? #1687

Open KennethEnevoldsen opened 3 years ago

KennethEnevoldsen commented 3 years ago

Currently, only Dataset contains .info and .features. However, many datasets contain standard splits (train, test), so the underlying information is the same (or at least should be) across those splits.

For instance:

>>> import datasets
>>> ds = datasets.load_dataset("conll2002", "es")
>>> ds.info
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'DatasetDict' object has no attribute 'info'

I could imagine that this wouldn't work for dataset dicts which hold entirely different datasets (multimodal datasets), but it seems odd that splits of the same dataset are treated the same as what are essentially different datasets.

Intuitively, it would also make sense that a dataset loaded via load_dataset has a common .info which covers the entire dataset.

It is entirely possible that I am missing another perspective.

thomwolf commented 3 years ago

We could do something. There is a part of .info which is split-specific (cache files, split instructions), but maybe it could be made to work.

KennethEnevoldsen commented 3 years ago

Yes, this was kinda the idea I was going for. DatasetDict.info would be the shared info amongst the datasets (maybe even some info on how they differ).
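
To make the idea concrete, here is a rough sketch of how shared and split-specific info could be separated. It is not how the library does anything today: merge_split_infos is a hypothetical helper, and the per-split info is modelled as plain dicts with made-up fields (description, license, num_examples) so the example runs without the datasets package.

```python
from typing import Any, Dict, Mapping, Tuple


def merge_split_infos(
    infos: Mapping[str, Mapping[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Dict[str, Any]]]:
    """Separate per-split info into (shared fields, fields that differ by split).

    Hypothetical helper, not part of the datasets library.
    """
    if not infos:
        return {}, {}

    # Collect every field name that appears in at least one split.
    all_keys = set()
    for info in infos.values():
        all_keys.update(info)

    missing = object()  # sentinel for "this split has no such field"
    shared: Dict[str, Any] = {}
    differing: Dict[str, Dict[str, Any]] = {}

    for key in sorted(all_keys):
        values = {split: info.get(key, missing) for split, info in infos.items()}
        first = next(iter(values.values()))
        if first is not missing and all(
            v is not missing and v == first for v in values.values()
        ):
            # Present in every split with the same value: shared info.
            shared[key] = first
        else:
            # Absent somewhere or unequal: keep the per-split values.
            differing[key] = {s: v for s, v in values.items() if v is not missing}

    return shared, differing


# Toy per-split info; the field names and values are illustrative only.
infos = {
    "train": {"description": "toy corpus", "license": "MIT", "num_examples": 80},
    "test": {"description": "toy corpus", "license": "MIT", "num_examples": 20},
}
shared, differing = merge_split_infos(infos)
print(shared)
print(differing)
```

With real data, the input dicts would have to be built from each split's .info, and fields such as cache files or split instructions would presumably end up in the differing part.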