huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.24k stars 2.69k forks source link

YAML integer keys are not preserved Hub server-side #5275

Closed albertvillanova closed 1 year ago

albertvillanova commented 1 year ago

After an internal discussion (https://github.com/huggingface/moon-landing/issues/4563):

On the other hand, at datasets we are currently using YAML integer keys for dataset_info class_label.

Please note (thanks @lhoestq for pointing out) that previous versions (2.6 and 2.7) of datasets need being patched:

In [18]: Features._from_yaml_list([{'dtype': {'class_label': {'names': {'0': 'neg', '1': 'pos'}}}, 'name': 'label'}])                                                                                      
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-974f07eea526> in <module>
----> 1 Features._from_yaml_list(ry)

~/Desktop/hf/nlp/src/datasets/features/features.py in _from_yaml_list(cls, yaml_data)
   1743                 raise TypeError(f"Expected a dict or a list but got {type(obj)}: {obj}")
   1744 
-> 1745         return cls.from_dict(from_yaml_inner(yaml_data))
   1746 
   1747     def encode_example(self, example):

~/Desktop/hf/nlp/src/datasets/features/features.py in from_yaml_inner(obj)
   1739             elif isinstance(obj, list):
   1740                 names = [_feature.pop("name") for _feature in obj]
-> 1741                 return {name: from_yaml_inner(_feature) for name, _feature in zip(names, obj)}
   1742             else:
   1743                 raise TypeError(f"Expected a dict or a list but got {type(obj)}: {obj}")

~/Desktop/hf/nlp/src/datasets/features/features.py in <dictcomp>(.0)
   1739             elif isinstance(obj, list):
   1740                 names = [_feature.pop("name") for _feature in obj]
-> 1741                 return {name: from_yaml_inner(_feature) for name, _feature in zip(names, obj)}
   1742             else:
   1743                 raise TypeError(f"Expected a dict or a list but got {type(obj)}: {obj}")

~/Desktop/hf/nlp/src/datasets/features/features.py in from_yaml_inner(obj)
   1734                             return {"_type": snakecase_to_camelcase(obj["dtype"])}
   1735                     else:
-> 1736                         return from_yaml_inner(obj["dtype"])
   1737                 else:
   1738                     return {"_type": snakecase_to_camelcase(_type), **unsimplify(obj)[_type]}

~/Desktop/hf/nlp/src/datasets/features/features.py in from_yaml_inner(obj)
   1736                         return from_yaml_inner(obj["dtype"])
   1737                 else:
-> 1738                     return {"_type": snakecase_to_camelcase(_type), **unsimplify(obj)[_type]}
   1739             elif isinstance(obj, list):
   1740                 names = [_feature.pop("name") for _feature in obj]

~/Desktop/hf/nlp/src/datasets/features/features.py in unsimplify(feature)
   1704             if isinstance(feature.get("class_label"), dict) and isinstance(feature["class_label"].get("names"), dict):
   1705                 label_ids = sorted(feature["class_label"]["names"])
-> 1706                 if label_ids and label_ids != list(range(label_ids[-1] + 1)):
   1707                     raise ValueError(
   1708                         f"ClassLabel expected a value for all label ids [0:{label_ids[-1] + 1}] but some ids are missing."

TypeError: can only concatenate str (not "int") to str

TODO:

albertvillanova commented 1 year ago

@huggingface/datasets if you agree, I can make the bulk edit on the Hub to fix integer keys into strings.

julien-c commented 1 year ago

Ok for me, and we can merge (internal) https://github.com/huggingface/moon-landing/pull/4609

lhoestq commented 1 year ago

FYI there are still 2k+ weekly users on datasets 2.6.1 which doesn't support the string label format for class labels. And among those, some are using datasets with class labels like imdb (60 users), conllpp (40), msra_ner (40), peoples_daily_enr (40), weibo_ner (30), conll2003 (20), etc. And renaming to string would break these users code.

julien-c commented 1 year ago

but isn't datasets 2.6.1 downloading files from the Hub with the corresponding tag? I thought we had something like this before

lhoestq commented 1 year ago

We're using main as models do. Some datasets need to be updated from time to time, e.g. when a link to download the data is dead.

But yea a year ago we had those tags, we just ended up not using them

lhoestq commented 1 year ago

I opened https://github.com/huggingface/datasets/issues/5406 to communicate on this. Let me know what you think, and if it sounds good to you I can pin this issue

albertvillanova commented 1 year ago

So, is it OK to make the bulk edit on the Hub now or should we wait longer? If the latter, how long?

lhoestq commented 1 year ago

I think we can do it. If you want to be extra cautious you can do it for all datasets except imdb and conllpp for now which are actively used by 2.6.1 users. For those two we can keep the YAML like this for some more time, or alternatively use the old dataset_infos.json file

albertvillanova commented 1 year ago

The bulk edit of canonical datasets (except imdb and conllpp) is running.

See e.g.: https://huggingface.co/datasets/acronym_identification/discussions/3

EDITED: Done, except for "universal_morphologies", where I get

HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/validate-yaml

Also not done for the datasets missing matadata "dataset_info":

lhoestq commented 1 year ago

Thank you !

albertvillanova commented 1 year ago

@lhoestq, there are 6 community datasets with YAML integer keys in their dataset_info class_label:

Maybe we could open a PR on them as well?

lhoestq commented 1 year ago

Let's do this then:

EDIT: all done :)

albertvillanova commented 1 year ago

@lhoestq I was not asking you to do it, but asking if you agree me to do it... :man_facepalming: As I self-assigned this issue... :sweat_smile: