Closed albertvillanova closed 1 year ago
@huggingface/datasets if you agree, I can make the bulk edit on the Hub to fix integer keys into strings.
Ok for me, and we can merge (internal) https://github.com/huggingface/moon-landing/pull/4609
FYI there are still 2k+ weekly users on datasets 2.6.1, which doesn't support the string label format for class labels. Among those, some are using datasets with class labels like imdb (60 users), conllpp (40), msra_ner (40), peoples_daily_ner (40), weibo_ner (30), conll2003 (20), etc. Renaming the keys to strings would break these users' code.
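To make the compatibility concern concrete, here is a minimal sketch of a reader that tolerates both key styles. It is not the actual datasets code or the patch discussed in this thread: the helper name names_from_mapping and the sample mappings are hypothetical, and it assumes class ids are contiguous and start at 0.

```python
from typing import List, Mapping, Union


def names_from_mapping(names: Mapping[Union[int, str], str]) -> List[str]:
    """Return label names ordered by class id, accepting int or str keys.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    # Normalize every key to int so that "10" sorts after "2".
    by_id = {int(key): name for key, name in names.items()}
    # Assumption: class ids are contiguous and start at 0.
    if sorted(by_id) != list(range(len(by_id))):
        raise ValueError(f"class ids must be 0..{len(by_id) - 1}, got {sorted(by_id)}")
    return [by_id[i] for i in range(len(by_id))]


old_style = {0: "neg", 1: "pos"}      # integer keys, as parsed from the old YAML
new_style = {"0": "neg", "1": "pos"}  # string keys, as parsed from the new YAML

# Both styles resolve to the same ordered list of names.
old_names = names_from_mapping(old_style)
new_names = names_from_mapping(new_style)
```

A reader that only expects integer keys would fail on new_style, which is the breakage described above for 2.6.1 users.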
But isn't datasets 2.6.1 downloading files from the Hub with the corresponding tag? I thought we had something like this before.
We're using main, as models do. Some datasets need to be updated from time to time, e.g. when a link to download the data is dead.
But yea a year ago we had those tags, we just ended up not using them
I opened https://github.com/huggingface/datasets/issues/5406 to communicate on this. Let me know what you think, and if it sounds good to you I can pin this issue
So, is it OK to make the bulk edit on the Hub now or should we wait longer? If the latter, how long?
I think we can do it. If you want to be extra cautious you can do it for all datasets except imdb and conllpp for now which are actively used by 2.6.1 users. For those two we can keep the YAML like this for some more time, or alternatively use the old dataset_infos.json file
The bulk edit of canonical datasets (except imdb and conllpp) is running.
See e.g.: https://huggingface.co/datasets/acronym_identification/discussions/3
EDITED: Done, except for "universal_morphologies", where I get
HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/validate-yaml
Also not done for the datasets missing metadata "dataset_info":
Thank you !
@lhoestq, there are 6 community datasets with YAML integer keys in their dataset_info class_label:
Maybe we could open a PR on them as well?
Let's do this then:
EDIT: all done :)
@lhoestq I was not asking you to do it, but asking whether you agreed to my doing it... :man_facepalming: Since I self-assigned this issue... :sweat_smile:
After an internal discussion (https://github.com/huggingface/moon-landing/issues/4563):
transformers models id2label: https://huggingface.co/roberta-large-mnli/blob/main/config.json
On the other hand, at datasets we are currently using YAML integer keys for dataset_info class_label.
Please note (thanks @lhoestq for pointing out) that previous versions (2.6 and 2.7) of datasets need to be patched:
TODO:
dataset_info metadata
datasets versions: 2.6 and 2.7
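For illustration, the integer-to-string key replacement could be sketched as below. This is only a sketch under assumptions: the function name stringify_label_keys and the sample metadata are hypothetical, the nesting (features, dtype, class_label, names) is assumed rather than taken from this thread, and this is not the script used for the bulk edit. The JSON round trip at the end shows why a JSON config such as the one linked above can only carry string keys.

```python
import json
from typing import Any


def stringify_label_keys(node: Any) -> Any:
    """Return a copy of `node` with the keys of every `names` mapping as strings.

    Hypothetical helper for illustration only.
    """
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "names" and isinstance(value, dict):
                # Integer class ids become strings: 0 -> "0".
                out[key] = {str(k): v for k, v in value.items()}
            else:
                out[key] = stringify_label_keys(value)
        return out
    if isinstance(node, list):
        return [stringify_label_keys(item) for item in node]
    return node


# Assumed shape of the parsed YAML metadata (sample values are made up).
metadata = {
    "dataset_info": {
        "features": [
            {"name": "text", "dtype": "string"},
            {"name": "label", "dtype": {"class_label": {"names": {0: "neg", 1: "pos"}}}},
        ]
    }
}

fixed = stringify_label_keys(metadata)
fixed_names = fixed["dataset_info"]["features"][1]["dtype"]["class_label"]["names"]

# JSON object keys are always strings, so integer keys are stringified on dump.
roundtrip = json.loads(json.dumps({0: "neg", 1: "pos"}))
```

The function builds a new structure instead of mutating its input, so the original parsed metadata stays available for comparison before anything is written back.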