Align task tags in dataset metadata

huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

https://huggingface.co/docs/datasets

Apache License 2.0

Align task tags in dataset metadata #5137

Closed albertvillanova closed 1 year ago

albertvillanova commented 1 year ago

Describe

Once we have agreed on a common naming for task tags for all open source projects, we should align on them.

Steps

[x] Align task tags in canonical datasets
- [x] task_categories: 4 datasets
- [x] task_ids (by @lhoestq)
[x] Open PRs in community datasets
- [x] task_categories: 451 datasets
- [x] task_ids: 556 datasets

lhoestq commented 1 year ago

I removed all the invalid task_ids in datasets without a namespace, based on the ~~(internal)~~ types.ts

julien-c commented 1 year ago

(Types.ts is not internal, it's public)

albertvillanova commented 1 year ago

I have opened PRs to fix the task_ids in all datasets within a namespace as well.

Working on task_categories...

albertvillanova commented 1 year ago

For future reference: this fix had some complications

When trying to open a PR to fix the task tags, an exception was thrown if:

the metadata contained "languages" or "licenses" (instead of "language" or "license")
the metadata contained an invalid language: en-US (instead of en), unquoted no (instead of 'no'; YAML 1.1 parsers read the unquoted form as a boolean), ...
the metadata contained an invalid license
either task_categories or task_ids was not an array (but a dict with an entry for each config)
the metadata contained invalid tag names

Errors:

ValueError: - Error: "languages" is deprecated. Use "language" instead.

ValueError: - Error: "licenses" is deprecated. Use "license" instead.

ValueError: - Error: "language[17]" must only contain lowercase characters

ValueError: - Error: "language[0]" with value "cz, de, it" is not valid. It must be an ISO 639-1, 639-2 or 639-3 code (two/three letters), or a special value like "code", "multilingual". If you want to use BCP-47 identifiers, you can specify them in language_bcp47.

ValueError: - Error: "task_ids" must be an array
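
The failure cases listed above could be caught before opening a PR with a small pre-check. What follows is only a minimal sketch under stated assumptions: `normalize_metadata` is a hypothetical helper (it is not part of `datasets` or `huggingface_hub`), the metadata is assumed to be already loaded into a plain dict, and merging per-config dicts into one list is an assumed repair strategy, not necessarily what was done for this issue.

```python
import re

# Hypothetical helper for illustration; not part of `datasets` or `huggingface_hub`.
DEPRECATED_KEYS = {"languages": "language", "licenses": "license"}
SPECIAL_LANGUAGES = {"code", "multilingual"}
LANGUAGE_CODE = re.compile(r"^[a-z]{2,3}$")


def normalize_metadata(metadata):
    """Return (cleaned_copy, problems) for a dataset-card metadata dict."""
    cleaned = dict(metadata)
    problems = []

    # "languages"/"licenses" are deprecated in favour of the singular keys.
    for old, new in DEPRECATED_KEYS.items():
        if old in cleaned:
            value = cleaned.pop(old)
            cleaned.setdefault(new, value)

    if "language" in cleaned:
        languages = cleaned["language"]
        if not isinstance(languages, list):
            languages = [languages]
        fixed = []
        for index, language in enumerate(languages):
            # YAML 1.1 loaders read an unquoted `no` as the boolean False.
            if language is False:
                language = "no"
            language = str(language).lower()
            if language not in SPECIAL_LANGUAGES and not LANGUAGE_CODE.match(language):
                problems.append(f"language[{index}]={language!r} needs a manual fix")
            fixed.append(language)
        cleaned["language"] = fixed

    # Assumption: a dict maps config names to tag lists; merge them in order.
    for key in ("task_categories", "task_ids"):
        value = cleaned.get(key)
        if isinstance(value, dict):
            merged = []
            for tags in value.values():
                for tag in tags:
                    if tag not in merged:
                        merged.append(tag)
            cleaned[key] = merged

    return cleaned, problems
```

Values such as en-US or "cz, de, it" are only reported, not rewritten, since picking the right code (or moving the value to language_bcp47) needs a human decision.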

albertvillanova commented 1 year ago

All Hub datasets are done.

julien-c commented 1 year ago

great job! did you get feedback from Hub users, i.e. repo authors?

albertvillanova commented 1 year ago

Yes, @julien-c. Here is some of the feedback:

Most people just thanked us for the fix: cahya/librivox-indonesia, TurkuNLP/xlsum-fi, coastalcph/fairlex
Why are we changing their task names? joelito/lextreme
- Noted for the next bulk operation: besides the PR title, we should also add a description explaining the reason for the change, and maybe a link to a pertinent GitHub issue
Some of them ask where to find the list of supported task values: dennlinger/klexikon, lmqg/qg_jaquad
- Currently, the list is here: https://github.com/huggingface/hub-docs/blob/main/js/src/lib/interfaces/Types.ts#L85
- Maybe we could make it more easily accessible
Some people do not agree with the current "hierarchy":
- text-scoring: emrecan/nli_tr_for_simcse (but referring to emrecan/nli_tr_for_simcse)
- Before "text-scoring" was a task_category, with task_ids ["semantic-similarity-scoring", "sentiment-scoring"]
- Now all three are task_ids ["text-scoring", "semantic-similarity-scoring", "sentiment-scoring"] under the task_category "text-classification"
- People complain that their scoring tasks are not classification tasks
- binary-classification: why don't we have binary-classification? We have multi-class-classification, multi-label-classification and sentiment-classification, but not binary-classification
- symbolic-regression: yoshitomo-matsubara/srsd-feynman_hard, yoshitomo-matsubara/srsd-feynman_medium, yoshitomo-matsubara/srsd-feynman_easy
- Why don't we have a symbolic-regression task?

NOTE: I'm editing this comment to add more feedback
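
To make the "text-scoring" change described above concrete, here is a rough sketch of how such a re-tagging could be expressed. It is an illustration only: `migrate_task_tags` and the `DEMOTED_CATEGORIES` table are hypothetical names, and the table contains just the single example mentioned in this comment, not the actual mapping used for the bulk PRs.

```python
# Illustrative table: a former task_category and the category it now lives under.
# Only the example from the comment above is included (assumption).
DEMOTED_CATEGORIES = {"text-scoring": "text-classification"}


def _append_unique(items, value):
    """Append value to items unless it is already present."""
    if value not in items:
        items.append(value)


def migrate_task_tags(task_categories, task_ids):
    """Return (new_task_categories, new_task_ids) under the new hierarchy."""
    new_categories = []
    new_ids = []
    for category in task_categories:
        parent = DEMOTED_CATEGORIES.get(category)
        if parent is None:
            # Category is unchanged.
            _append_unique(new_categories, category)
        else:
            # The old category becomes a task_id under its new parent category.
            _append_unique(new_categories, parent)
            _append_unique(new_ids, category)
    for task_id in task_ids:
        _append_unique(new_ids, task_id)
    return new_categories, new_ids
```

With this table, a dataset tagged with task_categories ["text-scoring"] and task_ids ["semantic-similarity-scoring"] ends up with task_categories ["text-classification"] and task_ids ["text-scoring", "semantic-similarity-scoring"].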

dennlinger commented 1 year ago

As someone with feedback on the updates (which I highly appreciate seeing included here :D), a few comments from a "user perspective":

I think the general confusion for me was also surrounding the hierarchy; it doesn't really become super clear (even when using the tagger space) that one is a subset of the other, especially since it seems to be still possible to include fine-grained tasks without the "parent category"?
The datasets explorer still shows tags that are no longer valid (e.g., super specific ones such as summarization-other-paper-abstract-generation, but also ones that should be task_categories, such as summarization). I'm assuming this will be fixed soon, but until then it can confuse people who don't understand why they suddenly can't use seemingly still valid tags anymore.
As I mentioned to @albertvillanova, having a dedicated page in the docs with explanations (especially wrt the difference between task_categories and task_ids) would be super helpful. However, I think it would have been sufficient to just include some description in the dataset PRs where you can link to the GitHub/other discussion on the topic :) That way, I can check myself what changes are expected to happen.

Thanks again for the streamlining process, I personally learned a fair bit about the tagging structure in the meantime! Best, Dennis
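
The point above about fine-grained tasks appearing without their "parent category" could be checked mechanically. The sketch below is a hedged illustration, not an existing Hub feature: `check_task_hierarchy` is a hypothetical function, and `PARENT_CATEGORY` lists only the three task_ids whose parent is stated earlier in this thread; a real check would need the full list from Types.ts.

```python
# Assumed subset of the hierarchy, taken from the examples in this thread.
PARENT_CATEGORY = {
    "text-scoring": "text-classification",
    "semantic-similarity-scoring": "text-classification",
    "sentiment-scoring": "text-classification",
}


def check_task_hierarchy(task_categories, task_ids):
    """Return (missing_parents, unknown_ids) for one dataset's task tags."""
    missing_parents = []
    unknown_ids = []
    for task_id in task_ids:
        parent = PARENT_CATEGORY.get(task_id)
        if parent is None:
            # Not in our (partial) table, so we cannot say where it belongs.
            unknown_ids.append(task_id)
        elif parent not in task_categories and parent not in missing_parents:
            missing_parents.append(parent)
    return missing_parents, unknown_ids
```

A dataset that lists "sentiment-scoring" under task_ids but only "summarization" under task_categories would be reported as missing "text-classification".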

julien-c commented 1 year ago

Thanks to you both for your feedback! super useful! cc'ing @osanseviero too 🙂

The datasets explorer still shows tags that are no longer valid

wait which explorer is that? is it https://huggingface.co/datasets/viewer/ ?

dennlinger commented 1 year ago

Sorry, this one: https://huggingface.co/datasets
And then selecting the "Fine-Grained Tasks".

julien-c commented 1 year ago

good feedback! we'll improve this

osanseviero commented 1 year ago

Super useful feedback, thanks a lot!

albertvillanova commented 1 year ago

Some people do not agree with the current "hierarchy":
- symbolic-regression: yoshitomo-matsubara/srsd-feynman_hard, yoshitomo-matsubara/srsd-feynman_medium, yoshitomo-matsubara/srsd-feynman_easy
- Why don't we have a symbolic-regression task?

yoshitomo-matsubara commented 1 year ago

@albertvillanova Thank you for sharing our voice here!

Yes, we want symbolic-regression to be listed as a task. This task has been attracting attention from the machine learning/deep learning community, and unfortunately existing symbolic regression datasets are scattered across the community (hosted on individual platforms such as author websites, GitHub, etc.). It would be great for the community if Hugging Face could support the task.