huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Align task tags in dataset metadata #5137

Closed albertvillanova closed 1 year ago

albertvillanova commented 1 year ago

Describe

Once we have agreed on common names for task tags across all open source projects, we should align the dataset metadata with them.

Steps

lhoestq commented 1 year ago

I removed all the invalid task_ids in datasets without a namespace, based on the (internal) types.ts
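
A minimal sketch of this kind of cleanup, assuming the valid task IDs are available as a Python set. The `VALID_TASK_IDS` values and the `drop_invalid_task_ids` helper are hypothetical placeholders for illustration; they are not the actual contents of types.ts or the script that was used:

```python
# Hypothetical allow-list for illustration; the real list lives in types.ts.
VALID_TASK_IDS = {"extractive-qa", "sentiment-classification"}


def drop_invalid_task_ids(metadata: dict, valid_ids: set) -> dict:
    """Return a copy of the metadata keeping only task_ids found in valid_ids."""
    cleaned = dict(metadata)  # shallow copy so the input is left untouched
    task_ids = metadata.get("task_ids", [])
    cleaned["task_ids"] = [t for t in task_ids if t in valid_ids]
    return cleaned


card = {"task_ids": ["extractive-qa", "not-a-real-task"]}
print(drop_invalid_task_ids(card, VALID_TASK_IDS))
```

The helper returns a new dict rather than mutating its input, which makes it easy to diff the old and new metadata before opening a PR.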

julien-c commented 1 year ago

(types.ts is not internal, it's public)

albertvillanova commented 1 year ago

I have opened PRs to fix the task_ids in all datasets within a namespace as well.

Working on task_categories...

albertvillanova commented 1 year ago

For future reference: this fix had some complications

When trying to open a PR to fix the task tags, an exception was thrown if:

Errors:

ValueError: - Error: "languages" is deprecated. Use "language" instead.
ValueError: - Error: "licenses" is deprecated. Use "license" instead.
ValueError: - Error: "language[17]" must only contain lowercase characters
ValueError: - Error: "language[0]" with value "cz, de, it" is not valid. It must be an ISO 639-1, 639-2 or 639-3 code (two/three letters), or a special value like "code", "multilingual". If you want to use BCP-47 identifiers, you can specify them in language_bcp47.
ValueError: - Error: "task_ids" must be an array
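
For illustration, the checks behind these messages can be approximated locally before opening a PR. This is a rough sketch only: `check_metadata`, the constants, and the regular expression are assumptions reconstructed from the error texts above, not the Hub's actual validation code, and the language check only tests the shape of a code, not whether it is a registered ISO 639 code.

```python
import re

# Deprecated key -> replacement, taken from the error messages above.
DEPRECATED_KEYS = {"languages": "language", "licenses": "license"}
# Special values named in the error message; the real list may be longer.
SPECIAL_LANGUAGES = {"code", "multilingual"}
# Two or three lowercase letters: a shape check only (assumption).
LANGUAGE_CODE = re.compile(r"[a-z]{2,3}")


def check_metadata(metadata: dict) -> list:
    """Collect error strings resembling the ones above (hypothetical helper)."""
    errors = []
    for old, new in DEPRECATED_KEYS.items():
        if old in metadata:
            errors.append(f'"{old}" is deprecated. Use "{new}" instead.')
    languages = metadata.get("language", [])
    if isinstance(languages, str):
        languages = [languages]  # tolerate a single string value
    for i, lang in enumerate(languages):
        if lang != lang.lower():
            errors.append(f'"language[{i}]" must only contain lowercase characters')
        elif lang not in SPECIAL_LANGUAGES and not LANGUAGE_CODE.fullmatch(lang):
            errors.append(f'"language[{i}]" with value "{lang}" is not valid.')
    if "task_ids" in metadata and not isinstance(metadata["task_ids"], list):
        errors.append('"task_ids" must be an array')
    return errors


bad_card = {
    "licenses": ["mit"],
    "language": ["cz, de, it", "EN"],
    "task_ids": "extractive-qa",
}
for message in check_metadata(bad_card):
    print(message)
```

Running such a check first shows which pre-existing metadata problems would block a PR that only intends to fix the task tags.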

albertvillanova commented 1 year ago

All Hub datasets are done.

julien-c commented 1 year ago

Great job! Did you get feedback from Hub users, i.e. repo authors?

albertvillanova commented 1 year ago

Yes, @julien-c. Here is some of the feedback:

NOTE: I'm editing this comment to add more feedback

dennlinger commented 1 year ago

As someone with feedback on the updates (which I highly appreciate seeing included here :D), a few comments from a "user perspective":

Thanks again for the streamlining process, I personally learned a fair bit about the tagging structure in the meantime! Best, Dennis

julien-c commented 1 year ago

Thanks to you both for your feedback! super useful! cc'ing @osanseviero too 🙂

"The datasets explorer still shows tags that are no longer valid"

wait which explorer is that? is it https://huggingface.co/datasets/viewer/ ?

dennlinger commented 1 year ago

Sorry, this one: https://huggingface.co/datasets
And then selecting the "Fine-Grained Tasks".

julien-c commented 1 year ago

good feedback! we'll improve this

osanseviero commented 1 year ago

Super useful feedback, thanks a lot!

albertvillanova commented 1 year ago

yoshitomo-matsubara commented 1 year ago

@albertvillanova Thank you for sharing our voice here!

Yes, we want symbolic-regression to be listed as a task. This task has been attracting attention from the machine learning/deep learning community, and unfortunately existing symbolic regression datasets are decentralized across the community (hosted on individual platforms such as authors' websites, GitHub, etc.). It would be great for the community if Hugging Face could support the task.