albertvillanova closed 1 year ago
I removed all the invalid task_ids in datasets without namespace, based on the (internal) types.ts
(types.ts is not internal; it's public.)
I have opened PRs to fix the task_ids in all datasets within a namespace as well.
Working on task_categories...
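For illustration, the cleanup described above can be sketched as a simple filter against a list of valid ids. This is a minimal sketch, not the script that was actually used: `VALID_TASK_IDS` is a tiny hypothetical stand-in for the list defined in types.ts, and `split_task_ids` is a hypothetical helper name.

```python
# Assumption: a small placeholder set; the real list of valid ids lives in types.ts.
VALID_TASK_IDS = {"news-articles-summarization", "extractive-qa"}


def split_task_ids(task_ids, valid_ids=VALID_TASK_IDS):
    """Split task_ids into (kept, removed), preserving the original order."""
    kept = [task_id for task_id in task_ids if task_id in valid_ids]
    removed = [task_id for task_id in task_ids if task_id not in valid_ids]
    return kept, removed


kept, removed = split_task_ids(
    ["news-articles-summarization", "summarization-other-paper-abstract-generation"]
)
print("kept:", kept)
print("removed:", removed)
```

Returning the removed ids alongside the kept ones makes it easy to list them in the PR description, so repo authors can see exactly what changed.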
For future reference: this fix had some complications.
When trying to open a PR to fix the task tags, an exception was thrown if:
- a language value was not valid: en-US (instead of en), no (instead of 'no'), ...
- task_categories or task_ids was not an array (a dict for each config)

Errors:
ValueError: - Error: "languages" is deprecated. Use "language" instead.
ValueError: - Error: "licenses" is deprecated. Use "license" instead.
ValueError: - Error: "language[17]" must only contain lowercase characters
ValueError: - Error: "language[0]" with value "cz, de, it" is not valid. It must be an ISO 639-1, 639-2 or 639-3 code (two/three letters), or a special value like "code", "multilingual". If you want to use BCP-47 identifiers, you can specify them in language_bcp47.
ValueError: - Error: "task_ids" must be an array
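To make these failure modes concrete, here is a minimal sketch of a pre-check that could be run on a dataset card's metadata before opening a PR. It is not the Hub's actual validator: `check_metadata` is a hypothetical helper, the special-value set is a small illustrative subset, and the regular expression only checks the two/three-letter shape of a code, not whether it is a registered ISO 639 code. The "must be a string" message is my own addition, covering the case where an unquoted `no` is parsed as a boolean by a YAML 1.1 parser.

```python
import re

# Assumption: deprecated key -> replacement, taken from the two errors quoted above.
DEPRECATED_KEYS = {"languages": "language", "licenses": "license"}
# Assumption: illustrative subset of special values, not the complete list.
SPECIAL_LANGUAGE_VALUES = {"code", "multilingual"}
# Shape check only: two or three lowercase letters.
LANGUAGE_CODE = re.compile(r"[a-z]{2,3}")


def check_metadata(metadata):
    """Return a list of error messages for a metadata dict (hypothetical helper)."""
    errors = []
    for old, new in DEPRECATED_KEYS.items():
        if old in metadata:
            errors.append(f'"{old}" is deprecated. Use "{new}" instead.')
    for key in ("task_categories", "task_ids"):
        # A dict keyed by config name is rejected: a flat list is expected.
        if key in metadata and not isinstance(metadata[key], list):
            errors.append(f'"{key}" must be an array')
    languages = metadata.get("language", [])
    if not isinstance(languages, list):
        languages = [languages]
    for i, value in enumerate(languages):
        if not isinstance(value, str):
            # e.g. an unquoted `no` that a YAML 1.1 parser turned into False
            errors.append(f'"language[{i}]" must be a string')
        elif value != value.lower():
            errors.append(f'"language[{i}]" must only contain lowercase characters')
        elif value not in SPECIAL_LANGUAGE_VALUES and not LANGUAGE_CODE.fullmatch(value):
            errors.append(f'"language[{i}]" with value "{value}" is not valid')
    return errors


example = {
    "languages": ["en"],
    "language": ["en-US", "cz, de, it", False],
    "task_ids": {"default": ["extractive-qa"]},
}
for message in check_metadata(example):
    print("Error:", message)
```

Collecting every problem in one list, rather than raising on the first one, lets a single pass report everything that needs fixing in a repo.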
All Hub datasets are done.
Great job! Did you get feedback from Hub users, i.e. repo authors?
Yes, @julien-c. Here is some of the feedback:
NOTE: I'm editing this comment to add more feedback
As someone with feedback on the updates (which I highly appreciate seeing included here :D), a few comments from a "user perspective":
[...] summarization-other-paper-abstract-generation, but also ones that should be task_categories, such as summarization). I'm assuming this will be fixed soon, but until then it can confuse people who don't understand why they suddenly can't use seemingly still valid tags anymore.
[...] task_categories and task_ids) would be super helpful. However, I think it would have been sufficient to just include some description in the dataset PRs where you can link to the GitHub/other discussion on the topic :) That way, I can check myself what changes are expected to happen.
Thanks again for the streamlining process, I personally learned a fair bit about the tagging structure in the meantime! Best, Dennis
Thanks to you both for your feedback! super useful! cc'ing @osanseviero too 🙂
The datasets explorer still shows tags that are no longer valid
Wait, which explorer is that? Is it https://huggingface.co/datasets/viewer/ ?
Sorry, this one: https://huggingface.co/datasets
And then selecting the "Fine-Grained Tasks".
good feedback! we'll improve this
Super useful feedback, thanks a lot!
@albertvillanova Thank you for sharing our voice here!
Yes, we want symbolic-regression to be listed as a task. This task has been attracting attention from the machine learning/deep learning community, and unfortunately existing symbolic regression datasets are scattered across the community (hosted on individual platforms such as authors' websites, GitHub, etc.).
It would be great for the community if Hugging Face could support the task.
Describe
Once we have agreed on a common naming for task tags for all open source projects, we should align on them.
Steps