huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Language code search does direct matches #4304

Open leondz opened 2 years ago

leondz commented 2 years ago

Describe the bug

Hi. Searching for BCP 47 tags that are just the language prefix (e.g. sq or da) excludes datasets that have added extra information to their language metadata (e.g. sq-AL or da-bornholm). The example codes given in the tagging app encourage adding these extra subtags ("expected format is BCP47 tags separated for ';' e.g. 'en-US;fr-FR'"), but doing so hides those datasets in dataset search.

Steps to reproduce the bug

  1. Add a dataset using a variant tag (e.g. sq-AL)
  2. Search for datasets using the full code (e.g. sq-AL) and note that the dataset is listed
  3. Search using just the language code (e.g. sq) and note that the dataset is missing

Some datasets are already affected by this - e.g. AmazonScience/massive is listed under sq-AL but not sq.

One workaround is for dataset creators to add an additional root language tag to dataset YAML metadata, but it's unclear how to communicate this. It might be possible to index the search on languagecode.split('-')[0] but I wanted to float this issue before trying to write any code :)
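To make the split('-')[0] idea concrete, here is a minimal sketch of indexing each dataset under both its full tag and its root language. It is only an illustration: the names index_keys and build_index, the lowercasing step, and the in-memory dict are assumptions, not how the Hub search is actually implemented, and the second dataset name is made up.

```python
from collections import defaultdict


def index_keys(tag):
    """Return the keys one language tag should be indexed under."""
    tag = tag.strip().lower()
    root = tag.split("-")[0]  # e.g. "sq-al" -> "sq"
    return {tag, root}


def build_index(datasets):
    """Map every index key to the set of dataset names that carry it."""
    index = defaultdict(set)
    for name, tags in datasets.items():
        for tag in tags:
            for key in index_keys(tag):
                index[key].add(name)
    return dict(index)


datasets = {
    "AmazonScience/massive": ["sq-AL"],
    "example/danish-bornholm": ["da-bornholm"],  # hypothetical dataset name
}
index = build_index(datasets)
```

With an index like this, a lookup for sq and a lookup for sq-al would both find AmazonScience/massive, without dataset creators having to add the root tag by hand.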

Expected results

Datasets using longer bcp47 tags also appear under searches for just the language code; e.g. Quebecois datasets (fr-CA) would come up when looking for French datasets with no region specification (fr), or US English (en-US) datasets would come up when searching for English datasets (en).

Actual results

The language codes seem to be directly string matched, excluding datasets with specific language tags from non-specific searches.
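The difference between the observed and the expected behaviour can be sketched as two matching functions. This is a guess at the behaviour as seen from the outside, not the web app's code; exact_match and prefix_match are hypothetical names. The prefix version compares whole subtags, similar in spirit to the basic filtering scheme in RFC 4647, so that a search for en does not pick up an unrelated tag such as enm.

```python
def exact_match(query, tag):
    """Direct string comparison: what the search appears to do today."""
    return query.lower() == tag.lower()


def prefix_match(query, tag):
    """True if the query's subtags are a leading run of the tag's subtags."""
    q = query.lower().split("-")
    t = tag.lower().split("-")
    return t[:len(q)] == q


# Searching for the bare language code against a regional tag:
print(exact_match("sq", "sq-AL"))   # the dataset is excluded
print(prefix_match("sq", "sq-AL"))  # the dataset is included
```

Matching on subtag boundaries rather than raw string prefixes keeps the broader search from returning different languages whose codes merely start with the same letters.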

Environment info

(web app)

lhoestq commented 2 years ago

Thanks for reporting! I forwarded the issue to the front-end team :)

Will keep you posted!

I also changed the tagging app to suggest two-letter codes for now.