Describe the bug
Hi. Searching for bcp47 tags that are just the language prefix (e.g. sq or da) excludes datasets that have added extra information in their language metadata (e.g. sq-AL or da-bornholm). The example codes given in the tagging app encourage adding the extra subtags ("expected format is BCP47 tags separated for ';' e.g. 'en-US;fr-FR'"), but following that advice hides those datasets from dataset search.
Steps to reproduce the bug
Find datasets listed under a longer tag (e.g. sq-AL)
Note that they're missing when just the language is searched for (e.g. sq)
Some datasets are already affected by this, e.g. AmazonScience/massive is listed under sq-AL but not sq.
One workaround is for dataset creators to add an additional root language tag to the dataset YAML metadata, but it's unclear how to communicate this. It might be possible to index the search on languagecode.split('-')[0], but I wanted to float this issue before trying to write any code :)
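To make the indexing idea concrete, here is a minimal sketch of what it could look like. It assumes the search index is a simple mapping from tag to dataset ids; the function name `build_language_index` and the `example/...` dataset entries are hypothetical and are not taken from the actual Hub code or data.

```python
from collections import defaultdict


def build_language_index(datasets):
    """Map every language tag, and its root language, to dataset ids.

    `datasets` is assumed to be a dict of dataset id -> list of tags.
    """
    index = defaultdict(set)
    for dataset_id, tags in datasets.items():
        for tag in tags:
            index[tag].add(dataset_id)                # full tag, e.g. "sq-AL"
            index[tag.split("-")[0]].add(dataset_id)  # root language, e.g. "sq"
    return index


# Hypothetical metadata for illustration only.
datasets = {
    "AmazonScience/massive": ["sq-AL", "fr-FR"],
    "example/danish-dialects": ["da-bornholm"],
    "example/albanian-news": ["sq"],
}

index = build_language_index(datasets)
print(sorted(index["sq"]))
# ['AmazonScience/massive', 'example/albanian-news']
```

Indexing under both keys keeps searches for the full tag working as they do today, while a search for the bare language code also picks up the longer tags.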
Expected results
Datasets using longer bcp47 tags also appear under searches for just the language code; e.g. Quebecois datasets (fr-CA) would come up when looking for French datasets with no region specification (fr), or US English (en-US) datasets would come up when searching for English datasets (en).
Actual results
The language codes seem to be directly string matched, excluding datasets with specific language tags from non-specific searches.
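The difference between the two behaviours can be sketched at query time as well. The functions below are hypothetical and only illustrate the comparison; I have not looked at how the web app actually matches tags. The prefix variant follows the "basic filtering" idea from RFC 4647: a query matches a tag if the two are equal, or if the tag starts with the query followed by a hyphen, compared case-insensitively.

```python
def exact_match(query, tag):
    """Direct string comparison, which is what the search appears to do."""
    return query == tag


def prefix_match(query, tag):
    """Match on whole subtags, so "sq" matches "sq-AL" but "en" does not match "eng"."""
    query, tag = query.lower(), tag.lower()
    return tag == query or tag.startswith(query + "-")


tags = ["sq-AL", "sq", "en-US", "fr-CA"]

print([t for t in tags if exact_match("sq", t)])
# ['sq']
print([t for t in tags if prefix_match("sq", t)])
# ['sq-AL', 'sq']
```

Matching on the hyphen boundary rather than on a plain string prefix avoids false positives between unrelated codes that happen to share leading letters.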
Environment info
(web app)