Language code search only matches tags exactly

Describe the bug

Hi. Searching for BCP 47 tags that are just the language prefix (e.g. sq or da) excludes datasets that have added extra information to their language metadata (e.g. sq-AL or da-bornholm). The example codes given in the tagging app encourage adding these extra subtags ("expected format is BCP47 tags separated for ';' e.g. 'en-US;fr-FR'"), but doing so hides those datasets in the datasets search.

Steps to reproduce the bug

Add a dataset using a longer tag (e.g. sq-AL)
Search for datasets using the full code (sq-AL) and note that the dataset is listed
Search using just the language code (sq) and note that the dataset is missing

Some datasets are already affected by this - e.g. AmazonScience/massive is listed under sq-AL but not sq.

One workaround is for dataset creators to add the root language tag alongside the longer one in the dataset's YAML metadata, but it's unclear how to communicate this. It might be possible to index the search on languagecode.split('-')[0] but I wanted to float this issue before trying to write any code :)
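As a rough sketch of the indexing idea (not the actual search code, which I haven't looked at), each tag could be indexed under both its full form and its root language. The function name `expand_language_tags` is made up for illustration:

```python
def expand_language_tags(tags):
    """Return the given tags plus the root language subtag of each one.

    Hypothetical helper: illustrates indexing on tag.split('-')[0]
    in addition to the full tag, so both searches can find the dataset.
    """
    expanded = set()
    for tag in tags:
        tag = tag.strip()
        if not tag:
            continue  # skip empty entries, e.g. from a trailing ';'
        expanded.add(tag)
        # The first subtag is the language itself, e.g. 'sq' in 'sq-AL'.
        expanded.add(tag.split("-")[0].lower())
    return expanded


index_keys = expand_language_tags("sq-AL;da-bornholm".split(";"))
```

With this, a dataset tagged sq-AL would be indexed under both sq-AL and sq, without the dataset creator having to add the root tag by hand.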

Expected results

Datasets using longer bcp47 tags also appear under searches for just the language code; e.g. Quebecois datasets (fr-CA) would come up when looking for French datasets with no region specification (fr), or US English (en-US) datasets would come up when searching for English datasets (en).
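The behaviour I have in mind is roughly what RFC 4647 calls basic filtering: a query matches a tag if the two are equal, or if the query is a prefix of the tag that ends on a subtag boundary. A minimal sketch, assuming matching is done at query time (`matches_language` is a hypothetical name, not an existing function in the codebase):

```python
def matches_language(query, tag):
    """Check whether a dataset's language tag falls under a search query.

    Case-insensitive. The query must equal the tag or be followed by '-'
    in the tag, so 'fr' matches 'fr-CA' but not an unrelated code
    such as 'fra'.
    """
    query = query.lower()
    tag = tag.lower()
    return tag == query or tag.startswith(query + "-")


tags = ["fr-CA", "en-US", "fr", "sq-AL"]
french = [t for t in tags if matches_language("fr", t)]
```

Checking for the '-' boundary instead of a plain startswith keeps a two-letter query from picking up longer codes that merely begin with the same letters. A more specific query such as fr-CA still only returns fr-CA datasets.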

Actual results

The language codes seem to be directly string matched, excluding datasets with specific language tags from non-specific searches.

Environment info

(web app)
