huggingface / hub-docs

Docs of the Hugging Face Hub
http://hf.co/docs/hub
Apache License 2.0

Issue about languages tags #193

Closed lbourdois closed 1 year ago

lbourdois commented 2 years ago

Hi,

An issue to discuss a problem raised in an exchange following a Hub PR: https://huggingface.co/datasets/AmazonScience/massive/discussions/1 (cc @julien-c).

The author of the dataset (the logic described below also applies to models) filled in language tags of the form "xx-XX" (e.g. fr-FR). However, the filters on the Datasets page (https://huggingface.co/datasets) only allow filtering by language tags of the form "xx" (e.g. "fr"). This meant that their dataset was not findable via the filters before the Hub PR added language tags of the form "xx".

This is something that happens quite frequently. A far from exhaustive sample:

• 59 "en-US" datasets not counted under "en" (https://huggingface.co/datasets?languages=languages:en-US)
• 24 "zh-CN" datasets not counted under "zh" (https://huggingface.co/datasets?languages=languages:zh-CN)
• 20 "sv-SE" datasets not counted under "sv" (https://huggingface.co/datasets?languages=languages:sv-SE)
• 15 "rm-sursilv" datasets not counted under "rm" (https://huggingface.co/datasets?languages=languages:rm-sursilv)
• 15 "ga-IE" datasets not counted under "ga" (https://huggingface.co/datasets?languages=languages:ga-IE)
• 13 "fy-NL" datasets not counted under "fy" (https://huggingface.co/datasets?languages=languages:fy-NL)
• 10 "pa-IN" datasets not counted under "pa" (https://huggingface.co/datasets?languages=languages:pa-IN)
• 5 "de-DE" datasets not counted under "de" (https://huggingface.co/datasets?languages=languages:de-DE)

Across these 8 languages alone, 161 datasets are not findable via the language filter (slightly fewer in practice, since some multilingual datasets overlap).

So I think it is important to have a way to find these datasets (or models). I don't know how the filter system is implemented, but it should be feasible to match on the prefix (e.g. via a regex): if a language tag starts with "en-", the dataset (or model) would also be counted and returned along with the "en"-tagged ones.
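As a rough illustration of the prefix matching suggested above, here is a minimal sketch (the function names and data are hypothetical, not the Hub's actual implementation): a tag like "en-US" or "rm-sursilv" is reduced to its base language before being compared with the filter query.

```python
import re

# Matches a 2- or 3-letter base language, optionally followed by a
# region/variant suffix such as "-US" or "-sursilv".
TAG_RE = re.compile(r"^(?P<base>[a-z]{2,3})(?:[-_].+)?$", re.IGNORECASE)

def base_language(tag: str) -> str:
    """Return the base language of a tag, e.g. 'en-US' -> 'en'."""
    match = TAG_RE.match(tag)
    return match.group("base").lower() if match else tag.lower()

def matches_filter(dataset_tags: list[str], query: str) -> bool:
    """True if any of the dataset's language tags reduces to the query."""
    return any(base_language(tag) == query for tag in dataset_tags)

print(matches_filter(["en-US"], "en"))       # True
print(matches_filter(["rm-sursilv"], "rm"))  # True
print(matches_filter(["fr-FR"], "en"))       # False
```

With such a normalization step, the 161 datasets listed above would be returned by the plain "xx" filters without their metadata having to change.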

This would avoid having to submit Hub PRs to datasets/models that already have a language tag but are currently non-compliant, and instead let us focus on datasets/models that have no language tag at all.

A small subtlety would also be to convert language tags in ISO 639-2 or ISO 639-3 to ISO 639-1, which is the convention currently used by HF. This would make it possible, for example, to find the datasets at https://huggingface.co/datasets?languages=languages:fra, which are tagged "fra" rather than "fr". Alternatively, if one considers that ISO 639-3 was created precisely because ISO 639-1 was deficient, the reverse conversion could be done instead, from ISO 639-1 to ISO 639-2 or 639-3 (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes or https://iso639-3.sil.org/code_tables/639/data).
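The conversion itself is just a table lookup. A minimal sketch, assuming a hand-written excerpt of the mapping (a real implementation would load the full table from the SIL download linked above):

```python
# Tiny illustrative excerpt of the ISO 639-2/3 -> ISO 639-1 mapping.
ISO_639_3_TO_1 = {
    "fra": "fr",
    "eng": "en",
    "deu": "de",
    "zho": "zh",
}

def normalize(tag: str) -> str:
    """Map a three-letter code to its two-letter equivalent if one exists;
    codes with no ISO 639-1 equivalent are kept as-is."""
    return ISO_639_3_TO_1.get(tag.lower(), tag.lower())

print(normalize("fra"))  # fr
print(normalize("gsw"))  # gsw (no ISO 639-1 equivalent)
```

The reverse conversion (639-1 to 639-3) would simply invert the same table.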

I don't know if you have an opinion on the subject.

Have a nice day,

julien-c commented 2 years ago

[...] to ISO 639-1 which is the convention currently used by HF

This is not strictly true, if I remember correctly. See the original code at https://github.com/huggingface/widgets-server/blob/master/Language.ts

If I remember correctly, this list is a mix of ALL the ISO 639-1 codes AND a subset of the ISO 639-2/639-3 codes that were used by the original Helsinki-NLP models.

lbourdois commented 2 years ago

The AI Coffee Break with Letitia video (https://www.youtube.com/watch?v=1gHUiNLYa20) made me realise that I had missed the Bapna & Caswell et al. paper from Google, "Building Machine Translation Systems For The Next Thousand Languages" (https://arxiv.org/abs/2205.03983). It lists, on pages 57 to 77, over 1,000 languages using BCP-47 codes. Wouldn't it be interesting to use this standard in addition to the ISO 639-x codes, to reference languages not covered by that format?

julien-c commented 2 years ago

Hi @lbourdois!

Quick notes:

BTW, I was not super precise in my last comment, but our intention is that the language code recommendation on the Hub would be to use ISO 639-1 if it exists, and fall back to ISO 639-2 or ISO 639-3 if it doesn't (cc @osanseviero)

lbourdois commented 2 years ago

Hi @julien-c

What you say makes complete sense, and your last sentence describes what I think is the most coherent and simple method.

Nevertheless, is it planned to have a place to submit a code for a language not covered by the ISO 639-X format? To take French-speaking examples, Gallo (https://en.wikipedia.org/wiki/Gallo_language), which is spoken in Brittany, does not have an ISO 639-X code if I refer to this list: https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab. The same goes for Béarnese (https://en.wikipedia.org/wiki/B%C3%A9arnese_dialect) or Norman (https://fr.wikipedia.org/wiki/Normand), and more generally for many regional languages of France with fewer than 200,000 speakers. These are languages for which we could have data on the Hub through the work of French researchers, notably from the CNRS (https://atlas.limsi.fr/). The same logic applies in Canada with, for example, Acadian French (https://en.wikipedia.org/wiki/Acadian_French).

The methodology would then be: Please use the ISO 639-1 code if it exists, and fall back to ISO 639-2 or ISO 639-3 if it doesn't. If your language is not available in the ISO 639-X codes, please check that it is not available in the following list [link to a list of codes that users would have previously proposed and that the Hugging Face team would have validated in order to have control over the user interface] and if not, you can propose a code for the language here [link to the place where one could submit a code].
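The fallback chain above could be sketched as follows. The sets here are tiny illustrative excerpts (and `COMMUNITY_CODES`, including "gallo", is purely hypothetical), not actual validated code tables:

```python
# Illustrative excerpts only; real tables would be loaded from the
# ISO registries and from a Hugging Face-maintained community list.
ISO_639_1 = {"fr", "en", "de"}
ISO_639_3 = {"fra", "eng", "deu", "gsw"}
COMMUNITY_CODES = {"gallo"}  # hypothetical validated community proposals

def validate_tag(tag: str) -> str:
    """Classify a language tag according to the proposed methodology."""
    tag = tag.lower()
    if tag in ISO_639_1:
        return "ok: ISO 639-1"
    if tag in ISO_639_3:
        return "ok: ISO 639-2/3 fallback"
    if tag in COMMUNITY_CODES:
        return "ok: community list"
    return "unknown: please propose the code for review"

print(validate_tag("fr"))     # ok: ISO 639-1
print(validate_tag("gsw"))    # ok: ISO 639-2/3 fallback
print(validate_tag("gallo"))  # ok: community list
```

The last branch is where the submission form proposed above would come in.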

The final method chosen should, I think, be documented somewhere visible such as the https://huggingface.co/languages page, so that researchers know which format to use when adding their model or dataset.

osanseviero commented 2 years ago

cc @lhoestq for visibility.

please check that it is not available in the following list [link to a list of codes that users would have previously proposed and that the Hugging Face team would have validated in order to have control over the user interface]

In an ideal world I think this would be great, but from my experience most people will not search within an existing list :smile:. In any case, I think we could have an open discussion in a repo in which anyone could request that we add new languages, and sure, we could keep https://hf.co/languages updated with it. Note we should probably do a nice revamp of https://hf.co/languages :) WDYT @julien-c?

lbourdois commented 2 years ago

In an ideal world I think this would be great, but from my experience most people will not search within an existing list 😄.

[screenshot]

Note we should probably do a nice revamp of https://hf.co/languages :)

Concerning this point @osanseviero, I had opened another issue (https://github.com/huggingface/hub-docs/issues/194) at the same time as this one to put forward this idea (I separated the two to keep the backend on one side and the frontend, which could be implemented on a different timeframe, on the other). There I addressed rather visual points. Everything related to adding documentation could be dealt with in https://github.com/huggingface/hub-docs/issues/194.

julien-c commented 2 years ago

Note that after yet another round of discussion, we think we are also going to support an optional language_bcp47, which will be an array of strings containing BCP 47 codes 🤯

They won't be exposed for filtering (at least in the short term), but at least we can now support full BCP 47 codes in the metadata if model/dataset authors have them (hat tip @lhoestq @yjernite)

lbourdois commented 2 years ago

Hi @julien-c. When trying to add the languages to the https://huggingface.co/facebook/wav2vec2-xls-r-2b model (the same logic applies to the 1B and 300M models), I got the following warning: [screenshot]

What is the procedure in this case? Is it OK to indicate, for example, zh-HK as zh-hk?

The ISO codes I entered are those provided by the authors in their publication (https://arxiv.org/abs/2111.09296) on page 4.

julien-c commented 2 years ago

zh-HK is not a valid ISO 639-1, -2, or -3 code, but you can list it in language_bcp47:

```yaml
language_bcp47:
- zh-HK
```
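For reference, the two fields can coexist in the same model card front matter. A sketch (the codes are illustrative; zh-HK is the case discussed above):

```yaml
---
language:
- zh          # ISO 639-1/-2/-3 codes, used for Hub filtering
language_bcp47:
- zh-HK       # full BCP 47 codes, metadata only
---
```

This way the model remains findable under the "zh" filter while still recording the precise variant.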
lbourdois commented 2 years ago

I did this and it worked fine with respect to the YAML metadata error. However, it seems that the languages indicated in language_bcp47 are not taken into account by the Hub. Is this normal because it is not yet implemented, or is it an anomaly?

Two observations lead me to believe this:

1) For the wav2vec2-xls-r-2b model, only 126 languages are shown by the Hub where we should have 129 (the 128 languages of the model + "multilingual"). I indicated 3 languages in language_bcp47, and I suppose those are the 3 missing ones.
2) For the massive dataset, no languages are shown by the Hub even though they are filled in under language_bcp47 in the README file.

julien-c commented 2 years ago

Yes, for Hub filtering, only language (i.e. ISO 639-1, -2, or -3 codes) is supported.

We could display the language_bcp47 codes on the page, but using them for filtering would somewhat defeat the standardization goal, because models tagged en-UK would then not appear when filtering for en, etc.

lbourdois commented 1 year ago

Passing by here, I am taking the opportunity to close this issue. The discussions are more advanced in the one mentioned just above.