huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Language names and language codes: connecting to a big database (rather than slow enrichment of custom list) #4881

Open alexis-michaud opened 1 year ago

alexis-michaud commented 1 year ago

The problem: Language diversity is an important dimension of the diversity of datasets. To find one's way around datasets, being able to search by language name and by standardized codes appears crucial.

Currently the list of language codes is here, right? At about 1,500 entries, it covers roughly a quarter of the world's diversity of extant languages. (Probably less than that, as the list of 1,418 contains variants that are linguistically very close: 108 varieties of English, for instance.)

Looking forward to ever increasing coverage, how will the list of language names and language codes improve over time? Enrichment of the custom list by HFT contributors (like here) has several issues:

A solution that seems desirable: Connecting to an established database that (i) aims at full coverage of the world's languages and (ii) has information on higher-level groupings, alternative names, etc. It takes a lot of hard work to do such databases. Two important initiatives are Ethnologue (ISO standard) and Glottolog. Both have pros and cons. Glottolog contains references to Ethnologue identifiers, so adopting Glottolog entails getting the advantages of both sets of language codes.

Both seem technically accessible & 'developer-friendly'. Glottolog has a GitHub repo. For Ethnologue, harvesting tools have been devised (see here; I did not try it out).

In case a conversation with linguists seemed in order here, I'd be happy to participate ('pro bono', of course), & to rustle up more colleagues as needed, to help this useful development happen. With appreciation of HFT,

albertvillanova commented 1 year ago

Thanks for opening this discussion, @alexis-michaud.

As the language validation procedure is shared with other Hugging Face projects, I'm tagging them as well.

CC: @huggingface/moon-landing

julien-c commented 1 year ago

on the Hub side, there is no fine-grained validation: we just check that language: contains an array of lowercase strings between 2 and 3 characters long =)

and for language_bcp47: we just check it's an array of strings.
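
For illustration, here is a minimal Python sketch of the loose check just described (the actual Hub validation is server-side and not shown here; the function name and structure below are made up):

```python
# Illustrative sketch only, not the Hub's actual implementation: `language`
# must be a list of lowercase 2-3 character strings, `language_bcp47` just a
# list of strings.
import re

def looks_valid(metadata: dict) -> bool:
    language = metadata.get("language", [])
    bcp47 = metadata.get("language_bcp47", [])
    language_ok = all(
        isinstance(tag, str) and re.fullmatch(r"[a-z]{2,3}", tag) for tag in language
    )
    bcp47_ok = all(isinstance(tag, str) for tag in bcp47)
    return language_ok and bcp47_ok

print(looks_valid({"language": ["fr", "jya"], "language_bcp47": ["x-japh1234"]}))  # True
print(looks_valid({"language": ["FR-ca"]}))  # False: uppercase and too long
```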

The only page where we have a hardcoded list of languages is https://huggingface.co/languages and I've been thinking of hooking that page up to an external database of languages (so any suggestion is super interesting), but it's not used for validation.

That being said, in datasets this file https://github.com/huggingface/datasets/blob/main/src/datasets/utils/resources/languages.json is not really used no? Or just in the tagging tool? What about just removing it?

also cc'ing @lbourdois who's been active and helpful on those subjects in the past!

julien-c commented 1 year ago

PS @alexis-michaud is there a DB of language codes you would recommend? That would contain all ISO 639-1, 639-2 or 639-3 codes and be kept up to date, and ideally that would be accessible as a Node.js npm package?

cc @albertvillanova too

alexis-michaud commented 1 year ago

PS @alexis-michaud is there a DB of language codes you would recommend? That would contain all ISO 639-1, 639-2 or 639-3 codes and be kept up to date, and ideally that would be accessible as a Node.js npm package?

cc @albertvillanova too

Many thanks for your answer!

The Glottolog database is kept up to date, and has information on the closest ISO code for each Glottocode. So providing a clean table with equivalences sounds (to me) like something perfectly reasonable to expect from their team. To what extent would pyglottolog fit the bill / do the job? (API documentation here) I'm reaching my technical limitations here: I can't assess the distance between what they offer and what the HF team needs. I have opened an Issue in their repo.

Very interested to see where it goes from there.

BenjaminGalliot commented 1 year ago

I just tried pyglottolog to generate a file with all the current IDs (first column).

glottolog languoids inside the glottolog repository.

glottolog-languoids-v4.6-10-g5c66eec874.csv
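
For readers who want to reproduce this, a rough Python equivalent of that `glottolog languoids` export using pyglottolog is sketched below; the repository path and output columns are illustrative, and the attribute names should be checked against the pyglottolog documentation:

```python
# Sketch: dump Glottocode, name and (where present) ISO 639-3 code for every
# languoid, from a local clone of https://github.com/glottolog/glottolog.
# Assumes `pip install pyglottolog`.
import csv
from pyglottolog import Glottolog

glottolog = Glottolog("/path/to/glottolog")  # local clone of the data repository

with open("glottolog-languoids.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["glottocode", "name", "iso639_3"])
    for languoid in glottolog.languoids():
        writer.writerow([languoid.id, languoid.name, languoid.iso or ""])
```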

HughP commented 1 year ago

Greetings @alexis-michaud and others, I think perhaps a standards-based approach here would help everyone out, at both the technical and social layers of technical innovation.

Let me say a few things:

  1. there are multiple kinds of assets in AI that should have associated language codes.
    • AI Training Data sets
    • AI models
    • AI outputs
These are all distinct components which should be tagged for the language and encoding methods they operate on or enhance. For example, an AI-based cross-language tool from French to English (UK variety) still needs to consider if it is operating on oral language speech or written text. This is where IANA language sub-tags come in and are so important. I link to the official source. If one wants to use middleware such as a Python package or npm package to manage strings, then please make sure those packages are updating codes as they are being revised. I see that @julien-c mentioned BCP-47. BCP-47 is the current standard for language tagging. Following it will make the resources you create more findable and let future users better understand or expect any biases which may have been introduced in the different AI-based products.
  2. BCP-47 is a technical read. However, you will notice that it identifies when to use an ISO 639-1, ISO 639-2, or ISO 639-3 code. This is important for interoperability with many systems. If you are using library systems then you should likely just stick with ISO 639-3 codes.
  3. If you are going to use Glottolog codes use them after an -x- tag in the BCP-47 format to maintain BCP-47 validity.
  4. You should source ISO 639-3 codes directly from the ISO 639-3 registrar as these codes are updated annually, usually in February or March. ISO 639-3 codes have multiple classes: Active, Deprecated, and Unassigned. This means that string length checking is not a sufficient strategy for validation.
  5. The names of smaller languages often change depending on the language used to describe them. The ISO 639-2 documentation lists the names of smaller languages in the languages in which descriptions of them are often written. For example, ISO 639-2's documentation contains the names of languages as they are used in French, German, and English. ISO 639-2 is rarely updated, as it is now tied to ISO 639-3's evolution, and modern systems should just use ISO 639-3, but these additional names of languages in other languages may not appear in the ISO 639-3 tables.
  6. Glottolog codes are also updated at least annually. Usually sometime after ISO 639-3 updates.
  7. If the material is in a written mode, please indicate which script is used, unless the IANA record has a Suppress-Script value. Please use the script tag that BCP-47 calls for, from ISO 15924. This also updates at least annually.
  8. Another great place to look for language names is the Unicode CLDR database for locales. These ought to be congruent with ISO 639-3, but sometimes CLDR has additional references to languages (such as the French name for a language) which are not contained in ISO 639-2 or ISO 639-3.
  9. Wikidata for language names is not always a great source of authoritative information. Language names are asymmetrical. Many times they are contrived because there is no actual name for the language in the referring language... e.g. French doesn't have a name for every language in the world; often they say something like "the language of the 'x' people". English does the same. When a language name standard does not have the best name for a language, the best way to handle that is to make a change request with the standards registrar. Keeping track of the source list, and the version of your source list, for your language codes is very important.
  10. Finally, it would be a great service to technologists, minority language communities, and linguists if, for all resources of the three types mentioned in number 1 above, you added a record to OLAC. I can help you with that. OLAC is a search interface for language resources.
lbourdois commented 1 year ago

Hi everybody!

About the point:

also cc'ing @lbourdois who's been active and helpful on those subjects in the past!

Discussions on the need to improve the Hub's tagging system (applying to both datasets and models) can be found in the following discussion: https://github.com/huggingface/hub-docs/issues/193. Once this system has been redone and satisfies the identified needs, a redesign of the Languages page would also be relevant: https://github.com/huggingface/hub-docs/issues/194. I invite you to read them. But as a quick summary, the exchanges were oriented towards the ISO standard (the first HF system was based on it and it is generally the standard indicated in AI/DL papers), favouring ISO 639-1 if it exists, and falling back to ISO 639-2 or ISO 639-3 if it doesn't. In addition, it is possible to add BCP-47 tags to consider existing varieties/regionalisms within a language (https://huggingface.co/datasets/AmazonScience/massive/discussions/1). If a language does not belong to either of these two standards, then a request should be made to the HF team to add it manually.

To return to the present discussion, thank you for the various databases and methodologies you mention. It makes a big difference to have linguists in the loop 🚀.

I have a couple of questions where I think an expert perspective would be appreciated:

HughP commented 1 year ago

I invite you to read them. But as a quick summary, the exchanges were oriented towards the ISO standard (the first HF system was based on it and it is generally the standard indicated in AI/DL papers), favouring ISO 639-1 if it exists, and falling back to ISO 639-2 or ISO 639-3 if it doesn't. In addition, it is possible to add BCP-47 tags to consider existing varieties/regionalisms within a language (https://huggingface.co/datasets/AmazonScience/massive/discussions/1). If a language does not belong to either of these two standards, then a request should be made to the HF team to add it manually.

One comment on this fallback system (which generally follows the BCP-47 process). ISO 639-2 has some codes which refer to a language ambiguously. For example, I believe the code ara is used for Arabic. In some contexts Arabic is considered a single language; however, Egyptian Arabic is quite different from Moroccan Arabic, and both are considered separate languages. These ambiguous codes are valid ISO 639-3 codes, but they have a special status: they are called macro codes. They exist inside the ISO 639-3 standard to provide absolute fallback compatibility between ISO 639-2 and ISO 639-3. However, given the unforeseen potential applications of AI and MT language data and the potential for bias, macro codes should be avoided when newly applying language tags to resources. For historical cases, where it is not clear what resources were used to create the AI tools or datasets, I understand the use of ambiguous tags. So for clarity in language tagging I suggest:

  1. Strictly following BCP-47
  2. Whenever possible, avoid the use of macro tags in the ISO 639-3 standard. These are BCP-47 valid, but their use can introduce biases downstream. (Generally there are more specific tags available to use in the ISO 639-3 standard; a minimal check for macro codes is sketched below.)
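
To make point 2 concrete, here is a hedged sketch that flags ISO 639-3 macrolanguage codes using the registrar's published code table (the download URL is the one cited later in this thread; the column names follow the published tab-separated format):

```python
# Sketch: detect ISO 639-3 macrolanguage codes (Scope == "M") from the official
# tab-separated code table, so they can be avoided when tagging new resources.
import csv
import io
import urllib.request

ISO_639_3_TABLE = "https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab"

def macrolanguage_codes():
    raw = urllib.request.urlopen(ISO_639_3_TABLE).read().decode("utf-8")
    reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
    return {row["Id"] for row in reader if row["Scope"] == "M"}

macros = macrolanguage_codes()
print("ara" in macros)  # True: 'ara' is a macrolanguage; prefer a specific code such as 'arz' (Egyptian Arabic)
print("arz" in macros)  # False: an individual-language code
```
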
HughP commented 1 year ago
  • Are there any databases that take into account all the existing sign languages in the world? It would be nice to have them included on the Hub.

Sign languages present an interesting case. As I understand the situation, the identification of sign languages has been identified as a component of their endangerment. Some sign languages do exist in ISO 639-3. For further discussion on the issue I refer readers to the following publications:

One way to be BCP-47 compliant and identify a sign language which is not identified in any of the BCP-47 referenced standards is to use the ISO 639-3 code for undetermined language und and then apply a custom suffix indicator (as explained in BCP-47) -x- and a custom code, such as the ones used in https://doi.org/10.3390/languages7010049

HughP commented 1 year ago
  • Is there an international classification of languages? A bit like the International Classification of Diseases in medicine, which is established by the WHO and used as a reference throughout the world. The idea would be to have a precise number of languages to which we would then have to assign a unique tag in order to find them later.

Yes, that would be the function of ISO 639-3. It is the reference standard for languages. It includes a code, its name, and the status of the code. Many technical metadata standards for file and computer interoperability reference it, and many technical library metadata standards reference it. Some linguists use it. Many governments reference it.

Indexing diseases is different from indexing languages in several ways; one way is that diseases are the impact of a pathogen, not the pathogen itself. If we take COVID-19 as an example, there are many varieties of the pathogen but broadly speaking there is only one disease, with many symptoms.

HughP commented 1 year ago
  • When you look up a language on Wikipedia, it usually shows, in addition to the ISO standard, the codes in the Glottolog (which you have already mentioned), ELP and Linguasphere databases. Would you have any opinion about these two other databases?

While these do appear on Wikipedia, I don't know of any information system which uses these codes. I do know that Glottolog did import ELP data at one time and its database does contain ELP data; I'm not sure if Glottolog regularly ingests new versions of ELP data. I suspect that the use of Linguasphere data may be relevant to users of Wikidata as a linked data attribute, but I haven't heard of any linked data projects using Linguasphere data for analysis or product development. My impression is that it is fairly unused.

HughP commented 1 year ago
  • Do you think it's possible to easily handle tags that have been deprecated potentially for decades? For example (I'm taking the case of Hebrew but this has happened for other languages) I tagged Google models with the "iw" tag because I based it on what the authors gave in their paper (see table 6, page 12). It turns out that this ISO tag has in fact been deprecated since 1989 in favour of the "he" tag. It would therefore be necessary to have a verification that transforms the old tags into the most recent ones.

Yes. You can parse the IANA file linked to above (it is regularly updated). All deprecated tags are marked as such in that file. The new preferred tag, if there is one, is indicated. ISO 639-3 also indicates a code's status, but their list is relevant only to codes within their domain (ISO 639-3).
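
As a rough illustration of that approach: the IANA language subtag registry is a plain-text file of records separated by "%%" lines, and deprecated entries carry a Deprecated field and, usually, a Preferred-Value field. The parsing below is deliberately simplistic (it ignores continuation lines) and is only a sketch:

```python
# Sketch: map deprecated subtags to their preferred replacements by parsing the
# IANA language subtag registry.
import urllib.request

REGISTRY_URL = ("https://www.iana.org/assignments/language-subtag-registry/"
                "language-subtag-registry")

def deprecated_to_preferred():
    text = urllib.request.urlopen(REGISTRY_URL).read().decode("utf-8")
    mapping = {}
    for record in text.split("%%\n"):
        fields = dict(
            line.split(": ", 1) for line in record.splitlines() if ": " in line
        )
        if "Deprecated" in fields and "Subtag" in fields:
            # Preferred-Value is absent for a few deprecated subtags.
            mapping[fields["Subtag"]] = fields.get("Preferred-Value")
    return mapping

print(deprecated_to_preferred().get("iw"))  # 'he' (the Hebrew example discussed above)
```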

HughP commented 1 year ago

I would interpret en-fr as English as spoken in France. fr in this position refers to the geo-political entity, not a second language. I see no reason that other linguists should have a different opinion after having read BCP-47 and understood how it works.

The functional goal here is to tag a language resource as being produced by non-native speakers, while tagging both languages. There are several problems here. The first is that BCP-47 has no explicit way to do this. One could use the sub-code x- with a private use code to indicate a second language and infer some meaning as to that language's role. However, there is another problem here which complicates the situation greatly... how do we know that those English speakers (in France, or from France, or who were native French speakers) were not speaking their third or fourth language rather than their second language? So a sub-tag which indicates the first language of a speech act, for speakers using a second (or other) language, would need to be carefully conceptualized and crafted. It might then be proposed to the appropriate authorities. For example, three sub-tags exist.

There are three registered sub-tags out of the 35 that BCP-47 allows. These are x-, u-, and t-. u- and t- are defined in RFC6067 and RFC6497. For more information see the Unicode CLDR documentation, where it says:

IETF BCP 47 Tags for Identifying Languages defines the language identifiers (tags) used on the Internet and in many standards. It has an extension mechanism that allows additional information to be included. The Unicode Consortium is the maintainer of the extension ‘u’ for Locale Extensions, as described in rfc6067, and the extension 't' for Transformed Content, as described in rfc6497.

The subtags available for use in the 'u' extension provide language tag extensions that provide for additional information needed for identifying locales. The 'u' subtags consist of a set of keys and associated values (types). For example, a locale identifier for British English with numeric collation has the following form: en-GB-u-kn-true

The subtags available for use in the 't' extension provide language tag extensions that provide for additional information needed for identifying transformed content, or a request to transform content in a certain way. For example, the language tag "ja-Kana-t-it" can be used as a content tag indicates Japanese Katakana transformed from Italian. It can also be used as a request for a given transformation.

For more details on the valid subtags for these extensions, their syntax, and their meanings, see LDML Section 3.7 Unicode BCP 47 Extension Data.

alexis-michaud commented 1 year ago

Hi @lbourdois ! Many thanks for the detailed information.

Discussions on the need to improve the Hub's tagging system (applying to both datasets and models) can be found in the following discussion: huggingface/hub-docs#193

Fascinating topic! To me, the following suggestion has a lot of appeal: "if [we] consider that it was necessary to create an ISO 639-3 because ISO 639-1 was deficient, it would be [better] to do the reverse and thus convert the tags from ISO 639-1 to ISO 639-2 or 3 (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes or https://iso639-3.sil.org/code_tables/639/data)."

Yes, ISO 639-1 is unsuitable because it has so few codes: less than 200. To address linguistic diversity in 'unrestricted mode', a list of all languages is wanted.

The idea of letting people use their favourite nomenclature and automatically adding the ISO 639-3 three-letter code as a tag is appealing. Thus all the HF datasets would have three-letter language tags (handy for basic search), alongside the authors' preferred tags and language names (including Glottolog tags as well as ISO 639-{1, 2}, and all other options allowed by BCP-47).

Retaining the authors' original tags and language names would be best.

Thus there would be a BCP-47 tag (sounds like a solid technical choice, though not 'passer-by-friendly': requiring some expertise to interpret) plus an ISO 639-3 tag that could be grabbed easily, and (last but not least) language names spelled out in full. Searches would be easier. No information would be lost.

Are industry practices so conservative that many people are happy with two-letter codes, and consider ISO 639-3 three-letter codes an unnecessary complication? That would be a pity, since there are so many advantages to using longer lists. (Somewhat like the transition to Unicode: sooo much better!) But maybe that conservative attitude is widespread, and it would then need to be taken into account. In which case, one could consider offering two-letter codes as a search option. Internally, the search engine would look up the corresponding 3-letter codes, and produce the search results accordingly.

Now to the other questions:

  • Do you think it's possible to easily handle tags that have been deprecated potentially for decades? For example (I'm taking the case of Hebrew but this has happened for other languages) I tagged Google models with the "iw" tag because I based it on what the authors gave in their paper (see table 6, page 12). It turns out that this ISO tag has in fact been deprecated since 1989 in favour of the "he" tag. It would therefore be necessary to have a verification that transforms the old tags into the most recent ones.

I guess that the above suggestion takes care of this case. The original tag (in this example, "iw") is retained (facilitating cross-reference with the published paper, and respecting the reality: the way the dataset was originally tagged). This old tag goes into the BCP-47 field of the dataset, which can handle quirks & oddities like this one. And a new tag is added in the ISO 639-3 field: the 3-letter code "heb".
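
As a sketch of that two-field approach with the Python langcodes library (introduced later in this thread by @lbourdois), assuming langcodes' bundled registry data behaves as documented:

```python
# Sketch: keep the authors' original tag, but also derive a normalized BCP-47
# tag and an ISO 639-3 code from it. Requires `pip install langcodes`.
import langcodes

original_tag = "iw"                                            # as published in the paper
normalized = langcodes.standardize_tag(original_tag)           # 'he': deprecated tag replaced
iso639_3 = langcodes.Language.get(original_tag).to_alpha3()    # 'heb': three-letter code

dataset_tags = {
    "language": [iso639_3],           # searchable three-letter field
    "language_bcp47": [original_tag], # the tag the authors actually used
}
print(normalized, dataset_tags)
```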

  • When you look up a language on Wikipedia, it usually shows, in addition to the ISO standard, the codes in the Glottolog (which you have already mentioned), ELP and Linguasphere databases. Would you have any opinion about these two other databases?

I'm afraid I had never heard of Linguasphere. The online register for Linguasphere (PDF) seems to be from 1999-2000. It seems that the level of interoperability is not very high right now. (By contrast, Glottolog has pyglottolog, and in my experience contacts flow well.)

The Endangered Languages Project is something Google started but initially did not 'push' very strongly, it seems. Just airing an opinion on the public Internet, it seems that the project is now solidly rooted at University of Hawaiʻi at Mānoa. It seems that they do not generate codes of their own. They refer to ISO 639-3 (Ethnologue) as a code authority when applicable, and otherwise provide comments in so many words, such as that language L currently lacks an Ethnologue code of its own (example here).

  • On the Hub, there is the following dataset where French people speak in English: https://huggingface.co/datasets/Datatang/French_Speaking_English_Speech_Data_by_Mobile_Phone Is there a database to take this case into account? I have not found any code in the Glottolog database. If based on an IETF BCP-47 standard, I would tend to tag the dataset with "en-fr" but would this be something accepted by linguists? Based on the first post in this thread that there are about 8000 languages, if one considers that a given language can be pronounced by a speaker of the other 7999, that would theoretically make about 64 million BCP-47 language1-language2 codes existing. And even much more if we consider regionalisms with language1_regionalism_x-language2_regionalism_y. I guess there is no such database.

Yes, you noted the difficulty here: that there are so many possible situations. Eventually, each dataset would require descriptors of its own. @BenjaminGalliot points out that, in addition to specifying the speakers' native languages, the degree of language proficiency would also be relevant. How many years did the speakers spend in which area? Talking which languages? In what chronological order? Etc. The complexity defies encoding. The purpose of language codes is to allow for searches that group resources into sets that make sense. Additional information is very important, but would seem to be a matter for 'comments' fields.

  • Is there an international classification of languages? A bit like the International Classification of Diseases in medicine, which is established by the WHO and used as a reference throughout the world. The idea would be to have a precise number of languages to which we would then have to assign a unique tag in order to find them later.

As I understand, Ethnologue and Glottolog both try to do that, each in its own way. The simile with diseases seems interesting, to some extent: in both cases it's about human classification of phenomena that have complexity (though some diseases are simpler than others, whereas all languages have much complexity, in different ways).

  • Finally, when can we expect to see all the datasets of Pangloss on HF? 👀 And I don't know if you have a way to help to add the datasets of CoCoON as well.

Three concerns: (i) Technical specifications: we have not yet received feedback on the Japhug and Na datasets in HF. There may be technical considerations that we have not yet thought of and that would need to be taken into account before 'bulk upload'. (ii) Would there be a way to automate the process? The way @BenjaminGalliot did it for Japhug and Na, there was a manual component involved, and doing it by hand for all 200 datasets would not be an ideal workflow, given that the metadata are all clearly arranged. (iii) Some datasets are currently under a 'No derivatives' Creative Commons license. We could go back to the depositors and argue that the 'No derivatives' mention would be best omitted (see here a similar argument about publications): again, we'd want to be sure about the way forward before we set the process into motion.

Our hope would be that some colleagues try out the OutilsPangloss download tool, assemble datasets from Pangloss/Cocoon as they wish, then deposit them to HF.

HughP commented 1 year ago

The idea of letting people use their favourite nomenclature and automatically adding the ISO 639-3 three-letter code as a tag is appealing. Thus all the HF datasets would have three-letter language tags (handy for basic search), alongside the authors' preferred tags and language names (including Glottolog tags as well as ISO 639-{1, 2}, and all other options allowed by BCP-47).

Retaining the authors' original tags and language names would be best.

  • For language names: some people favour one name over another and it is important to respect their choice. In the case of Yongning Na: alternative names include 'Mosuo', 'Narua', 'Eastern Naxi'... and the names carry implications: people have been reported to come to blows about the use of the term 'Mosuo'.
  • For language tags: Glottocodes can be more fine-grained than Ethnologue (ISO 639-3), and some colleagues feel strongly about those.

Thus there would be a BCP-47 tag (sounds like a solid technical choice, though not 'passer-by-friendly': requiring some expertise to interpret) plus an ISO 639-3 tag that could be grabbed easily, and (last but not least) language names spelled out in full. Searches would be easier. No information would be lost.

@alexis-michaud raises an excellent point. Language Resource users have varying search habits (or approaches). This includes cases where two or more language names refer to a single language. A search utility/interface needs to be flexible and able to present results from various kinds of input in the search process. This could be like how the terms French/Français/Französisch (en/fr/de) are names for the same language, or it could be a variety of the following: autoglottonyms (how the speakers of the language refer to their language), or exoglottonyms (how others refer to the language). Additionally, in web-based searches I have also needed to implement diacritic-sensitive and -insensitive logic so that users can type with or without diacritics and not have results unnecessarily excluded.

Depending on how detailed a search problem HF seeks to solve, it may be better to offload complex search to search engines like OLAC, which aggregate a lot of language resources. As I mentioned above, I can assist with the informatics of creating an OLAC feed.

Abstracting search logic from actual metadata may prove a useful way to lower the technical debt overhead. Technical tools and library standards use ISO and BCP-47 Standards. So, from a bibliographic metadata perspective this seems to be the way forward with the widest set of use cases.

lbourdois commented 1 year ago

To get a visual idea of these first exchanges, I coded a Streamlit app that I put online on Spaces: https://huggingface.co/spaces/lbourdois/Language-tags-demo. The code is in Python, so I don't know if it can be used by HF, who seem to need something in Node.js, but it serves as a proof of concept. The advantage is also that you can directly test ideas by entering things in a search bar and seeing what comes up.

This application is divided into 3 points:

To code these two points, I tested two approaches.

  1. The first one (internal DB in the app) consists in querying a database that HF would have locally. To create this database, I merged the ISO 639 database (https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab) and the Glottolog database (https://glottolog.org/meta/downloads). The result of this merge is visible in the 3rd point of the application, which is an overview of the database. In the image below, on line 1 of the database, we can see that the Glottolog database gives an ISO 639-3 code (column ISO639P3code) but the ISO 639 database does not (column 639-3). Do you have an explanation for this phenomenon? [screenshot]

For BCP 47 codes of the type fr-CA, I have retrieved the ISO 3166-1 alpha-2 codes of the territories (https://www.iso.org/iso-3166-country-codes.html). In practice, if we enter fr-CA, the letters before the - refer to a language in the Name column for which 639-1 == fr (fra or fre in 639-3/639-2) in the database shown in my image above. Then I look at the letters after the -, which refer to a territory. It comes out as French (Canada). I used https://cldr.unicode.org/translation/displaynames/languagelocale-name-patterns for the pattern that came up.

  2. The second approach (with langcodes lib in the app) consists in using the Python langcodes library (https://github.com/rspeer/langcodes) which offers a lot of features in ready-made functions. It manages for example deprecated codes, the validity of an entered code, gives languages from code in the language of your choice (by default in English, but also autoglottonyms), etc. I invite you to read the README of the library. The only negative point is that it hasn't been updated for 10 months so basing your tag system on an external tool that isn't necessarily up to date can cause problems in the long run. But it is certainly an interesting source.

Finally, I have added some information on the number of people speaking/reading the language(s) searched (figures provided by langcodes, which are based on those given by ISO). This is not relevant for our topic, but these are figures that could be added as information on the https://huggingface.co/languages page.
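
For reference, a few of the ready-made langcodes lookups mentioned above are sketched below (exact outputs depend on the installed registry/CLDR data; the speaker figures need the optional language_data package):

```python
# Sketch of langcodes lookups: display names, autoglottonyms and rough speaker
# counts. Requires `pip install langcodes language_data`.
import langcodes

print(langcodes.Language.get("fr-CA").display_name())     # e.g. 'French (Canada)'
print(langcodes.Language.get("fr").autonym())              # autoglottonym: 'français'
print(langcodes.Language.get("fr").speaking_population())  # approximate number of speakers
```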

What could be done to improve the app if I have time:

alexis-michaud commented 1 year ago

Very impressive! Using the prompt 'Japhug' (a language name), the app finds the intended language. [screenshot]

A first question: based on the Glottocode, would it be possible to grab the closest ISO639-3 code? In case there is no match for the exact language variety, one needs to explore the higher-level groupings, level by level. For this language (Japhug), the information provided in the extracted CSV file (glottolog-languoids-v4.6.csv) is: sino1245/burm1265/naqi1236/qian1263/rgya1241/core1262/jiar1240 One need not look further than the first higher-level grouping, jiar1240, to get an ISO639-3 code, namely jya.

Thus users searching by language names would get ISO639-3 (often less fine-grained than Glottolog) as a bonus. It might be possible to ask the Glottolog team to provide this piece of information as part of an export from their database.
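
A possible sketch of that lookup with pyglottolog, walking up the classification until an ancestor carries an ISO 639-3 code (the attribute names are my assumption from the pyglottolog API and should be double-checked):

```python
# Sketch: nearest ISO 639-3 code for a given Glottocode, walking up the Glottolog tree.
from pyglottolog import Glottolog

glottolog = Glottolog("/path/to/glottolog")  # local clone of the data repository

def closest_iso639_3(glottocode):
    languoid = glottolog.languoid(glottocode)
    while languoid is not None:
        if languoid.iso:            # exact or higher-level match
            return languoid.iso
        languoid = languoid.parent  # one level up in the classification
    return None                     # isolate or family with no coded ancestor

print(closest_iso639_3("japh1234"))  # per the discussion above: 'jya', via jiar1240
```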

alexis-michaud commented 1 year ago

on line 1 of the database, we can see that the Glottocode database gives an ISO 639-3 code (column ISO639P3code) but not the ISO 639 database (column 639-3). Do you have an explanation for this phenomenon?

That is because the language name 'Aewa' is not found in the Ethnologue (ISO 639-3) export that you are using. This export in table form only has one reference name (Ref_Name). For the language at issue, it is not 'Aewa' but 'Awishira'.

By contrast, the language on line 0 of the database is called 'Abinomn' by both Ethnologue and Glottolog, and accordingly, columns ISO639P3code and 639-3 both contain the ISO 639-3 code, bsa.

The full Ethnologue database records alternate names for each language, and I'd bet that 'Aewa' is recorded among alternate names for the 'Ashiwira' language. I can't check because the full Ethnologue database is paywalled. [screenshot]

Glottolog does provide the corresponding ISO 639-3 code for 'Aewa', ash, which is an exact match (it refers to the same variety as Glottolog abis1238). In this specific case, Glottolog provides all the relevant information. I'd say that Glottolog can be trusted for all the codes they provide, including ISO 639-3 codes: they only include them when the match is good.

See previous comment about the cases where there is no exact match between Glottolog and ISO 639-3 (suggested workaround: look at a higher-level grouping to get an ISO 639-3 code).

lbourdois commented 1 year ago

I will add these two points to my TODO list.

HughP commented 1 year ago
  • Integrate ISO 3166-2 subdivision codes (https://www.iso.org/obp/ui#iso:pub:PUB500001:en)? They offer a finer granularity than ISO 3166-1, which is limited to the country level, but they are very administrative (for French, ISO 3166-2 gives us the "départements", for example).

I'm concerned by this sort of exploration. Not because I am against innovation; in fact, this is an interesting thought exercise. However, to explore this thought further creates cognitive dissonance between BCP-47 authorized codes and other code sets which are not BCP-47 compliant. For that reason, I think adding additional codes is a waste of time both for HF devs and for future users who get a confusing idea about language tagging.

BenjaminGalliot commented 1 year ago

Good job for the application!

On the Hub, there is the following dataset where French people speak in English: https://huggingface.co/datasets/Datatang/French_Speaking_English_Speech_Data_by_Mobile_Phone Is there a database to take this case into account? I have not found any code in the Glottolog database. If based on an IETF BCP-47 standard, I would tend to tag the dataset with "en-fr" but would this be something accepted by linguists? Based on the first post in this thread that there are about 8000 languages, if one considers that a given language can be pronounced by a speaker of the other 7999, that would theoretically make about 64 million BCP-47 language1-language2 codes existing. And even much more if we consider regionalisms with language1_regionalism_x-language2_regionalism_y. I guess there is no such database.

Yes, you noted the difficulty here: that there are so many possible situations. Eventually, each dataset would require descriptors of its own. @BenjaminGalliot points out that, in addition to specifying the speakers' native languages, the degree of language proficiency would also be relevant. How many years did the speakers spend in which area? Talking which languages? In what chronological order? Etc. The complexity defies encoding. The purpose of language codes is to allow for searches that group resources into sets that make sense. Additional information is very important, but would seem to be a matter for 'comments' fields.

To briefly complete what I said on this subject in a private discussion group: there is a lot of (meta)data associated with each element of a corpus (which language level, according to which criteria, knowing that even among native speakers there are differences, some of which may go beyond what seems obvious to us from a linguistic point of view: socio-professional category, life history, environment in the broad sense, etc.), which can be placed in ad-hoc columns, or more freely in a comment/note column. And it is the role of the researcher (in this case a linguist, in all likelihood) to do analyses (statistics...) to determine the relevant data, including criteria that may justify separating different languages (in the broad sense), making separate corpora, etc. Putting this information in the language code is, in my opinion, doing the job in the opposite and wrong direction, and it brings other problems, like where to stop in the list of multidimensional criteria to be integrated. So, in my opinion, here, the minimum is the best (the important thing is to have well-documented data, globally, by sub-corpus or by line)...

If you are going to use Glottolog codes use them after an -x- tag in the BCP-47 format to maintain BCP-47 validity.

Yes, for the current corpora, I have written:

language:
- jya
- nru
language_bcp47:
- x-japh1234
- x-yong1288
  • Add autoglottonyms? (I only handle English language names for the moment)

Autoglossonyms are useful (I use them prior to other glossonyms), but I'm not sure there is an easy way to retrieve them. We can find some of them in the "Alternative Names" panel of Glottolog, but even if we have an API to retrieve them easily, their associated language code will often not be the one we are in (hence the need to do several cycles to find one, which might not be the right one...). Maybe this problem needs more investigation...

For the point of adding the closest ISO 639-3 code for a Glottolog code, what convention should be adopted for the output? Just the ISO 639-3 code, or the ISO 639-3 code - Glottolog code, or the ISO 639-3 code - language name? To use the example of Japhug, should it be just jya, or jya-japh1234, or jya-Japhug?

I strongly insist on not adding a language name after the code: it would restart a spiral of problems, notably the choice of the language in question:

HughP commented 1 year ago

To get a visual idea of these first exchanges, I coded a Streamlit app that I put online on Spaces: https://huggingface.co/spaces/lbourdois/Language-tags-demo. The code is in Python, so I don't know if it can be used by HF, who seem to need something in Node.js, but it serves as a proof of concept. The advantage is also that you can directly test ideas by entering things in a search bar and seeing what comes up.

This is really great. You're doing a fantastic job. I love watching the creative process evolve. It is exciting. Let me provide some links to some search interfaces for further inspiration. I always find it helpful to know how others have approached a problem when figuring out my approach. I will link to three examples: Glottolog, r12a's language sub-tag chooser, and the FLEx project builder wizard. The first two are online, but the last one is in an application which must be downloaded and works only on Windows or Linux. I have placed some notes on each of the screenshots.

[screenshots: Glottolog search interface; r12a's language sub-tag chooser]

HughP commented 1 year ago

In practice, if we enter fr-CA, the letters before the - refer to a language in the Name column for which 639-1 == fr (fra or fre in 639-3/639-2) in the database shown in my image above. Then I look at the letters after the -, which refer to a territory. It comes out as French (Canada). I used https://cldr.unicode.org/translation/displaynames/languagelocale-name-patterns for the pattern that came up.

What you are doing is looking at the algorithm for locale generation rather than BCP-47's original documentation. I'm not sure there are differences; there might be. I know that locale IDs generally follow BCP-47, but I think there are some differences, such as the use of _ vs. -.

HughP commented 1 year ago

A first question: based on the Glottocode, would it be possible to grab the closest ISO639-3 code? In case there is no match for the exact language variety, one needs to explore the higher-level groupings, level by level. For this language (Japhug), the information provided in the extracted CSV file (glottolog-languoids-v4.6.csv) is: sino1245/burm1265/naqi1236/qian1263/rgya1241/core1262/jiar1240 One need not look further than the first higher-level grouping, jiar1240, to get an ISO639-3 code, namely jya.

Thus users searching by language names would get ISO639-3 (often less fine-grained than Glottolog) as a bonus. It might be possible to ask the Glottolog team to provide this piece of information as part of an export from their database.

This is logical, but the fine-grained assertions are not the same. That is, just because they are in a hierarchical structure today doesn't mean they will be tomorrow. In some cases Glottolog is clearly referring to sub-language variants which will never receive full language status, whereas in other cases Glottolog is referring to unequal entities, one or more of which should be a language. Many of the codes in Glottolog have no associated documentation indicating what sort of speech variety they are.

HughP commented 1 year ago

@lbourdois

I'm confused here... if there is no ISO639-3 code in the official database from the registrar, why would you look for an "unofficial" code from someone else? What is the use case here?

alexis-michaud commented 1 year ago

For the point of adding the closest ISO 639-3 code for a Glottolog code, what convention should be adopted for the output? Just the ISO 639-3 code, or the ISO 639-3 code - Glottolog code, or the ISO 639-3 code - language name? To use the example of Japhug, should it be just jya, or jya-japh1234, or jya-Japhug?

(Answer edited in view of Benjamin Galliot's comment.) Easy part of the answer first: jya-Japhug is out because, as @BenjaminGalliot pointed out above, mixing language names with language codes will make trouble. Admittedly, for Japhug itself, jya-Japhug looks rather good: the pair looks nice, the one (jya) packed together, the other (Japhug) good and complete while still pretty compact. But think about languages like 'Yongning Na' or 'Yucatán Maya': a code with a space in the middle, like nru-Yongning Na, is really unsightly and unwieldy, no?

Some principles for language naming in English have been put forward, with some linguistic arguments, but even supposing that such standardization is desirable, actual standardization of language names in English may well never happen.

As for jya-japh1234: again, at first sight it seems cute, combining two fierce competitors (Ethnologue and Glottolog) into something that gets the best of both worlds. But @HughP has a point: "adding additional codes is a waste of time both for HF devs and for future users who get a confusing idea about language tagging". Strong wording, for an important comment: better stick with BCP 47.

So the solution pointed out by Benjamin, from Frances Gillis-Webber and Sabine Tittel, looks attractive: jya-x-japh1234

On the other hand, if the idea for HF Datasets is simply to add the closest ISO 639-3 code for a Glottolog code, maybe it could be provided simply in three letters: providing the 'raw' ISO 639-3 code jya. Availability of 'straight' ISO 639-3 codes could save trouble for some users, and those who want more detail could look at the rest of the metadata and general information associated with datasets.

BenjaminGalliot commented 1 year ago

The problem seems to have already been raised here: https://drops.dagstuhl.de/opus/volltexte/2019/10368/pdf/OASIcs-LDK-2019-4.pdf

An example can be seen here:

3.1.2 The use of privateuse sub-tag
In light of unambiguous language codes being available for the two Khoisan varieties, we propose to combine the ISO 639-3 code for the parent language N‖ng, i.e., ‘ngh’, with the privateuse sub-tag ‘x-’ and the respective Glottocodes stated above. The language tags for N|uu and ‖’Au can then be defined accordingly:
N|uu: ngh-x-nuuu1242
‖’Au: ngh-x-auni1243

By the way, while searching for this, I came across this application: https://huggingface.co/spaces/cdleong/langcode-search

alexis-michaud commented 1 year ago

I'm confused here... if there is no ISO639-3 code in the official database from the registrar, why would you look for an "unofficial" code from someone else? What is the use case here?

Hi @HughP, I'm happy to clear up whatever confusion may exist here 😇 Here is the use case. Guillaume Jacques (@rgyalrong) put together a sizeable corpus of the Japhug language. It is up on HF Datasets (here) as well as on Zenodo.

Zenodo is an all-purpose repository without adequate domain-specific metadata ("métadonnées métier"), and the deposits in there are not easy to locate. The Zenodo deposit is intended for a highly specific use case: someone reads about the dataset in a paper, goes to the address on Zenodo and grabs the dataset in one go.

HF Datasets, on the other hand, allows users to look around among corpora. The Japhug corpus needs proper tagging so that HF Datasets users can find out about it. Japhug has an entry of its own in Glottolog, whereas it lacks an entry of its own in Ethnologue. Hence the practical usefulness of Glottolog. Ethnologue pools together, under the code jya, three different languages (Japhug, Tshobdun tsho1240 and Zbu zbua1234).

I hope that this helps.

julien-c commented 1 year ago

By the way, while searching for this, I came across this application: https://huggingface.co/spaces/cdleong/langcode-search

Really relevant Space, so tagging its author @cdleong, just in case!

alexis-michaud commented 1 year ago

@cdleong A one-stop shop for language codes: terrific! How do you feel about the use of Glottocodes? When searching the language names 'Japhug' and 'Yongning Na' (real examples, related to a HF Datasets deposit & various research projects), the relevant Glottocodes are retrieved, and that is great (and not that easy, notably with the space in the middle of 'Yongning Na'). But this positive result is 'hidden' in the results page. Specifically:

"'x-japh1234' parses meaningfully as a language tag according to IANA"

but there is paradoxically no link provided to Glottolog: the 'Glottolog' part of the results page is empty. [screenshot]

Trying to formulate a conclusion (admittedly, this note is not based on intensive testing; it is just feedback on initial contact): from a user perspective, it seems that the tool could lean more heavily on Glottolog. langcode-search does a great job querying Glottolog, so why not make more extensive use of that information (including to arrive at the nearest ISO 639-3 code)?

lbourdois commented 1 year ago

Very interesting things. In my mind, I was going to start with a version 1 of the system dealing with the priority: finding a language from a code, and giving a code from a language. In a version 2, I would see a more advanced search system allowing additional sorting to be applied (as visible in the screenshots), to be able to search on part of a name or to apply a Levenshtein distance to manage typos, etc. This system could also contain a filter on all the data mentioned by @BenjaminGalliot in his paragraph:

To briefly complete what I said on this subject in a private discussion group: there is a lot of (meta)data associated with each element of a corpus (which language level, according to which criteria, knowing that even among native speakers there are differences, some of which may go beyond what seems obvious to us from a linguistic point of view: socio-professional category, life history, environment in the broad sense, etc.), which can be placed in ad-hoc columns, or more freely in a comment/note column. And it is the role of the researcher (in this case a linguist, in all likelihood) to do analyses (statistics...) to determine the relevant data, including criteria that may justify separating different languages (in the broad sense), making separate corpora, etc. Putting this information in the language code is, in my opinion, doing the job in the opposite and wrong direction, and it brings other problems, like where to stop in the list of multidimensional criteria to be integrated. So, in my opinion, here, the minimum is the best (the important thing is to have well-documented data, globally, by sub-corpus or by line)...

And this was not explicitly mentioned, but I think it should be included with all the data in the paragraph above: filtering on demographics (age and gender) would also be relevant, particularly for ethical AI projects (men are over-represented compared to women in the Common Voice dataset, for example, as well as younger speakers compared to older ones). HF, having an ethical AI team, will probably need to tag these when we get to that point.

Regarding the different points made about jya vs jya-japh1234 vs jya-Japhug, your explanations make perfect sense. I would point to the closest ISO 639-3 in the Glottolog tree. I suggested jya-japh1234 and jya-Japhug because on Wikipedia, for the Gallo language, you can see: [screenshot]

As Gallo is not a territory in the ISO 3166 database, I assumed that the proposed code is not a BCP 47 code of the language-territory type but rather language_family-language_name, which could have been reproduced with Glottolog. From your explanations, I would then tend to think that the code indicated on Wikipedia is not relevant.

It might be possible to ask the Glottolog team to provide this piece of information as part of an export from their database.

Concerning this point, I managed to find the nearest ISO 639-3 code for 96.2% of the Glottolog database. An example: [screenshot]

The remainder are languages or dialects where either no path is given (these are isolated languages not attached to any family: Atacame, Betoi-Jirara, Chono, Culli, Guachi, Guaicurian, Guamo, Jirajaran, Kariri, Maratino, Matanawí, Mimi-Gaudefroy, Mure, Oyster Bay-Big River-Little Swanport, Payagua, Ramanos, Sechuran, Tallán, Timote-Cuica, Yurumanguí), or a path is given but no element of the path has an ISO 639 code (968 languages or dialects concerned). It would be interesting to contact the Glottolog team about these 988 cases. I have put them in the following csv file: Glottocode_wo_ISO693.csv.

The second one is that I fixed the bug that was present in the demonstrator. You can now search for a BCP 47 tag or language without it having to be listed first in the search bar (previously, the application would crash otherwise); it can now be indicated anywhere. An example: [screenshot]

By the way, while searching for this, I came across this application: https://huggingface.co/spaces/cdleong/langcode-search

This is a very relevant point. However, we can see that it is not possible to manage several languages at the same time, that if we enter "French (Canada)" it returns the code "fr" and not "fr-CA", and that for Glottolog it does not return the nearest ISO 639-3 code. Nor does it indicate all the other available languages that are in the same family as the searched language. But it has the advantage of proposing the results of Vachan Engine and the autoglottonyms, which I do not propose (for the moment). We haven't talked about Vachan Engine so far, does it seem a relevant source to you?

alexis-michaud commented 1 year ago

The remainder are languages or dialects where either no path is given (these are isolated languages not attached to any family: Atacame, Betoi-Jirara, Chono, Culli, Guachi, Guaicurian, Guamo, Jirajaran, Kariri, Maratino, Matanawí, Mimi-Gaudefroy, Mure, Oyster Bay-Big River-Little Swanport, Payagua, Ramanos, Sechuran, Tallán, Timote-Cuica, Yurumanguí), or a path is given but no element of the path has an ISO 639 code (968 languages or dialects concerned). It would be interesting to contact the Glottolog team about these 988 cases. I have put them in the following csv file: Glottocode_wo_ISO693.csv.

What would be the question to the Glottolog team? I looked at a few of these cases and they look straightforward. Starting with the easy part: as you noted, some languages are not attached to any family. They are isolates, i.e. languages having no known relatives. They do not recognizably belong in any higher-level grouping. So, in case there is no ISO 639-3 for such a language, it is not possible to grab the closest one: there is no (known) closest language. Hence, the search for the closest ISO 639-3 code does not yield anything.

Now, to cases where a path is given but no element of the path has an ISO 639-3 code. As one goes up in the path (hierarchy), it is less and less likely that an ISO 639-3 code can be found, as ISO 639-3 codes are for languages, and neither for lower levels (dialects, sociolects...) nor for higher levels (groups of languages, like Northwest Germanic or Romance). To get an ISO 639-3 code, one needs to 'hit' exactly the level chosen by Ethnologue as the proper level for a 'language'.

It seems that Glottolog maintainers deliberately choose to avoid giving the next closest ISO 639-3 code in cases where they think it could be misleading. Such an example is Lebu Wolof. The Glottolog entry explains why they do not want to give either wol or wof as a three-letter code:

Lebu Wolof is listed in E16/E17/E18/E19/E20/E21/E22/E23/E24 as a dialect of (Senegalese-Mauretanian) Wolof [wol]. However, Lebu Wolof is not intelligible to other Wolof speakers (Cissé, Mamadou 2004: 14), e.g., Wolof speakers cannot readily understand the text specimina of Lebu in Angrand, Armand-Pierre 1952, as opposed to Gambian Wolof [wof] which is intelligible to (Senegalese-Mauretanian) Wolof [wol] but uses a different written standard and source of loanwords. The distinctness of Lebu Wolof is hidden by the fact that all Lebu Wolof speakers are also bilingual in non-Lebu Wolof. See also: Gambian Wolof [wof], Wolof [wol].

Lebu Wolof connects with 'plain' Wolof (wol) and Gambian Wolof (wof) at the first phylogenetic node, labelled 'Wolofic'. But that does not have a corresponding ISO 639-3 code, and thus no connection with wol and wof can be established through the 'upward path search' that you (@lbourdois) implemented. The information made available in the Glottolog entry makes it clear to human readers that Lebu Wolof is (as the name '... Wolof' strongly suggests) very close to [wol] and [wof], but retrieving that piece of information automatically is not at all straightforward. My personal point of view (for what it's worth) is that it would be good for HF Datasets users to have access to the nearest ISO 639-3 code in such cases (for Lebu Wolof: [wol] or [wof]). In this light, it would be great for interoperability if Glottolog provided a field containing the closest ISO 639-3 code(s) (with the required caveat that there is not a perfect match). That suggestion was made in their GitHub repo. But I appreciate that, from their perspective, it may appear as unnecessary and potentially misleading: additional work with no obvious gain for Glottolog.

alexis-michaud commented 1 year ago

We haven't talked about Vachan Engine so far, does it seem a relevant source to you?

This is the first we hear of it, I'm afraid...

cdleong commented 1 year ago

Hello all! I'm delighted that people are finding the space useful. I coded it up because the language code thing is a perennial problem for me and I needed a quick/easy way to look things up!

@alexis-michaud regarding the Glottolog lookups, you're quite right that the formatting with a big "Failure" hiding the fact that we got the glottolog id successfully is a bit... odd. Any suggestions on updating it? I think my thinking here was basically that if parsing the BCP 47 was not successful then I was unsure about subsequent lookups.

As for how I'm retrieving those, behind the scenes I'm basically just doing some very intro-level requests queries to glottolog.org; the source code's visible here: https://huggingface.co/spaces/cdleong/langcode-search/blob/main/app.py. Specifically right here: https://huggingface.co/spaces/cdleong/langcode-search/blob/main/app.py#L53. I'm not super great at requests so I'd be happy to incorporate any suggestions.

cdleong commented 1 year ago

@lbourdois regarding Vachan Engine, it's an experimental database run by some linguists I know. They're trying to build a big database mostly for Bible-related things. More here: https://github.com/Bridgeconn

cdleong commented 1 year ago

I haven't touched this in a while, I'd be quite happy to get ideas/suggestions for improvements.

https://huggingface.co/spaces/cdleong/langcode-search/blob/main/app.py#L7 is one idea I've had, to also have a look and try to provide the relevant Wikipedia codes. But that's a whole complex thing, see https://en.wikipedia.org/wiki/List_of_Wikipedias#Wikipedia_edition_codes

Edit: see also https://meta.wikimedia.org/wiki/List_of_Wikipedias#Nonstandard_language_codes

Edit 2: Oh! They've got a search feature! https://incubator.wikimedia.org/wiki/Special:SearchWiki

cdleong commented 1 year ago

Vachan has a bit of a focus on Indian languages, and various structured info as well. So I threw it in as an extra

HughP commented 1 year ago

2. The second approach (with langcodes lib in the app) consists in using the Python langcodes library (https://github.com/rspeer/langcodes) which offers a lot of features in ready-made functions. It manages for example deprecated codes, the validity of an entered code, gives languages from code in the language of your choice (by default in English, but also autoglottonyms), etc. I invite you to read the README of the library. The only negative point is that it hasn't been updated for 10 months so basing your tag system on an external tool that isn't necessarily up to date can cause problems in the long run. But it is certainly an interesting source.

I just read the documentation and the use-cases presented in the application. If it does all that, it is an amazing library (or suite of libraries)! Very useful. I wonder if there is a command to update its data sources, so that with a cron job it could be set to check for the latest IANA, ISO 639-3 and script codes? If the data sources can be updated independently of the code base, then I see no reason not to include this library as part of a production workflow... Still, an accompanying library may need to be built (as a plugin) to accomplish additional goals.
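For anyone who wants to try it quickly, here is a minimal sketch of the features described above (assuming langcodes plus its companion language_data package are installed; exact outputs depend on the registry/CLDR snapshot bundled with your install):

```python
# Minimal sketch of the langcodes features described above.
# Assumes: pip install langcodes language_data
# (language_data supplies the CLDR name tables used for display names).
import langcodes

# Deprecated codes are normalised to their replacements.
print(langcodes.standardize_tag("iw"))  # -> 'he' (old code for Hebrew)

# Language names in the language of your choice, including the autoglottonym.
fr = langcodes.Language.get("fr")
print(fr.display_name())        # -> 'French'
print(fr.display_name("de"))    # -> 'Französisch'
print(fr.autonym())             # -> 'français'
```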

HughP commented 1 year ago

As Gallo is not a territory in the ISO 3166 database, I assumed that the proposed code was not a BCP 47 code of the language-territory type but rather of the language_family-language_name type, which could have been reproduced with Glottolog. From your explanations, I would then tend to think that the code indicated on Wikipedia is not relevant.

This assumption ought to be revisited. The tag fr-gallo is valid because gallo is a registered subtag in the IANA Language Subtag Registry, not because it is a region. ISO 639-2 and ISO 639-3 both rejected arguments that Gallo was a language, citing different criteria in their independent application processes. However, the managers of the IANA subtag registry found sufficient evidence for a subtag. Hence the subtag is valid on Wikipedia. The inclusion of the subtag in the IANA registry is what makes fr-gallo a valid BCP 47 code.
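This is easy to check with the langcodes library mentioned above. A quick sketch (whether it returns True depends on the IANA registry snapshot bundled with your install being recent enough to contain the gallo entry):

```python
# Quick check that fr-gallo validates via the IANA registry, not via a region.
# The result depends on the registry snapshot bundled with your langcodes install.
import langcodes

print(langcodes.tag_is_valid("fr-gallo"))  # True: 'gallo' is a registered subtag
print(langcodes.tag_is_valid("fr-FR"))     # True: here 'FR' *is* a region subtag
print(langcodes.Language.get("fr-gallo"))  # parses as language 'fr' + variant 'gallo'
```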

cdleong commented 1 year ago

Regarding langcodes, it seems it pulls from this file: https://github.com/rspeer/langcodes/blob/master/langcodes/data/language-subtag-registry.txt

...and from this database: https://github.com/rspeer/language_data

langcodes can also refer to a database of language properties and names, built from Unicode CLDR and the IANA subtag registry, if you install language_data.

cdleong commented 1 year ago

Looking through the langcodes repo, it seems they had to manually update various code/files for CLDR v40. Also, it seems that there is a new version of CLDR out now, so langcodes probably ought to be updated: https://cldr.unicode.org/index/downloads lists a v41.

Edit: v41 documentation: https://cldr.unicode.org/index/downloads/cldr-41; it seems not much changed in the actual language data.

HughP commented 1 year ago

@cdleong I sit on the IANA mailing list that comments on subtag proposals. I know there is no annual cycle for updates; they just happen all the time. A proposal comes in, has a two-week review period, and is then considered. I also know that the ISO 639-3 code set has an annual review process, with updates usually in Feb/March... I'm not sure what the CLDR cycle is; I think it is annual. I'm also not sure what the ISO 15924 (scripts) cycle is. It might be connected to the CLDR cycle, as there is a large overlap between those two communities of technologists. So when considering an HF pipeline, these updates to the available codes should happen with regularity.

HughP commented 1 year ago

Lebu Wolof connects with 'plain' Wolof (wol) and Gambian Wolof (wof) at the first phylogenetic node, labelled 'Wolofic'. But that does not have a corresponding ISO 639-3 code, and thus no connection with wol and wof can be established through the 'upward path search' that you (@lbourdois) implemented. The information made available in the Glottolog entry makes it clear to human readers that Lebu Wolof is (as the name '... Wolof' strongly suggests) very close to [wol] and [wof], but retrieving that piece of information automatically is not at all straightforward.

@alexis-michaud This is a really great example, with presumably great documentation. I haven't read the resources, but I'm convinced on the strength of the evidence I see. In this case a valid BCP 47 code could be constructed using the ISO 639-3 code mis (see https://en.wikipedia.org/wiki/ISO_639-3#Special_codes). Maybe this is a way forward in these cases? Perhaps a BCP 47 code like mis-countryCode-x-glottocode: something in that syntax should validate.

alexis-michaud commented 1 year ago

Lebu Wolof connects with 'plain' Wolof (wol) and Gambian Wolof (wof) at the first phylogenetic node, labelled 'Wolofic'. But that does not have a corresponding ISO 639-3 code, and thus no connection with wol and wof can be established through the 'upward path search' that you (@lbourdois) implemented. The information made available in the Glottolog entry makes it clear to human readers that Lebu Wolof is (as the name '... Wolof' strongly suggests) very close to [wol] and [wof], but retrieving that piece of information automatically is not at all straightforward.

@alexis-michaud This is a really great example, with presumably great documentation. I haven't read the resources, but I'm convinced on the strength of the evidence I see. In this case a valid BCP 47 code could be constructed using the ISO 639-3 code mis (see https://en.wikipedia.org/wiki/ISO_639-3#Special_codes). Maybe this is a way forward in these cases? Perhaps a BCP 47 code like mis-countryCode-x-glottocode: something in that syntax should validate.

Do we agree on the goal? Use case: doing transfer learning between closely related languages/dialects. Browsing through the catalogue of HF datasets, users want to identify new language pairs, in order not to be stuck with hackneyed language pairs like British English<->American English, Dutch<->German, Spanish<->Portuguese and such. It's so much more motivating to work on exciting real challenges, and break new ground. The language tags should allow HF Datasets users to realize that Lebu Wolof is closely connected to 'plain' Wolof (wol) and Gambian Wolof (wof). Labelling the language as mis-sn-x-lebu1234 may allow the code to pass muster by BCP-47 standards. But how would it achieve the goal of relating the mis-sn-x-lebu1234 datasets (Lebu Wolof) with the datasets in other Wolof dialects: those tagged as [wol] and [wof]?

Using a flat list of closest ISO 639-3 codes for Glottocodes, Lebu Wolof would (i) be tagged by its Glottocode, lebu1234, and (ii) be tagged as having wol or wof (or both) as the closest ISO 639-3 code(s). (Why not also tag it with (iii) a BCP 47 code, mis-sn-x-lebu1234, though there may be issues with setting a country code.)
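To make this concrete, here is a toy sketch of what such a flat mapping plus a private-use tag could look like (the entries and the mis-...-x-... scheme are purely illustrative, not an existing standard):

```python
# Toy sketch of the proposal: a flat mapping from Glottocode to the closest
# ISO 639-3 code(s), plus a private-use BCP 47 tag. Purely illustrative.
CLOSEST_ISO = {
    "lebu1234": ["wol", "wof"],  # Lebu Wolof -> Wolof / Gambian Wolof
    "yong1288": ["nru"],         # Yongning Na -> closest ISO 639-3 code
}


def private_use_tag(glottocode, country=None):
    """Build a tag of the form mis(-COUNTRY)-x-GLOTTOCODE."""
    parts = ["mis"]
    if country:
        parts.append(country.upper())
    parts += ["x", glottocode]
    return "-".join(parts)


print(private_use_tag("lebu1234", "sn"))  # -> 'mis-SN-x-lebu1234'
print(CLOSEST_ISO["lebu1234"])            # -> ['wol', 'wof'], usable for search
```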

Selection of a closest match in terms of ISO 639-3 codes when none is 'spot-on' is admittedly a compromise, but one that quite a few linguists are likely to be accustomed to (even if they are not necessarily 100% comfortable with it), because repositories that belong to the OLAC network (Open Language Archives Community) require an ISO 639-3 code for each deposit. Thus, Laze, a language on which I did fieldwork, does not have an ISO 639-3 code, so I chose the one that seemed closest (nru) when depositing documents like this one. (Maybe OLAC will also accept Glottocodes as language labels in future, but that is a separate topic.)

For HF Datasets users, the ISO 639-3 code is a good first-pass criterion. Then of course users want to refine their search & put together data sets for their experiments, selecting their training corpus with the same care as cider brewers combine varieties of cider apples :apple: :apple: :apple: :green_apple: :green_apple:

HughP commented 1 year ago

Do we agree on the goal? Use case: doing transfer learning between closely related languages/dialects.

@alexis-michaud

I think that, from an engineering perspective, we need a clear data model. The language tag should directly relate to the content of the language resource, and it should only be applied in cases where it is accurate. "Near misses" or "near hits" introduce an element of bias into ML/AI/MT-based work and will (eventually) have unintended social consequences. A clear data model here also highlights what kind of search and retrieval processes must be run by the search interface and software. It is important, I think, to provide relevant or "close" matches for usability purposes. However, relevance should not be encoded in the language tag; rather, it needs to be in a relevance table that the search and retrieval software consults. That said, all language resources should have a language tag.
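A toy sketch of that separation, just to make the data model explicit (all tags and closeness scores below are placeholders, not real metadata):

```python
# Toy sketch of the data model described above: resources carry only exact
# tags; "closeness" lives in a separate relevance table consulted at search
# time. All tags and scores are placeholders.
DATASETS = {
    "corpus-lebu":  {"language": "mis-x-lebu1234"},
    "corpus-wolof": {"language": "wol"},
}

RELEVANCE = {
    # query tag -> {candidate tag: closeness score}
    "wol": {"mis-x-lebu1234": 0.9, "wof": 0.8},
}


def search(query_tag, min_score=0.5):
    """Return datasets tagged exactly with query_tag, plus close matches."""
    close = RELEVANCE.get(query_tag, {})
    return [
        name
        for name, meta in DATASETS.items()
        if meta["language"] == query_tag
        or close.get(meta["language"], 0.0) >= min_score
    ]


print(search("wol"))  # -> ['corpus-lebu', 'corpus-wolof']
```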

alexis-michaud commented 1 year ago

An update from the Glottolog team, answering the question of whether an ISO code of scope "individual" could be provided for each Glottocode:

"I think Glottolog could provide this, presumably in the [glottolog-cldf dataset]. While this dataset isn't a "just-one-table" dataset, I'd still think it is easy enough to access to make integration into a HuggingFace workflow possible." (By @xrotwang)

(For instance, as set out in the discussion above: providing the ISO 639-3 code nru for Yongning Na (Glottocode: yong1288), and the code jya for Japhug (Glottocode: japh1234).)
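Once a glottolog-cldf release is at hand, the mapping could be pulled out along these lines. This is a sketch only: the file path, branch name and column names ("Glottocode", "ISO639P3code") are assumptions based on the usual CLDF layout, so check the release you actually download.

```python
# Sketch of extracting a Glottocode -> ISO 639-3 mapping from glottolog-cldf.
# Path, branch and column names are assumptions; check the release you use.
import pandas as pd

LANGUAGES_CSV = (
    "https://raw.githubusercontent.com/glottolog/glottolog-cldf/master/cldf/languages.csv"
)

languages = pd.read_csv(LANGUAGES_CSV)
glotto_to_iso = (
    languages.dropna(subset=["ISO639P3code"])
    .set_index("Glottocode")["ISO639P3code"]
    .to_dict()
)

print(glotto_to_iso.get("yong1288"))  # expected: 'nru'
print(glotto_to_iso.get("japh1234"))  # expected: 'jya'
```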

lbourdois commented 1 year ago

Hi everybody, I wish you a very happy new year and all the best! ✨ I had planned to work on this subject again before the end of 2022 to try to move it forward, but in the end I didn't have time (the HF course took me away). Since I cannot afford to contribute to open-source projects on my personal time in 2023, I won't be much help this year either. In an effort to move things forward, let me tag @dwhitena, who works at SIL, the organization that updates the ISO 639 codes every year. I hope this will help the discussion reach a system of language tags that takes into account as many languages as possible and is maintainable over time :)

julien-c commented 1 year ago

Thanks @lbourdois!

lbourdois commented 2 months ago

Hi,

I realize that there was no message in this issue to mention that all ISO 639-3 language tags have been available on Hugging Face for about 6 months. I think this should meet the needs expressed in this issue and that it can therefore be closed.

Following the same logic, I closed the application mentioned in https://github.com/huggingface/datasets/issues/4881#issuecomment-1237949254 because it is no longer relevant. For those interested, the tags are also available in this dataset: https://huggingface.co/datasets/lbourdois/language_tags.

Hoping to see datasets from the CNRS and more widely from linguists soon on the Hugging Face Hub :)
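For anyone who wants to inspect the tag list programmatically, a minimal sketch with the `datasets` library (the splits and column names are whatever the dataset card defines, so check it before relying on specific fields):

```python
# Minimal sketch: load the tag list with the `datasets` library and inspect it.
# Splits and column names depend on the dataset card; check it first.
from datasets import load_dataset

ds = load_dataset("lbourdois/language_tags")
print(ds)  # shows the available splits and columns
```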