glottolog / glottolog-cldf

Glottolog data as CLDF StructureDataset
https://glottolog.org
Creative Commons Attribution 4.0 International

For HuggingFace: indicating closest ISO 639-3 code? #13

Closed alexis-michaud closed 1 year ago

alexis-michaud commented 1 year ago

Esteemed maintainers of Glottolog,

An opportunity to connect state-of-the-art Natural Language Processing (NLP) with linguistics in its unrestricted, 'full-diversity' mode: HuggingFace Transformers meet Glottolog?

NLP colleagues go to HuggingFace datasets to run experiments on all the language data they can lay their grubby hands on. It seems important to 'push' data from less-documented / less-studied / less-resourced / endangered languages up there, as a contribution to connecting the world of language documentation, description & conservation with the world of state-of-the-art NLP research. The stakes are high for both fields. (For anyone interested in longer reads, there's the argument of the ComputEL conference series, for instance.)

Currently, HuggingFace uses IANA language codes, not Glottolog codes. Take Japhug and Na as examples: the datasets from the Pangloss Collection have been made available as HuggingFace datasets (here). We would like to use Glottocodes to identify the language varieties of these two corpora: japh1234 for Japhug, and yong1288 for Na.


But we can't enter those in the metadata. @BenjaminGalliot had to remove the Glottocodes and provide only the closest ISO 639-3 equivalents. Glottocodes are currently confined to (i) the corpus card description, (ii) language details, and (iii) subcorpora names. (Pull request and discussion are here.)

This causes trouble for linguists, for reasons that are obvious to us but not necessarily to computer science researchers. For instance, Japhug is one of the rGyalrongic (= Jiarong, rGyalrong) languages: it does not have an Ethnologue (3-letter) code of its own. So labelling it as 'jya' (Jiarong) is under-specific.

For want of proper referencing in the metadata, finding 'Japhug' becomes really hard, defeating the purpose of the whole HF deposit: since the language name is not foregrounded in the metadata, a search for 'Japhug' among corpora returns zero results. (A search for 'Na' yields false positives such as 'Vietnamese'.) Another occasion to confirm that we really want Glottocodes!

Wouldn't it be great if Glottolog, a CIL (Cool and Internationally Leading) database of language names (and more), committed to Open Science, met HuggingFace, a CIL (Cool and Internationally Leading) group: "The AI community building the future" committed to Open Science?

Specifically, the question raised by the HF team (here) is: "is there a DB of language codes you would recommend? That would contain all ISO 639-1, 639-2 or 639-3 codes and be kept up to date, and ideally that would be accessible as a Node.js npm package?"

The ball is in your court for answering this question, no? It looks like an opening for the adoption of Glottocodes (with ISO compatibility), for the mutual benefit of NLP research and linguistics + language documentation, doesn't it? To what extent would pyglottolog fit the bill / do the job? (API documentation here.) I'm reaching the limits of my technical knowledge here: I can't assess the distance between what it offers and what the HF team needs.
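To make the question a bit more concrete, here is a minimal sketch of what a lookup with pyglottolog looks like, assuming a local clone of the glottolog data repository (the path and the exact attribute names are my assumptions based on the API documentation):

```python
# Minimal sketch of a Glottocode lookup with pyglottolog.
# Assumes `pip install pyglottolog` and a local clone of the glottolog/glottolog
# data repository; attribute names follow my reading of the API docs.
from pyglottolog import Glottolog

glottolog = Glottolog('path/to/glottolog')  # path to the data repository clone

languoid = glottolog.languoid('japh1234')
print(languoid.name)   # "Japhug"
print(languoid.level)  # language / dialect / family
print(languoid.iso)    # ISO 639-3 code, or None when there is no exact match
```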

alexis-michaud commented 1 year ago

(this issue probably belongs in pyglottolog rather than here, right?)

xrotwang commented 1 year ago

No, it belongs here, I think, because it's about an additional distribution format of Glottolog data.

I'm still not entirely sure what exactly Glottolog should provide here. Maybe glottocodes registered as BCP47 codes might have the same effect?


alexis-michaud commented 1 year ago

Glottocodes registered as BCP47 codes: indeed, that is what was recently suggested in the discussion (3rd point here): using Glottolog codes after an -x- tag in the BCP-47 format to maintain BCP-47 validity. Maybe that's all that needs to be said?
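For illustration, the private-use convention discussed there would yield tags roughly like the following (the exact tags are my own examples, not an agreed standard):

```python
# Hypothetical BCP-47 tags combining the closest ISO 639-3 code with a
# private-use (-x-) subtag carrying the Glottocode.
japhug_tag = "jya-x-japh1234"       # Japhug: no ISO code of its own; jya (Jiarong) is the closest
yongning_na_tag = "nru-x-yong1288"  # Yongning Na, Glottocode yong1288
```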

I'd still have a suggestion concerning an additional distribution format of Glottolog data tailored to the needs of computer scientists. (With apologies in case this comment is wide of the mark: I'm not an expert on computational matters.) I have a feeling that it would be useful (even though it's redundant information) to indicate the closest ISO 3-letter code equivalents for all language varieties, including "dialects".

The automatically generated list that Benjamin Galliot produced with pyglottolog (see here) lacks a 3-letter code for dialects (columns G and H in Benjamin's spreadsheet). The absence of this piece of information (currently useful for various purposes) might cause enough friction that someone who wants to do simple table lookup would go and search elsewhere for a simpler tool.

Thus, Yongning Na (yong1270) is correctly indicated in Glottolog as corresponding to the 3-letter language code nru (links to Ethnologue and OLAC are provided), but this piece of information is not copied into the lines for its two dialects, Lataddi (lata1234) and Yongning (yong1288). So the 3-letter code nru does not appear in Benjamin's pyglottolog export for the Lataddi (line 23194) and Yongning (line 23195) dialects, and the fact that both come under Ethnologue nru is not accessible through simple table lookup. To obtain this piece of information ("What is the closest ISO 639-3 code?"), one has to go through a reasoning in several steps, requiring familiarity with how Glottolog is structured: (i) realizing why the information is absent ("it's not a bug, it must be because this entry is of a subtype that does not have this information... Yes! Column N says this entry is a 'dialect', not a 'language', and dialects do not always get an ISO 639-3 code of their own in this table"); (ii) finding one's way to the higher-level grouping to which the variety at issue belongs ("this is a dialect of... let's see... yong1270! OK, let's move to the corresponding line and look up the ISO 639-3 column for that higher-level grouping, the 'language'. Here it is: nru!").
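For what it's worth, this two-step reasoning looks automatable with pyglottolog; here is a rough sketch, again assuming a local clone of the data repository and my reading of the `ancestors` and `iso` attributes (please treat it as illustrative, not authoritative):

```python
# Sketch: find the closest ISO 639-3 code for any languoid, including dialects,
# by walking up the Glottolog classification to the nearest ISO-coded ancestor.
from pyglottolog import Glottolog

glottolog = Glottolog('path/to/glottolog')  # local clone of the data repository

def closest_iso(glottocode):
    languoid = glottolog.languoid(glottocode)
    if languoid.iso:  # the entry itself has an ISO 639-3 code
        return languoid.iso
    # ancestors run from the top-level family down to the immediate parent,
    # so walk them in reverse to reach the nearest ISO-coded ancestor first.
    for ancestor in reversed(languoid.ancestors):
        if ancestor.iso:
            return ancestor.iso
    return None  # no ISO-coded ancestor at all

print(closest_iso('yong1288'))  # expected: 'nru', inherited from the language yong1270
```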

My impression (for what it's worth) is that computer scientists who want a database of language names would want the information laid out flat in the table. Looking up a Glottocode and finding the closest ISO 639-3 code in the relevant column on the same line would be cool & helpful. (In my mock-up of such a table I added the missing 3-letter codes manually, but it could be done automatically based on the information in the right-hand column, of course.)

I can't see any clear drawback in adding the closest 3-letter code for all language varieties. Then linguists (like me) could more easily push for the use of Glottocodes, with the argument that by using Glottocodes you also get the ISO 639-3 codes. (People who care about the many caveats surrounding language codes and language names can always find information & discussions elsewhere.)

Don't know if this makes any sense as seen from Kahlaische Straße 10? :)

alexis-michaud commented 1 year ago

Just in case someone from the Glottolog team feels like jumping in, the conversation on the HuggingFace repo is continuing.

alexis-michaud commented 1 year ago

I have changed the Issue title: apologies for moving the goalposts, but it seems the Hugging Face team is not so focused on getting the database accessible as a Node.js npm package. Instead, recent turns in the discussion have focused on how to retrieve the next-closest ISO 639-3 code for a given Glottocode. The idea is to use Glottocodes, and also to use the Glottocode to arrive at the nearest ISO 639-3 code. Thus, the metadata of a given dataset would contain (a sketch follows the list):

  1. a Glottocode tag (e.g. japh1234 for the Japhug language)
  2. a matching ISO 639-3 tag when there is one (for Japhug: there is no matching tag, so that field would be empty)
  3. if (2) is not provided: a 'next closest' ISO 639-3 tag when one is available (e.g. for Japhug: jya), with the caveat that (3) is not an exact match.

(Empty fields for both (2) and (3) would indicate that the language at issue is an isolate and lacks an ISO 639-3 code.)
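For illustration, the metadata for the two Pangloss corpora might then look roughly like this (the field names are placeholders of my own, not an existing HuggingFace schema):

```python
# Hypothetical dataset metadata; field names are placeholders, not an existing schema.
japhug_metadata = {
    "glottocode": "japh1234",
    "iso639_3": None,            # no exact ISO 639-3 match for Japhug
    "closest_iso639_3": "jya",   # next-closest code (Jiarong); not an exact match
}
yongning_na_metadata = {
    "glottocode": "yong1288",    # Yongning Na dialect
    "iso639_3": None,            # dialects have no ISO 639-3 code of their own
    "closest_iso639_3": "nru",   # inherited from the language-level entry yong1270
}
```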

alexis-michaud commented 1 year ago

As noted in the conversation on the HuggingFace repo, I appreciate that, from the perspective of Glottolog, it may appear unnecessary and potentially misleading to provide next-closest ISO 639-3 codes: additional work with no obvious gain. It could help with interoperability, though. So I make bold to mention the discussion to the team. Apologies for the 'noise' in case this is irrelevant. All best wishes

xrotwang commented 1 year ago

I'll be back from vacation next week and will try to answer then.


xrotwang commented 1 year ago

@alexis-michaud with "next-closest ISO 639-3 code" you mean an ISO code of scope "individual", right? Otherwise, there could be multiple closest ISO codes, of dubious value. If so, I think Glottolog could provide this, presumably in the glottolog-cldf dataset. While this isn't a "just-one-table" dataset, I'd still think it is easy enough to access to make integration into a HuggingFace workflow possible.
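To gauge how easy that access might be, here is a sketch using pycldf to read the languages table of this dataset (the metadata path and the column/property names are my assumptions about the CLDF schema, not confirmed details):

```python
# Sketch: read the glottolog-cldf LanguageTable with pycldf and build a
# Glottocode -> ISO 639-3 lookup. Paths and property names are assumptions.
from pycldf import Dataset

ds = Dataset.from_metadata('glottolog-cldf/cldf/cldf-metadata.json')

iso_by_glottocode = {}
for row in ds.iter_rows('LanguageTable', 'id', 'iso639P3code'):
    if row['iso639P3code']:
        iso_by_glottocode[row['id']] = row['iso639P3code']

print(iso_by_glottocode.get('yong1270'))  # expected: 'nru'
# Getting from a dialect (e.g. yong1288) to its language-level parent would
# additionally require the classification information in the dataset.
```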

alexis-michaud commented 1 year ago

Exactly, that is the idea: an ISO code of scope "individual". Great to hear that it is feasible and not hopelessly difficult.