XachaB commented 3 years ago

Hi,

Background: I am writing the methods section for the sound correspondences paper. I evaluated the simple "cognate detection" method we used against the expert annotations in Lexicore datasets. The results are very poor (precision 59%, recall 34%). Since lingpy already has a class to do cognate detection properly (LexStat), I want to refactor the lexitools correspondences code to use it instead. Otherwise, I can't trust the results.
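
For readers unfamiliar with the figures above, one common way to score automatic cognate detection against expert annotation is pairwise precision and recall. The sketch below is only an illustration under assumptions: the thread does not say how the scores were computed, and the `expert` and `detected` mappings (word ID to cognate set ID) are a hypothetical layout. In practice, pairs would normally be compared within one concept only.

```python
from itertools import combinations


def cognate_pairs(cogids):
    """Return the set of word-ID pairs that share a cognate set ID.

    `cogids` maps a word ID to its cognate set ID (hypothetical layout).
    """
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(cogids), 2)
        if cogids[a] == cogids[b]
    }


def pairwise_precision_recall(expert, detected):
    """Compare automatically detected cognate pairs against expert pairs."""
    gold = cognate_pairs(expert)
    found = cognate_pairs(detected)
    true_positives = len(gold & found)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the expert groups words 1-3 together, the method groups 1-2 and 3-4.
expert = {1: "A", 2: "A", 3: "A", 4: "B"}
detected = {1: "x", 2: "x", 3: "y", 4: "y"}
precision, recall = pairwise_precision_recall(expert, detected)
```

On this toy data, one of the two detected pairs is correct and one of the three expert pairs is recovered.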

Problem:

I read the docs: LexStat expects a qlc filename as input.

However, I am reading several CLDF datasets, combining them per genus, and searching for cognates in each. I would need to pass lists of rows from different CLDF datasets (or some object wrapping dataset rows).

I will also need to disregard LexStat output whenever there was actually some expert annotation, since for each genus, some data will come from a single dataset with cognacy judgments, and some data will have been aggregated across datasets. While this seems simple (for each pair, check whether it was annotated), I would like to be sure that there isn't a catch that would make this setup a bad idea.
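
The per-pair check described above could look roughly like the following. This is a minimal sketch under assumptions, not the lexitools code: `expert` maps a word ID to a hypothetical `(dataset, cognate_id)` tuple, `automatic` maps a word ID to the cognate ID assigned by the automatic method, and expert IDs are assumed to be comparable only within one dataset.

```python
def pair_is_cognate(a, b, expert, automatic):
    """Decide cognacy for one word pair, preferring expert judgments.

    `expert` maps word ID -> (dataset, cognate_id) for annotated words only.
    `automatic` maps word ID -> automatically assigned cognate ID.
    """
    expert_a = expert.get(a)
    expert_b = expert.get(b)
    # Use the expert judgment only if both words are annotated in the
    # same dataset, since cognate IDs differ in meaning across datasets.
    if expert_a is not None and expert_b is not None and expert_a[0] == expert_b[0]:
        return expert_a[1] == expert_b[1]
    # Otherwise fall back on the automatic result.
    return automatic[a] == automatic[b]


# Toy data with made-up dataset names.
expert = {1: ("ds1", 10), 2: ("ds1", 10), 3: ("ds2", 10)}
automatic = {1: 1, 2: 2, 3: 1, 4: 1}
```

One possible catch shows up even in this toy data: words 1 and 2 are cognate (expert), words 1 and 3 are cognate (automatic), but words 2 and 3 are not (automatic). Pairwise decisions mixed this way need not form consistent cognate sets, so a later clustering step may be needed.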

Questions:

Is there an existing way to do what I need ?
Does LexStat scale well on large datasets ?
Is it ok to disregard some, but not all, of LexStat output (only when the pair has expert annotation) ?

Notes:

Ideally, I would prefer not to have to write a few enormous csv files to disk each time the code is run, changing headers from CLDF to LexStat conventions, then read them again, etc.
I saw Wordlist can take a cldf metadata path as its input, but I have a mix of rows from several datasets, and I need a LexStat object, not a Wordlist.
Maybe I need to subclass LexStat ?
The "cognate detection" method which was not good enough was: align the sound class sequences, then keep all pairs of words above a similarity threshold (the threshold grows with syllable length).
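
To make the last note concrete, here is a rough stand-in for that kind of thresholded pairwise comparison. It is a sketch under loud assumptions and not the original lexitools code: it uses `difflib.SequenceMatcher` in place of a real sound-class alignment, string length in place of syllable length, and made-up threshold values.

```python
from difflib import SequenceMatcher
from itertools import combinations


def length_threshold(length, base=0.5, step=0.05, cap=0.9):
    """Similarity threshold that grows with word length (made-up values)."""
    return min(cap, base + step * length)


def similar_pairs(words):
    """Return ID pairs whose sound-class strings are similar enough.

    `words` maps a word ID to a sound-class string such as "TATAR".
    """
    pairs = []
    for a, b in combinations(sorted(words), 2):
        # difflib's ratio is a stand-in for an alignment-based similarity.
        similarity = SequenceMatcher(None, words[a], words[b]).ratio()
        length = min(len(words[a]), len(words[b]))
        if similarity >= length_threshold(length):
            pairs.append((a, b))
    return pairs


words = {1: "TATAR", 2: "TATER", 3: "KUP"}
pairs = similar_pairs(words)
```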

@LinguList Do you have a solution ?

LinguList commented 3 years ago

We're currently working on a new package, called cltoolkit, which reads in a CLDF dataset, runs all kinds of consistency checks regarding lexibank, and also allows converting to a lingpy wordlist, thereby merging data from different CLDF datasets and allowing filtering as well.

This is probably the way to work.

You work with languages on something like a genus level, right? If so lexstat won't have a problem in scaling up.

We're still refactoring cltoolkit, but should be done in one or two weeks. Should we then share the relevant code with you? I assume we'll then also make the code public rather quickly, so it can be used as a normal dependency. Since cltoolkit checks on loading whether segments conform to CLTS, it is probably also useful for the code on sound correspondence detection.

XachaB commented 3 years ago

Thanks ! The package you mention sounds like what I need, and I am very interested in the code.

I am a little bit worried that the sound correspondence project keeps proving Hofstadter's law right: each time I think I have something ready, it ends up requiring a few more weeks of work. Do you think the "one or two weeks" is an optimistic estimate (and it might be a month or two instead) or will there definitely be usable code in a week or two ?

If it's the former, maybe in the meantime I should just write a quick thing that does the work (in the worst case scenario, writing and reading from a bunch of temporary files), so that I can make progress.

LinguList commented 3 years ago

@xrotwang is now checking the code to make sure that fundamental problems do not occur, and we'll use the package as a backbone for our lexibank study, which we plan to submit soon.

You can check it already now, but it is still private, which is why I'd recommend waiting.

But to run a LexStat analysis on the cognate sets of more than one CLDF dataset, interim code can be as simple as:

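from lingpy import Wordlist

# paths2cldf must be a list of paths to CLDF metadata files (defined by the caller).
# Pairs of (CLDF column, LingPy column) shared by all datasets.
namespace = (("id", "lexibank_id"), ("language_id", "doculect"), ("concept_concepticon_gloss", "concept"), ("segments", "tokens"), ("language_glottocode", "glottocode"))
# Wordlist dictionary format: key 0 holds the header, other integer keys hold rows.
D = {0: [x[1] for x in namespace]}
idx = 1
for path2cldf in paths2cldf:
    wl = Wordlist.from_cldf(path2cldf, columns=[x[0] for x in namespace], namespace=namespace)
    for idx_ in wl:
        # Skip rows that have no concept gloss.
        if wl[idx_, "concept"]:
            D[idx] = [wl[idx_, col[1]] for col in namespace]
            idx += 1
wl = Wordlist(D)
print(wl.height, wl.width)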

LinguList commented 3 years ago

Just tested this with:

paths2cldf = ["allenbai/cldf/cldf-metadata.json", "wangbai/cldf/cldf-metadata.json"]

XachaB commented 3 years ago

Thanks for these tips ! If I need a LexStat instance rather than a Wordlist, can I do LexStat.from_cldf instead ?

LinguList commented 3 years ago

I recommend doing that afterwards: you call lex = LexStat(D). Since LexStat does internal conversions, it is faster to load the data this way. You can also use the code to group by families, and the like, of course: the dictionary representation is the internal representation of a LingPy Wordlist, so you fill it as I have shown, and you can then use it to initialize LexStat, Wordlist, and any other class derived from Wordlist.

XachaB commented 3 years ago

Ah great, I did not see that LexStat could be initialized from a dictionary of rows ! That is perfect. Looking at the source code it seemed to always need only a file path.

I think this is enough so that I can start writing something: load datasets using Wordlist.from_cldf, do any filtering and transformations I need, construct dictionaries for each genus, then initialize LexStat with the dicts, etc.
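
The "construct dictionaries for each genus" step might be sketched as follows. This is an illustration under assumptions: `rows` (dicts keyed by the header columns) and `group_of` (glottocode to genus or family label) are hypothetical inputs, and only the dictionary layout itself (key 0 for the header, integer keys for rows) comes from the snippet earlier in this thread.

```python
from collections import defaultdict

HEADER = ["doculect", "concept", "tokens", "glottocode"]


def dicts_per_group(rows, group_of):
    """Build one wordlist-style dictionary per group (genus or family).

    Key 0 of each dictionary holds the header; keys 1, 2, ... hold rows.
    """
    groups = defaultdict(lambda: {0: list(HEADER)})
    for row in rows:
        group = group_of.get(row["glottocode"])
        if group is None or not row["concept"]:
            continue  # skip rows with no known group or no concept
        table = groups[group]
        # len(table) is the next free integer key, since key 0 is the header.
        table[len(table)] = [row[col] for col in HEADER]
    return dict(groups)


# Toy data with made-up glottocodes and group labels.
rows = [
    {"doculect": "LangA", "concept": "hand", "tokens": ["h", "a", "n", "d"], "glottocode": "aaaa1111"},
    {"doculect": "LangB", "concept": "hand", "tokens": ["h", "a", "n", "t"], "glottocode": "bbbb2222"},
    {"doculect": "LangC", "concept": "hand", "tokens": ["m", "a"], "glottocode": "cccc3333"},
]
group_of = {"aaaa1111": "GroupX", "bbbb2222": "GroupX"}
tables = dicts_per_group(rows, group_of)
```

Each value of `tables` would then be passed to LexStat in the way suggested above.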

XachaB commented 3 years ago

As a consequence of finding that cognate detection in our previous setup was insufficient to be able to claim anything, I have made a big update to the correspondence code. It is now in the branch SdCorrespWithLexStat.

Changes:

I dropped the debatable "genus" level and work on a per-family basis,
I use Lingpy's Lexstat & Partial for cognate detection,
I use Lingrex to consolidate sites into correspondence patterns,
The data is read using lexibank_analysed in order to get the "official" lexicore list of datasets. Installing datasets with cldfbench download cldfbench_lexibank_analysed.py before running the sound correspondence script is necessary (and more convenient than the previous method),
The coarsening class and mapping were updated for compatibility with clts v2.1.0
The output is not pairwise correspondences, but full correspondences based on multiple alignments
I did a lot of refactoring to make the new process as clear and light as possible (in particular on memory).

I will very soon run this on all of lexicore on a server (it has gotten too slow to run on my machine), and should be back with some news when I have a first result. Once I'm sure it runs on everything smoothly, I'll make a PR, no need to inspect the code before then.

@LinguList , @SimonGreenhill , @erichround , all my apologies for taking so much time to make progress on this: unfortunately, this is quite a large amount of work, and I can only work on it part of the time.

I am much, much, more familiar with Lingpy now, which is always a nice perk ;)

LinguList commented 3 years ago

Sounds cool, I have wanted to know for a long time how this works, especially when including the sound correspondence patterns to look for the most stable / most frequently recurring patterns :)

SimonGreenhill commented 3 years ago

Awesome! btw, let me know if you want to run it on the jena cluster. Looking forward to some results :)

XachaB commented 3 years ago

Thanks. For now I'm trying to run it on the Surrey cluster, maybe I'll ask if/when I use up all my allocated resources there !

Right now I keep blowing up the memory I request (64G over 10 CPU is still not enough). I'll see if I can either raise the memory request, or lower the number of CPUs, or maybe rewrite the thing to be parallel in a dumber way, that is to say, first export a csv for each family, then run an entirely separate process on each family.
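
The "dumber" parallel setup described above could be sketched like this. It is only an illustration under assumptions: the `family` column name and the worker `script` are hypothetical, family names are assumed to be safe as file names, and the loop runs the processes one after another (on a cluster, each call would more likely become a separate job).

```python
import csv
import subprocess
import sys
from collections import defaultdict
from pathlib import Path


def export_family_csvs(rows, out_dir):
    """Write one CSV per family and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    by_family = defaultdict(list)
    for row in rows:
        by_family[row["family"]].append(row)
    paths = []
    for family, family_rows in sorted(by_family.items()):
        path = out_dir / f"{family}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(family_rows[0]))
            writer.writeheader()
            writer.writerows(family_rows)
        paths.append(path)
    return paths


def run_per_family(paths, script):
    """Run `script` once per CSV, each in a fresh Python process."""
    for path in paths:
        # Memory used for one family is released when its process exits.
        subprocess.run([sys.executable, str(script), str(path)], check=True)
```

Because each family runs in its own process, peak memory is bounded by the largest single family instead of by everything loaded at once.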

I unfortunately have to time box this to Fridays (...and weekends), so I should only get back to it in a week.

XachaB commented 3 years ago

I am having a few small issues with Glottolog; maybe one of you, @LinguList @SimonGreenhill, has the answer ?

1. Trusting language information in the absence of glottocode ?

So far I have been entirely ignoring languages given without glottocodes:

https://github.com/lexibank/lexitools/blob/d0bdd1a4b2ff7b7fe6b64ecd490207d7b64ff876/src/lexitools/commands/correspondences.py#L293

But I am starting to understand that this is a really frequent occurrence, and it seems like a shame to throw out so much data.

The first alternative would be to keep those languages whenever there is a "Family" column in the language table. However, I thought the point of having glottocodes in cldf was to be certain that these types of info are properly standardized, and by using whatever is given as "Family", I open the door to non-meaningful variation. I can always normalize case, but we know that it is still possible that this would lead to duplication of families etc.

Another possibility would be to produce a list of all languages with missing glottocodes and put some annotator effort (if you still have time for people doing this) towards filling them in. However, I suspect that in many cases the languages might not be mappable to glottocodes. In that case, what is the best solution ?

What is your point of view on this ?

2. What is this "BookKeeping" family in glottolog ?

Glottolog has a family called "BookKeeping":

https://glottolog.org/resource/languoid/id/book1242

I imagine this is some meta-info for glottolog maintainers ? But why is it given as a family ? For example, https://github.com/lexibank/chindialectsurvey has several varieties classified as "BookKeeping" (in fact Sino-Tibetan), and this false family leads my script to attempt to find cognates in a set of unrelated languages (while keeping them apart from the rest of their families).

Any idea how to deal with this ? Is the info hidden elsewhere ? Should I fall back on whatever the dataset specifies (see problem in 1.) in that case ?

xrotwang commented 3 years ago

Ad 1) I'd much prefer putting in some effort to fill the gaps - i.e. assign (or even mint) missing glottocodes. I would want the intuition "no Glottocode => unreliable language info" to be valid.

Ad 2) "Bookkeeping" is a drawer for putative languages in Glottolog that turned out not to be "real". Since Glottolog has a policy of never expiring language-level Glottocodes, these need to be put somewhere. Now, the way Glottolog (technically) groups languages is via file-system directories, and this mechanism is used for both pseudo-families and families. This is somewhat unfortunate, but might be alleviated with more docs and some help from pyglottolog, see https://pyglottolog.readthedocs.io/en/latest/languoids.html#pyglottolog.languoids.Languoid.category

chrzyki commented 3 years ago

Hi @XachaB,

thanks for your work on these tools and analyses.

Regarding 1:

I'd be interested to learn how many cases like this (i.e. no Glottocodes) come up for your particular configuration? Also, I'm not sure I fully understand how a case where a Family exists but a Glottocode does not could happen? Maybe I'm misunderstanding something here. At any rate, I could certainly help with fixing Glottocode issues.

Regarding 2:

Bookkeeping is also explained in some detail here. (Just saw Robert's comment, so I'm keeping this short)

Regarding cluster:

If you'd like I could also help with setting this up here. Plenty of RAM to work with.

XachaB commented 3 years ago

Thanks for your quick answers !

1a. I can easily generate a table of all languages & datasets which are missing glottocodes, will do this Friday if I can. Maybe this can be a first step towards fixing them.

1b. @chrzyki Regarding how I might be able to recover Family where there is no Glottocode: I mean that the language table sometimes specifies a Family for each language, so I could get this information without querying glottolog. For example, there are many rows without glottocodes but with a family here:

https://github.com/lexibank/chindialectsurvey/blob/master/cldf/languages.csv

Of course, this is really not as good as getting the family from glottolog, which ensures that everything is standardized.

2) So, does that mean I should simply ignore all of the data which is in "bookkeeping" ? I understand that these languoids are not seen as real by Glottolog editors (and maybe the wider linguistic community), but we do have lexibank datasets with these languoids. For example, datasets such as joophonosemantic, gaotb, chindialectsurvey, servamalagasy have languages classified as BookKeeping.

Re: server, thanks for the offer, I'll definitely follow up on it if I need !

xrotwang commented 3 years ago

Yes, I think data for "bookkeeping" languages should be ignored. Ideally, it should be possible (and - again ideally - not too hard) to figure out matching non-bookkeeping glottocodes, because often there are associated ISO change requests which recommend remedies.

XachaB commented 3 years ago

Noted, then I'll make a list of both languages without glottocodes & languages with bookkeeping codes, as both require manual annotations.
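
A list like this could be produced by labelling each language row with the reason it is skipped. The sketch below is a hedged illustration, not the lexitools code: `families` is a hypothetical lookup table (glottocode to top-level family name, with None for languoids without a family) that would have to be built from Glottolog beforehand, and the last toy row uses a made-up glottocode and family.

```python
def glottocode_problem(glottocode, families):
    """Return a problem label for a language row, or None if it is usable.

    `families` maps a known glottocode to its top-level family name, with
    None for languoids that have no family (hypothetical lookup table).
    """
    if not glottocode:
        return "no glottocode"
    if glottocode not in families:
        return "unknown glottocode"
    family = families[glottocode]
    if family is None:
        return "no family"
    if family.lower() == "bookkeeping":
        return "bookkeeping"
    return None


# Toy lookup table and (dataset, glottocode) rows.
families = {"wela1234": "Bookkeeping", "kusu1250": None, "abcd1234": "SomeFamily"}
rows = [
    ("chindialectsurvey", "wela1234"),
    ("aaleykusunda", "kusu1250"),
    ("dunnaslian", "sema1250"),
    ("bdpa", ""),
    ("example", "abcd1234"),
]
report = [(dataset, code, glottocode_problem(code, families)) for dataset, code in rows]
```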

I have to say, I find it exciting that writing analyses at this larger scale helps improve the individual datasets. :)

SimonGreenhill commented 3 years ago

Agree with everything above, happy to hunt for glottocodes too.

Re bookkeeping -- I wonder how many of these can be 'fixed' (i.e. they are languages we have data from so they're not spurious)

XachaB commented 3 years ago

Here I am with a list of datasets with missing glottocodes, or glottocodes with some issues. For the full list see the attached file.

In total this amounts to 424 languages which currently are ignored by the lexitools/correspondences tool.

Here it is split by problem:

Bookkeeping languoids

dataset	glottocode	language_ID	language_Name	ISO639P3code
chindialectsurvey	wela1234	RawngtuWeilong-A	Rawngtu Weilong	weu
chindialectsurvey	wela1234	RawngtuRamtim-A	Rawngtu Ramtim	weu
gaotb	yuan1242	MojiangYi	Yi (Mojiang)	yym
gaotb	naxi1246	LijangNaxi	Naxi (Lijiang)	nbf
gaotb	naxi1246	YongningNaxi	Naxi (Yongning)	nbf
johanssonsoundsymbolic	lenc1244	Lencasalvador	Lenca-Salvador	len
johanssonsoundsymbolic	sana1281	Sanapanaangaite	Sanapaná (Angaité)	sap
joophonosemantic	chua1256	ChuanqiandianClusterMiao	Chuanqiandian Cluster Miao	cqd
servamalagasy	sout3125	BetsimisarakaMarolambo	Betsimisaraka	bjq

The languoid is not known by pyglottolog

dataset	glottocode	language_ID	language_Name	ISO639P3code
dunnaslian	sema1250	Semnam_Malau	Semnam Malau	ssm
dunnaslian	teim1246	Temiar_Perak	Temiar Perak	tea
dunnaslian	monn1258	Mon	Mon	mnw
kesslersignificance	nucl1201	Turkish	Turkish
polyglottaafricana	maka1261	MakhuwaMeetto	Makhuwa-Meetto
saenkoromance	vall1248	valladerromansh	Vallader_Romansh
saenkoromance	cagl1238	campidanese	Campidanese
servamalagasy	meri1291	MerinaMaevatanana	Merina
sidwellbahnaric	kass1248	kasseng	Kasseng
transnewguineaorg	cent2257	proto-central-sogeram	Proto-Central-Sogeram
zgraggenmadang	sali1249	maia-saki	maia-saki
zgraggenmadang	para1207	parawen	parawen

see issue https://github.com/lexibank/lexibank-analysed/issues/35 -- for some it might be a question of re-generating after accepting my merge requests.

The languoid family is `None`

dataset	glottocode	language_ID	language_Name	ISO639P3code
aaleykusunda	kusu1250	KusundaGM	Gyani Maiya	kgg
aaleykusunda	kusu1250	KusundaK	Kamala	kgg
aaleykusunda	kusu1250	Kusunda	Kusunda	kgg
abrahammonpa	hrus1242	HrusoAkaJamiri	Hruso Aka Jamiri	hru
chaconcolumbian	cams1241	Kamsa	kamsá	kbh
chaconcolumbian	puin1248	Puinave	puinave	pui
chaconcolumbian	paez1247	Paez	páez	pbb
chacontukanoan	tuca1253	Prototucanoan	Proto-Tucanoan
hantganbangime	bang1363	Bangime	Bangime	dba
hubercolumbian	cams1241	Kamsa	Kamsá	kbh
hubercolumbian	puin1248	Puinave	Puinave	pui
hubercolumbian	paez1247	Paez	Páez	pbb
johanssonsoundsymbolic	abun1252	Abun	Abun	kgr
johanssonsoundsymbolic	alse1251	Alsea	Alsea	aes
johanssonsoundsymbolic	anda1286	Andaqui	Andaqui	ana
johanssonsoundsymbolic	atak1252	Atakapa	Atakapa	aqp
johanssonsoundsymbolic	bang1363	Bangime	Bangime	dba
johanssonsoundsymbolic	basq1248	Basque	Basque	eus
johanssonsoundsymbolic	bert1248	Berta	Berta	wti
johanssonsoundsymbolic	buru1296	Burushaski	Burushaski	bsk
johanssonsoundsymbolic	cand1248	CandoshiShapra	Candoshi-Shapra	cbu
johanssonsoundsymbolic	cayu1262	Cayuvava	Cayuvava	cyb
johanssonsoundsymbolic	cofa1242	Cofan	Cofán	con
johanssonsoundsymbolic	cuit1236	Cuitlatec	Cuitlatec	cuy
johanssonsoundsymbolic	esse1238	Esselen	Esselen	esq
johanssonsoundsymbolic	fasu1242	Fasu	Fasu	faa
johanssonsoundsymbolic	gaga1251	Gagadu	Gagadu	gbu
johanssonsoundsymbolic	puel1244	Gununakune	Gününa Küne	pue
johanssonsoundsymbolic	hadz1240	Hadza	Hadza	hts
johanssonsoundsymbolic	hrus1242	Hruso	Hruso	hru
johanssonsoundsymbolic	iton1250	Itonama	Itonama	ito
johanssonsoundsymbolic	kano1245	Kanoe	Kanoê	kxo
johanssonsoundsymbolic	karo1304	Karok	Karok	kyh
johanssonsoundsymbolic	klam1254	Klamath	Klamath	kla
johanssonsoundsymbolic	kunz1244	Kunza	Kunza	kuz
johanssonsoundsymbolic	kuot1243	Kuot	Kuot	kto
johanssonsoundsymbolic	kwaz1243	Kwaza	Kwaza	xwa
johanssonsoundsymbolic	lavu1241	Lavukaleve	Lavukaleve	lvk
johanssonsoundsymbolic	lule1238	Lule	Lule	ule
johanssonsoundsymbolic	maib1239	Maybrat	Maybrat	ayz
johanssonsoundsymbolic	mose1249	Moseten	Mosetén	cas
johanssonsoundsymbolic	movi1243	Movima	Movima	mzp
johanssonsoundsymbolic	muni1258	Muniche	Muniche	myr
johanssonsoundsymbolic	nara1262	Nara	Nara	nrb
johanssonsoundsymbolic	natc1249	Natchez	Natchez	ncz
johanssonsoundsymbolic	bira1253	Ongota	Ongota	bxe
johanssonsoundsymbolic	paez1247	Paez	Páez	pbb
johanssonsoundsymbolic	pele1245	PeleAta	Pele-Ata	ata
johanssonsoundsymbolic	pira1253	Piraha	Pirahã	myp
johanssonsoundsymbolic	puin1248	Puinave	Puinave	pui
johanssonsoundsymbolic	pume1238	Pume	Pumé	yae
johanssonsoundsymbolic	sali1253	Salinan	Salinan	sln
johanssonsoundsymbolic	sand1273	Sandawe	Sandawe	sad
johanssonsoundsymbolic	savo1255	Savosavo	Savosavo	svs
johanssonsoundsymbolic	seri1257	Seri	Seri	sei
johanssonsoundsymbolic	shom1245	ShomPeng	Shom Peng	sii
johanssonsoundsymbolic	sulk1246	Sulka	Sulka	sua
johanssonsoundsymbolic	sume1241	Sumerian	Sumerian	sux
johanssonsoundsymbolic	take1257	Takelma	Takelma	tkm
johanssonsoundsymbolic	taus1253	Taushiro	Taushiro	trr
johanssonsoundsymbolic	timu1245	Timucua	Timucua	tjm
johanssonsoundsymbolic	tiwi1244	Tiwi	Tiwi	tiw
johanssonsoundsymbolic	trum1247	Trumai	Trumai	tpy
johanssonsoundsymbolic	tuni1252	Tunica	Tunica	tun
johanssonsoundsymbolic	urar1246	Urarina	Urarina	ura
johanssonsoundsymbolic	wage1238	Wageman	Wageman	waq
johanssonsoundsymbolic	waor1240	Waorani	Waorani	auc
johanssonsoundsymbolic	wara1303	Warao	Warao	wba
johanssonsoundsymbolic	wash1253	Washo	Washo	was
johanssonsoundsymbolic	yama1264	Yamana	Yámana	yag
johanssonsoundsymbolic	yana1271	Yana	Yana	ynn
johanssonsoundsymbolic	yele1255	Yele	Yele	yle
johanssonsoundsymbolic	yura1255	Yuracare	Yuracaré	yuz
johanssonsoundsymbolic	zuni1245	Zuni	Zuni	zun
joophonosemantic	basq1248	Basque	Basque	eus
joophonosemantic	buru1296	Burushaski	Burushaski	bsk
joophonosemantic	paez1247	Paez	Páez	pbb
joophonosemantic	sand1273	Sandawe	Sandawe	sad
joophonosemantic	wara1303	Warao	Warao	wba
joophonosemantic	maib1239	MaiBrat	Mai Brat	ayz
northeuralex	buru1296	bsk	Burushaski	bsk
northeuralex	basq1248	eus	Basque	eus
pharaocoracholaztecan	utoa1244	ProtoUtoAztecan	PUA
transnewguineaorg	abun1252	abun	Abun	kgr
transnewguineaorg	abun1252	abun-jembun	Abun (Jembun Dialect)	kgr
transnewguineaorg	abun1252	abun-senopi	Abun (Senopi Dialect)	kgr
transnewguineaorg	boga1247	bogaya	Bogaya	boq
transnewguineaorg	burm1264	burmeso	Burmeso	bzu
transnewguineaorg	dama1272	damal	Damal	uhn
transnewguineaorg	demm1245	dem	Dem	dem
transnewguineaorg	dibi1240	dibiyaso	Dibiyaso	dby
transnewguineaorg	duna1248	duna	Duna	duc
transnewguineaorg	else1239	elseng	Elseng	mrf
transnewguineaorg	fasu1242	fasu	Fasu	faa
transnewguineaorg	kaki1249	kaki-ae	Kaki Ae	tbd
transnewguineaorg	kapo1250	kapauri	Kapauri	khp
transnewguineaorg	kehu1238	keuw	Keuw	khh
transnewguineaorg	kibi1239	kibiri	Kibiri	prm
transnewguineaorg	kolp1236	kol	Kol	kol
transnewguineaorg	kuot1243	kuot	Kuot	kto
transnewguineaorg	lavu1241	lavukaleve	Lavukaleve	lvk
transnewguineaorg	maib1239	mai-brat	Mai Brat	ayz
transnewguineaorg	mawe1251	mawes	Mawes	mgk
transnewguineaorg	maib1239	maybrat	Maybrat	ayz
transnewguineaorg	touo1238	mbaniata	Mbaniata	tqu
transnewguineaorg	bilu1245	mbaniata-lokuru	Mbaniata (Lokuru Dialect)	blb
transnewguineaorg	bilu1245	mbilua	Mbilua	blb
transnewguineaorg	bilu1245	mbilua-ndovele	Mbilua (Ndovele Dialect)	blb
transnewguineaorg	molo1262	molof	Molof	msl
transnewguineaorg	morb1239	mor	Mor	moq
transnewguineaorg	moro1289	morori	Morori	mok
transnewguineaorg	mpur1239	mpur	Mpur	akc
transnewguineaorg	mpur1239	mpur-arfu	Mpur (Arfu Dialect)	akc
transnewguineaorg	mpur1239	mpur-kebar	Mpur (Kebar Dialect)	akc
transnewguineaorg	yale1246	nagatiman	Nagatiman	nce
transnewguineaorg	fasu1242	namumi	Fasu (Namumi Dialect)	faa
transnewguineaorg	odia1239	odiai	Odiai	bhf
transnewguineaorg	papi1255	papi	Papi	ppe
transnewguineaorg	pawa1255	pawaia	Pawaia	pwa
transnewguineaorg	nucl1580	proto-eleman	Proto-Eleman
transnewguineaorg	koia1260	proto-koiarian	Proto-Koiarian
transnewguineaorg	kwal1257	proto-kwalean	Proto-Kwalean
transnewguineaorg	lake1255	proto-lakes-plain	Proto-Lakes-Plain
transnewguineaorg	lowe1437	proto-lower-sepik	Proto-Lower-Sepik
transnewguineaorg	manu1261	proto-manubaran	Proto-Manubaran
transnewguineaorg	nduu1242	proto-ndu	Proto-Ndu
transnewguineaorg	nucl1709	proto-trans-new-guinea	Proto-Trans-New-Guinea
transnewguineaorg	pura1257	purari	Purari	iar
transnewguineaorg	pyuu1245	pyu	Pyu	pby
transnewguineaorg	saus1247	sause	Sause	sao
transnewguineaorg	savo1255	savosavo	Savosavo	svs
transnewguineaorg	tabo1241	tabo	Tabo	knv
transnewguineaorg	tana1288	tanahmerah	Tanahmerah	tcm
transnewguineaorg	usku1243	usku	Usku	ulf
transnewguineaorg	wiru1244	wiru	Wiru	wiu
transnewguineaorg	yetf1238	yetfa	Yetfa	yet
utoaztecan	coah1252	Coahuilteco	Coahuilteco	xcw
utoaztecan	coto1248	Cotaname	Cotaname	xcn
utoaztecan	kara1289	Karankawa	Karankawa	zkk
utoaztecan	kere1287	ProtoKeresan	Proto-Keresan
utoaztecan	zuni1245	Zuni	Zuni	zun

The language table does not give any glottocode

dataset	language_ID	language_Name	ISO639P3code
backstromnorthernpakistan	Shimshal	Shimshal
backstromnorthernpakistan	Chapursan	Chapursan
backstromnorthernpakistan	Gupis	Gupis
backstromnorthernpakistan	Gahorabad	Gahorabad
backstromnorthernpakistan	DashkinAstor	Dashkin (Astor)
backstromnorthernpakistan	KachuraJel	Kachura (Jel)
backstromnorthernpakistan	Gultari	Gultari
bdpa	Chimborazo	Chimborazo
bdpa	Tena	Tena
bdpa	Inkawasi	Inkawasi
bdpa	Cajamarca	Cajamarca
bdpa	Corongo	Corongo
bdpa	Caraz	Caraz
bdpa	Chavin	Chavín
bdpa	Huancayo	Huancayo
bdpa	Huancavelica	Huancavelica
bdpa	Cuzco	Cuzco
bdpa	Puno	Puno
bdpa	Taquile	Taquile
bdpa	Apolobamba	Apolobamba
bdpa	Cochabamba	Cochabamba
bdpa	Sucre	Sucre
bdpa	Kawki	Kawki
bdpa	Jaqaru	Jaqaru
bdpa	Huancane	Huancané
bdpa	Tiwanaku	Tiwanaku
bdpa	Oruro	Oruro
bdpa	Dashi	Dàshí
bdpa	Gongxing	Gōngxìng
bdpa	Jinxing	Jīnxīng
bdpa	Mazhelong	Mǎzhělóng
bdpa	AmericanEnglish	American English
bdpa	CanadianEnglish	Canadian English
bdpa	CentralGermanCologne	Central German (Cologne)
bdpa	CentralGermanHonigberg	Central German (Honigberg)
bdpa	CentralGermanLuxembourg	Central German (Luxembourg)
bdpa	CentralGermanMurrhardt	Central German (Murrhardt)
bdpa	Danish	Danish
bdpa	DutchAntwerp	Dutch (Antwerp)
bdpa	BelgianDutch	Belgian Dutch
bdpa	DutchLimburg	Dutch (Limburg)
bdpa	DutchOstend	Dutch (Ostend)
bdpa	Dutch	Dutch
bdpa	NewZealandEnglishAuckland	New Zealand English (Auckland)
bdpa	EnglishBuckie	English (Buckie)
bdpa	IndianEnglishDelhi	Indian English (Delhi)
bdpa	NigerianEnglishIgbo	Nigerian English (Igbo)
bdpa	SouthAfricanEnglishJohannisburg	South African English (Johannisburg)
bdpa	EnglishLindisfarne	English (Lindisfarne)
bdpa	EnglishLiverpool	English (Liverpool
bdpa	EnglishLondon	English (London
bdpa	EnglishNorthCarolina	English (North Carolina)
bdpa	AustralianEnglishPerth	Australian English (Perth)
bdpa	EnglishSingapore	English (Singapore)
bdpa	English	English
bdpa	EnglishTyrone	English (Tyrone)
bdpa	Faroese	Faroese
bdpa	German	German
bdpa	HighGermanNorthAlsace	High German (North Alsace)
bdpa	HighGermanBiel	High German (Biel)
bdpa	HighGermanBodensee	High German (Bodensee)
bdpa	HighGermanGraubuenden	High German (Graubuenden)
bdpa	HighGermanHerrlisheim	High German (Herrlisheim)
bdpa	HighGermanOrtisei	High German (Ortisei)
bdpa	HighGermanTuebingen	High German (Tuebingen)
bdpa	HighGermanWalser	High German (Walser)
bdpa	Icelandic	Icelandic
bdpa	LowGermanAchterhoek	Low German (Achterhoek)
bdpa	LowGermanBargstedt	Low German (Bargstedt)
bdpa	NorwegianStavanger	Norwegian (Stavanger)
bdpa	Scottish	Scottish
bdpa	SwedishSkane	Swedish (Skane)
bdpa	SwedishStockholm	Swedish (Stockholm)
bdpa	WestFrisianGrou	West Frisian (Grou)
bdpa	YiddishNewYork	Yiddish (New York)
bdpa	ProtoGermanic	Proto-Germanic
bdpa	NorthMansi	North Mansi
bdpa	MiddleLozvaMansi	Middle Lozva Mansi
bdpa	LowerLozvaMansi	Lower Lozva Mansi
bdpa	KondaMansi	Konda Mansi
bdpa	TavdaMansi	Tavda Mansi
bdpa	UpperDemjankaKhanti	Upper Demjanka Khanti
bdpa	KondaKhanti	Konda Khanti
bdpa	NizjamKhanti	Nizjam Khanti
bdpa	SherkaliKhanti	Sherkali Khanti
bdpa	VakhKhanti	Vakh Khanti
bdpa	VerkhneKalimskKhanti	Verkhne Kalimsk Khanti
bdpa	VasjuganKhanti	Vasjugan Khanti
bdpa	VartovskojeKhanti	Vartovskoje Khanti
bdpa	LikrisovskojeKhanti	Likrisovskoje Khanti
bdpa	MalyjJuganKhanti	Malyj Jugan Khanti
bdpa	TremjuganKhanti	Tremjugan Khanti
bdpa	JuganKhanti	Jugan Khanti
bdpa	KazimKhanti	Kazim Khanti
bdpa	SinjaKhanti	Sinja Khanti
bdpa	ObdorskKhanti	Obdorsk Khanti
bdpa	PelimkaMansi	Pelimka Mansi
bdpa	Italian	Italian
bdpa	French	French
bdpa	Occitan	Occitan
bdpa	Ligurian	Ligurian
bdpa	LombardWest	Lombard (West)
bdpa	LombardEast	Lombard (East)
bdpa	Ladino	Ladino
bdpa	Venetian	Venetian
bowernpny	Gairi	Gairi
bowernpny	JaruMcC	Jaru-McC
bowernpny	Karree	Karree
bowernpny	KukuYalanjiCurr	KukuYalanjiCurr
bowernpny	Kungadutyi	Kungadutyi
bowernpny	MangalaMcK	MangalaMcK
bowernpny	MangalaNW	MangalaNW
bowernpny	MaryRiverandBunyaBunyaCountry	Mary River and Bunya Bunya Country
bowernpny	MountFreelingDiyari	Mount Freeling Diyari
bowernpny	MudburraMcC	Mudburra-McC
bowernpny	NggoiMwoi	Ng'goi Mwoi
bowernpny	WalmajarriBilliluna	WalmajarriBilliluna
bowernpny	WalmajarriHR	WalmajarriHR
bowernpny	WalmajarriNW	WalmajarriNW
bowernpny	WangkumaraMcDWur	WangkumaraMcDWur
chenhmongmien	WesternQiandong	Qiandong, West
chindialectsurvey	TaungthaWethet-T-1	Taungtha (Wethet)	rtc
chindialectsurvey	ThaiphumRengkheng-T-7	Thaiphum (Rengkheng)	cth
chindialectsurvey	DoituHetsawlay-U-11	Doitu (Hetsawlay)	csj
chindialectsurvey	LaituKhuasung-U-12	Laitu (Khuasung)	clj
chindialectsurvey	LaisawThuHtayKung-A	Laisaw Thu Htay Kung	clj
chindialectsurvey	SonglaiHettui8KarchaungHettui-A	Songlai-Hettui 8Karchaung (Hettui)	csj
chindialectsurvey	SonglaiMaungUmSong1MaungUmSong-A	Songlai-Maung Um (Song) 1Maung Um (Song)	csj
chindialectsurvey	LaituAhongdong-A	Laitu Ahongdong	clj
chindialectsurvey	KaangKruk-A	Kaang Kruk	ckn
chingelong	Gelong	Gelong
deepadungpalaung	ChuDongGua	Chu Dong Gua
deepadungpalaung	ChaYeQing	Cha Ye Qing
deepadungpalaung	NamHsan	Namhsan
deepadungpalaung	KhunHawt	Khun Hawt
deepadungpalaung	HtanHsan	Htan Hsan
deepadungpalaung	PangKham	Pangkham
deepadungpalaung	ManLoi	Man Loi
deepadungpalaung	NyaungGone	Nyaung Gone
deepadungpalaung	BanPaw	Ban Paw
deepadungpalaung	NoeLae	Noe Lae
deepadungpalaung	PongNuea	Pong Nuea
duonglachi	BanPhungLaChi	La Chí Bản Phùng
duonglachi	NungDinLaChi	Nùng Dín
felekesemitic	Gogot	Gogot
felekesemitic	Oromo	Oromo
gaotb	MenbaCuona	Menba (Cuona)
gaotb	MenbaMotuo	Menba (Motuo)
gaotb	YiMile	Yi (Mile)
gaotb	BikaHani	Hani (Bika)
gaotb	HayaHani	Hani (Haya)
gaotb	HaobaiHani	Hani (Haobai)
gerarditupi	Tenharim	Tenharim
gerarditupi	WayampiJ	Wayampí J
gerarditupi	GuaraniAntigo	Guarani Antigo
hsiuhmongmien	NaMeoTuyenQuang	Na Meo (Tuyen Quang)
hsiuhmongmien	Zhenmin	Zhenmin
hsiuhmongmien	Guncen	Guncen
hsiuhmongmien	Datu	Datu
hsiuhmongmien	Yangpai	Yangpai
hsiuhmongmien	Xiangao	Xiang’ao
hsiuhmongmien	WesternQiandong	Heba
hsiuhmongmien	Baixing	Baixing
kleinewillinghoeferbikwinjen	Joole	Joole
leejaponic	MiddleJapanese	Middle Japanese
leejaponic	Nara	Nara
bremerberta	BelejeGonfoye	Beleje Gonfoye
leekoreanic	Gangwon	Gangwon
peirosaustroasiatic	Bahnar	Bahnar	bdq
peirosaustroasiatic	Hadang	Hadang
peirosaustroasiatic	Hre	Hre	hre
peirosaustroasiatic	Je	Je
peirosaustroasiatic	Kadong	Kadong
peirosaustroasiatic	Ma1	Ma1
peirosaustroasiatic	Ma2	Ma2
peirosaustroasiatic	Panong	Panong
peirosaustroasiatic	Veh	Veh
peirosaustroasiatic	Bru	Bru	bru
peirosaustroasiatic	BruVK	BruVK	xhv
peirosaustroasiatic	Dakkang	Dakkang
peirosaustroasiatic	Kantu	Kantu
peirosaustroasiatic	Mak	Mak
peirosaustroasiatic	Neu	Neu
peirosaustroasiatic	Ong	Ong
peirosaustroasiatic	Taoih	Taoih
peirosaustroasiatic	Iduh	Iduh
peirosaustroasiatic	Khmu	Khmu	kjg
peirosaustroasiatic	KxinhMul	KxinhMul
peirosaustroasiatic	Pray	Pray	pry
peirosaustroasiatic	Paliu	Paliu
peirosaustroasiatic	Gantang	Gantang
peirosaustroasiatic	Guanshuang	Guanshuang
peirosaustroasiatic	Khamet	Khamet
peirosaustroasiatic	Khme	Khme
peirosaustroasiatic	Mane	Man'e
peirosaustroasiatic	Mangan	Mang'an
peirosaustroasiatic	Pangpin	Pangpin
peirosaustroasiatic	Plang	Plang
peirosaustroasiatic	Shuangdiang	Shuangdiang
peirosaustroasiatic	Wa	Wa
peirosaustroasiatic	Yongde	Yongde
peirosaustroasiatic	Arem	Arem
peirosaustroasiatic	Cuoi	Cuoi
peirosaustroasiatic	KhaPhong	KhaPhong
peirosaustroasiatic	Liha	Liha
peirosaustroasiatic	MuongKoi	MuongKoi
peirosaustroasiatic	PhongV	Phong(V)
peirosaustroasiatic	Tuum	Tuum
peirosaustroasiatic	ThoMon	Tho Mon
savelyevturkic	CodexCumanicus	Cuman
servamalagasy	AntandroyAmbovombe	Antandroy
servamalagasy	MikeaAmpoakafo	Mikea
servamalagasy	BetsimisarakaFenoarivo-Est	Betsimisaraka
servamalagasy	SakalavaMaintirano	Sakalava
servamalagasy	SakalavaMahajanga	Sakalava
servamalagasy	AntaimoroManakara	Antaimoro
servamalagasy	AntambohoakaMananjary	Antambohoaka
servamalagasy	AntaisakaVangaindrano	Antaisaka
servamalagasy	BetsileoAmbositra	Betsileo
servamalagasy	BetsileoAmbalavao	Betsileo
servamalagasy	AntanalanaItampolo	Antanalana
servamalagasy	AntanosyBezaha	Antanosy
servamalagasy	TanalaIfanadiana	Tanala
servamalagasy	AntanalanaManorofify	Antanalana
servamalagasy	AntandroyToliara	Antandroy
servamalagasy	AntanalanaAnakao	Antanalana
servamalagasy	NosyBorahaAmbodifotatra	Nosy Boraha
servamalagasy	AntanosyBelamoty	Antanosy
servamalagasy	AntandroyTsihombe	Antandroy
suntb	AmdoTibetanBlabrang	Tibetan (Amdo:Bla-brang)
suntb	AmdoTibetanZeku	Tibetan (Amdo:Zeku)
transnewguineaorg	magi-musak	Magɨ
transnewguineaorg	proto-eleman-koriki	Proto-Eleman-Koriki
transnewguineaorg	proto-isumrud	Proto-Isumrud
transnewguineaorg	proto-north-adelbert	Proto-North-Adelbert
transnewguineaorg	proto-pihom	Proto-Pihom
transnewguineaorg	proto-sub-rai	Proto-Sub-Rai
wangbai	Dashi	Dashi
wangbai	Jinxing	Jinxing
wangbai	Mazhelong	Mazhelong
wangbai	Gongxing	Gongxing
wangbai	ProtoBai	Proto-Bai
yanglalo	ProtoLalo	Proto-Lalo
yangyi	Ghomozo	Ghomozo
yangyi	EDiaocao	E-Diaocao
yangyi	EHoushan	E-Houshan
yangyi	ETaoshu	E-Taoshu
yangyi	SEGaoping	SE-Gaoping
yangyi	Nise	Nise
yangyi	Noso	Noso
yangyi	LopeAwuChen2010	Lope (Awu)
yangyi	LopeAwuYYFC1983	Lope (Awu)2
yangyi	Lidim	Lidim (Tianba)
zhivlovobugrian	lowerlozvamansi	Lower Lozva Mansi
zhoubizic	ProtoBizic	Proto-Bizic
abvdoceanic	Riwo	Riwo
abvdoceanic	TesmbolUsus	Tesmbol (Usus)
abvdoceanic	SivitiBeterbuJericho	Siviti (Beterbu, Jericho)
abvdoceanic	atarxobuGunwar	ßatarxobu (Gunwar)
abvdoceanic	Najit	Najit
abvdoceanic	AlavasWowoWowo1	Alavas-Wowo (Wowo 1)
abvdoceanic	MandriFarun16291	Mandri (Farun) 162-91

20211112-13h09m_sdcorr_languages_errors.csv

LinguList commented 3 years ago

Most of the cases with missing glottocodes are not easy to fix, as we have dialects here, mainly in BDPA, etc. These were often deliberately ignored and would not be recommended for tests either.

SimonGreenhill commented 3 years ago

Yeah, the family == None's look like isolates or proto-languages so should be ignored.

I've updated servamalagasy here. The entries in abvdoceanic and transnewguineaorg have no glottocodes because they are either proto-languages Glottolog doesn't believe in, or are simply not in Glottolog (although I figured out Riwo was a variety of Gedaged, so that should now be updated).

I'll look at peirosaustroasiatic shortly and see if I can match some of those up.

SimonGreenhill commented 3 years ago

Note that abvdoceanic will have a big update in a few days which will fix the Riwo issue

XachaB commented 2 years ago

Thanks ! I'm downloading updates regularly so I should get all the nice corrections anytime more come in.

I have two more questions:

In servamalagasy, there are several rows with the same glottocode and name in the language table. Would it be possible to adjust their names so that they can be distinguished on the basis of this pair of values alone ?

dataset	glottocode	language_ID	language_Name	ISO639P3code
servamalagasy	anta1255	AntankaranaVohemar	Antankarana	xmv
servamalagasy	bara1369	BaraRanohira	Bara	bhr
servamalagasy	bets1235	BetsileoAmbohimahasoa	Betsileo
servamalagasy	maha1309	MahafalyEjeda	Mahafaly
servamalagasy	meri1243	MerinaAnalavory	Merina
servamalagasy	nort2890	BetsimisarakaMahanoro	Betsimisaraka	bmm
servamalagasy	nort2890	BetsimisarakaAntsiranana	Betsimisaraka	bmm
servamalagasy	nort2890	BetsimisarakaBrickaville	Betsimisaraka	bmm
servamalagasy	nort2890	BetsimisarakaToamasina	Betsimisaraka	bmm
servamalagasy	nort2890	BetsimisarakaMananara	Betsimisaraka	bmm
servamalagasy	nort2890	BetsimisarakaMaroantsetra	Betsimisaraka	bmm
servamalagasy	saka1291	SakalavaMorondava	Sakalava	skg
servamalagasy	saka1291	SakalavaMiandrivazo	Sakalava	skg
servamalagasy	saka1291	SakalavaBesalampy	Sakalava	skg
servamalagasy	saka1291	SakalavaBeloniTsiribihina	Sakalava	skg
servamalagasy	siha1244	SihanakaMoraranoChrome	Sihanaka
servamalagasy	siha1244	SihanakaAndilamena	Sihanaka
servamalagasy	sout2920	BetsimisarakaSahavato	Betsimisaraka	bzc
servamalagasy	tsim1257	TsimihetyMampikony	Tsimihety	xmw
servamalagasy	tsim1257	TsimihetyAndapa	Tsimihety	xmw
servamalagasy	tsim1257	TsimihetyAntsohihy	Tsimihety	xmw
servamalagasy	vezo1235	VezoMorombe	Vezo
servamalagasy	vezo1235	VezoMorondava	Vezo

If not, I can use the area to distinguish them, but this is a problem I encounter only with this single dataset, hence I am checking with you.

I tried guessing when to use partial cognates and when to use full-word cognates based on whether the data had words segmented into morphemes. This turns out not to be ideal. Instead, could we come up with a list (which I am guessing will be short) of language families for which we know in advance that we will need partial cognates ? Doing this manually seems to be more linguistically motivated. List of all families in lexicore attached.

LinguList commented 2 years ago

@XachaB, it is no problem to adjust the names: all one needs to do is modify the name in the lexibank dataset. But my own take on most datasets I have worked with so far is that the ID is the better representative of a language for the purpose of plotting names, storing them, etc., which is why the IDs are not numeric wherever I found time to prepare this. So what I want to say: if you run into problems because your code distinguishes languages by their name, I'd recommend switching to the ID instead, as we also do in cltoolkit for this very reason.
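To illustrate the recommendation, here is a minimal sketch using a few rows copied from the servamalagasy table earlier in this thread. The function names and the tuple layout are hypothetical, not part of cltoolkit or lexitools. It shows that a (glottocode, name) key collapses distinct varieties, while a (dataset, language ID) key keeps them apart.

```python
from collections import defaultdict

# A few rows copied from the servamalagasy table above:
# (dataset, glottocode, language_ID, language_Name)
ROWS = [
    ("servamalagasy", "saka1291", "SakalavaMorondava", "Sakalava"),
    ("servamalagasy", "saka1291", "SakalavaMiandrivazo", "Sakalava"),
    ("servamalagasy", "tsim1257", "TsimihetyAndapa", "Tsimihety"),
    ("servamalagasy", "tsim1257", "TsimihetyAntsohihy", "Tsimihety"),
    ("servamalagasy", "bara1369", "BaraRanohira", "Bara"),
]


def ambiguous_keys(rows):
    """Return {(glottocode, name): [language IDs]} for keys shared by several rows."""
    groups = defaultdict(list)
    for _dataset, glottocode, lang_id, name in rows:
        groups[(glottocode, name)].append(lang_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def unique_keys(rows):
    """Key each language by (dataset, language ID), which stays distinct."""
    return {(dataset, lang_id) for dataset, _glottocode, lang_id, _name in rows}


collisions = ambiguous_keys(ROWS)  # two keys are shared by two varieties each
keys = unique_keys(ROWS)           # five distinct keys, one per row
```

Including the dataset name in the key is an assumption on my side: it guards against two datasets reusing the same language ID when rows are combined per genus.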

LinguList commented 2 years ago

As to the cognate datasets, @XachaB, we can make a list, but be aware that partial cognates are so far mostly only done by myself, so there is a single coder, as nobody else has coded partial cognates so far. My own conviction with respect to sound correspondences is that one always would need some version of partial cognates. But in all lexicore-CogCore datasets, the cognates are typically not partial cognates. Furthermore, if you want to detect which dataset uses partial cognates, you can do so by checking if the segment_slice attribute is defined for the lexeme in CLDF, which we use to render partial cognates.

LinguList commented 2 years ago

Ah, and the final problem is: even if we NEED partial cognates, as for many ST languages, we may not HAVE them. So you'd need a list that tells you which data comes in segmented form. In theory this could also be checked automatically, but you'll encounter diversity, with one dataset for a family being segmented and another one for the same family not.

So it is not trivial, if you want to compare ACROSS datasets, what to do here. If you want to compare INSIDE datasets, I can provide all information.

XachaB commented 2 years ago

I was maybe unclear, but indeed, I am not asking if the dataset provides partial cogids (this is easy to see). I am predicting cognates using lingpy, and am wondering when to use Partial and when to use Lexstat.

Of course, I can only use Partial when there are segmented words (which is also easy to check). But using this to guess whether I "should" is sort of a problem, since as you say there may be just a few words, or just a few languages, with segmented words. And it is indeed worse when comparing, as I am doing, across datasets, as two datasets for the same family might not both have the segmentation into morphemes.

I was hoping to improve the situation by:

Making a manual list of families where it makes sense to aim for Partial (eg: ST)*
When there are consistently "+" in the words in these families, use Partial

Do you think there is some hope for this strategy ?

*: I understand your general conviction that partial cognates make more sense, but surely, there must be languages where doing whole word cognates is an acceptable approximation, and some where we just can't do without partial cognates ?
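The two-step strategy proposed above could be sketched as follows. Everything here is an assumption made for illustration: the family list, the 0.8 threshold standing in for "consistently", the representation of forms as space-separated segments, and the function names. The returned labels are plain strings, not calls into lingpy.

```python
# Hypothetical hand-curated list; "Sino-Tibetan" stands in for "ST" above.
PARTIAL_FAMILIES = {"Sino-Tibetan"}


def share_segmented(forms, marker="+"):
    """Fraction of forms that contain at least one morpheme boundary marker."""
    if not forms:
        return 0.0
    return sum(marker in form.split() for form in forms) / len(forms)


def choose_method(family, forms, min_share=0.8):
    """Return "partial" only for listed families whose forms are consistently segmented."""
    if family in PARTIAL_FAMILIES and share_segmented(forms) >= min_share:
        return "partial"
    return "lexstat"


# Invented toy forms, segmented with "+" between morphemes.
segmented = ["t a + k a", "p a + m i", "s u + n a"]
mostly_plain = ["t a k a", "p a m i", "s u + n a"]

method_a = choose_method("Sino-Tibetan", segmented)
method_b = choose_method("Sino-Tibetan", mostly_plain)
method_c = choose_method("Turkic", segmented)
```

Computing the share per dataset rather than per family would expose the cross-dataset case where only one source is segmented.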

LinguList commented 2 years ago

In fact, Partial in theory yields the same results if you use it on non-segmented wordlists. So you could just say: I use Partial in all cases.

LinguList commented 2 years ago

So this would then solve your problem pragmatically.

LinguList commented 2 years ago

And if you restrict experiments to certain salient contexts, where you have limited gaps in the data, since gaps are either due to sound change, or due to morphemes missing, you could maybe even get along well with lexstat itself, even on segmented datasets.

XachaB commented 2 years ago

That's indeed a pragmatic answer !

The problem will remain of having diversity, with one dataset being segmented for the same family, and one not. In that case, we will get a terrible output if two cognates are on one side segmented and on the other unsegmented.

Should I take that as an argument for stopping cross-dataset comparisons ? I hope not, but I don't really see any obvious solutions.

XachaB commented 2 years ago

Though note that this problem will lead to under-detection rather than over-detection. In the specific case of sound correspondences, missing some cognates (bad recall) is less of a problem than having bad precision.

LinguList commented 2 years ago

In my opinion, cross-dataset comparisons require quite some preprocessing, which makes them notoriously difficult to handle (e.g. in https://doi.org/10.12688/openreseurope.13843.2 I spent a lot of time on the data and ran several concept coverage checks to get the right balance, and still have lots of missing data). Cross-dataset comparison would require deriving a list of some 300 concepts that have some 80% coverage in all datasets we select, and a mutual coverage of at least 150 words per language pair, without exceptions. So it would be a dataset we derive from the other datasets. In such a dataset, we could just delete all segmentations. But if we go this way, we should start right away and make a dedicated lexibank dataset that derives the data and also adds some conversions to the original word forms, thus similar to lexibank-analysed, but with forms. Maybe this is even the best way to go? In this way, we can also kick out low-coverage languages, etc.
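The two coverage criteria mentioned here can be sketched in plain Python. This is only an illustration under stated assumptions: the data structure (a dict from language to the set of concepts it attests) and both function names are hypothetical, and the toy thresholds are far smaller than the 300 concepts, 80% and 150 words named above.

```python
from itertools import combinations


def select_concepts(wordlist, min_coverage=0.8):
    """Concepts attested in at least `min_coverage` of the languages."""
    n_langs = len(wordlist)
    all_concepts = set().union(*wordlist.values())
    return {
        concept
        for concept in all_concepts
        if sum(concept in attested for attested in wordlist.values()) / n_langs
        >= min_coverage
    }


def mutual_coverage(wordlist, concepts=None):
    """Number of shared concepts for every language pair."""
    result = {}
    for lang_a, lang_b in combinations(sorted(wordlist), 2):
        shared = wordlist[lang_a] & wordlist[lang_b]
        if concepts is not None:
            shared &= concepts  # restrict to the selected concept list
        result[(lang_a, lang_b)] = len(shared)
    return result


# Invented toy data: language -> concepts with at least one word form.
toy = {
    "A": {"hand", "foot", "eye"},
    "B": {"hand", "foot"},
    "C": {"hand", "eye", "sun"},
}

kept = select_concepts(toy, min_coverage=0.6)
pairs = mutual_coverage(toy, concepts=kept)
lowest = min(pairs.values())
```

Languages or pairs falling below the chosen thresholds would then be dropped before any cognate detection is run.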

LinguList commented 2 years ago

The more I think about it, the more I think we should do exactly that: make a dedicated NEW dataset using the lexibank-analysed procedure, which would give us the best of what we have (some ~20 language families, high coverage among them, etc.). From there, one would then plug in your code.

XachaB commented 2 years ago

So far my code loads and processes separate datasets. There would be quite a few changes if we moved to a special, smaller, compound dataset on which I would then run my code. Let me know in advance if you decide to do that !

Re: entirely removing segmentations, that too is an option even without needing to make a separate dataset, of course.

My preprocessing currently already ignores a lot of data (isolates, proto families, glottolog issues, loan words, etc). If you were to generate a list of concepts, I could easily dynamically pass that to further limit the set of data points I am working on (without needing to create any dedicated dataset).

For now, I have a condition where if the minimum mutual coverage in a family (across datasets) is below 100, I use the SCA method instead of lexstat+infomap. I could also just drop the family in that case, though of course that would even further reduce the amount of usable data.
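The fallback rule described above amounts to something like the sketch below. The function name, the string labels and the `drop_low_coverage` switch are hypothetical; the input is assumed to be a dict from language pairs to their mutual coverage within one family.

```python
def pick_method(pair_coverage, threshold=100, drop_low_coverage=False):
    """Choose a cognate detection method from the minimum mutual coverage.

    Returns "lexstat+infomap", "sca", or None when the family is dropped.
    """
    # A family without any language pair is treated as having no coverage.
    lowest = min(pair_coverage.values(), default=0)
    if lowest >= threshold:
        return "lexstat+infomap"
    return None if drop_low_coverage else "sca"


# Invented coverage values for two toy families.
well_covered = {("A", "B"): 180, ("A", "C"): 150, ("B", "C"): 120}
sparse = {("A", "B"): 180, ("A", "C"): 60, ("B", "C"): 120}

choice_1 = pick_method(well_covered)
choice_2 = pick_method(sparse)
choice_3 = pick_method(sparse, drop_low_coverage=True)
```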

XachaB commented 2 years ago

> And if you restrict experiments to certain salient contexts, where you have limited gaps in the data, since gaps are either due to sound change, or due to morphemes missing, you could maybe even get along well with lexstat itself, even on segmented datasets.

I hadn't seen this suggestion. Can you clarify what these "salient contexts" would be ? In any case, deletions being because of morphology is another big problem I am encountering.

LinguList commented 2 years ago

LingPy can syllabify words automatically and derive some basic contexts, like pre-vocalic, post-vocalic, etc. One can also restrict an analysis to word-initial position. In addition, profiling alignments by checking how many consecutive gaps occur allows one to keep only those parts of an alignment where a sufficiently large number of consecutive columns is filled with sounds. Thus, while

a b a h a p a k a p -

would not really surprise me,

- b a h a b a k a p -

would show the loss of a whole syllable and is thus unlikely to result from regular sound change, at least not if you look at shallow time depths.
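The gap-profiling idea can be sketched without LingPy. The function names and the `max_run` threshold are hypothetical, and the toy alignments are invented for illustration; they are not a reconstruction of the example above. An alignment is assumed to be a list of rows, each row a list of segments with "-" for a gap.

```python
def longest_gap_run(row, gap="-"):
    """Length of the longest run of consecutive gaps in one aligned row."""
    longest = current = 0
    for symbol in row:
        current = current + 1 if symbol == gap else 0
        longest = max(longest, current)
    return longest


def plausible(alignment, max_run=1):
    """True if no row of the alignment has more than `max_run` consecutive gaps."""
    return all(longest_gap_run(row) <= max_run for row in alignment)


# A single missing segment: compatible with regular sound change.
single_gap = [["b", "a", "h", "a"], ["p", "a", "k", "-"]]
# Two consecutive gaps: looks like a lost syllable or a missing morpheme.
lost_syllable = [["b", "a", "h", "a"], ["-", "-", "h", "a"]]

keep_first = plausible(single_gap)
keep_second = plausible(lost_syllable)
```

With `max_run=1`, any row that loses two or more adjacent segments is filtered out, which is one crude way to exclude alignments where morphology rather than sound change explains the gaps.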

LinguList commented 2 years ago

As to the preprocessing: I see one danger if the code does things en passant and then processes and outputs sound correspondences. The advantage of using the admittedly new idea we pushed in lexibank-analysed is that these steps are made explicit. This helps to debug and to deal with errors directly, already when constructing the CLDF dataset from other CLDF datasets.

All code that has been written could be easily added to specific commands in such a repository, and one could make use of cltoolkit's Wordlist class, which was designed to allow for an easy integration of cldf datasets from different sources.

BTW: I am not sure how reliable the exclusion of loan words is, if it is only annotated sporadically. One should assume that correspondence patterns of low attestation would also allow us to simply exclude those cases later on?

lexibank / lexitools

Sound correspondences: using Lexstat #19

1. Trusting language information in the absence of glottocode ?

2. What is this "BookKeeping" family in glottolog ?

Book Keeping langoids

The langoid is not known by pyglottolog

The langoid family is `None`

The language table does not give any glottocode
