lexibank / lexitools


Sound correspondences: using Lexstat #19

Open XachaB opened 3 years ago

XachaB commented 3 years ago

Hi,

Background: I am writing the methods section for the sound correspondences paper. I evaluated the simple "cognate detection" method we used against the expert annotations in Lexicore datasets. The results are very poor (precision 59%, recall 34%). Since lingpy already has a class to do cognate detection properly (LexStat), I want to refactor the lexitools correspondences code to use it instead. Otherwise, I can't trust the results.
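For readers unfamiliar with this kind of evaluation, here is a minimal sketch of one common way to score automatic cognate detection against expert annotation: pairwise precision and recall. The thread does not say which measure produced the figures above, so the pairwise scheme, the function names, and the toy data are all assumptions for illustration.

```python
from itertools import combinations


def cognate_pairs(cogids):
    """Return the set of word pairs that share a cognate-set ID.

    `cogids` maps a word ID to a cognate-set ID (hypothetical input format).
    All words are assumed to express the same concept.
    """
    pairs = set()
    for a, b in combinations(sorted(cogids), 2):
        if cogids[a] == cogids[b]:
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(expert, detected):
    """Compare detected cognate pairs against expert cognate pairs."""
    gold = cognate_pairs(expert)
    pred = cognate_pairs(detected)
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data: four words for one concept.
expert = {1: "a", 2: "a", 3: "a", 4: "b"}
detected = {1: "x", 2: "x", 3: "y", 4: "y"}
precision, recall = pairwise_precision_recall(expert, detected)
```

Here one of the two detected pairs is correct, and one of the three expert pairs is recovered.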

Problem:

I read the doc: LexStat expects a qlc filename in input.

However, I am reading several CLDF datasets, combining them per genus, and searching for cognates in each. I would need to pass lists of rows from different CLDF datasets (or some object wrapping dataset rows).

I will also need to disregard LexStat output whenever expert annotation is available, since for each genus some data will come from a single dataset with cognacy judgments, and some will have been aggregated across datasets. While this seems simple (for each pair, check whether it was annotated), I would like to be sure there isn't a catch that would make this setup a bad idea.
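The per-pair check described above could look roughly like the sketch below. It assumes that expert cognate IDs are only comparable within a single dataset, so the expert judgment is used only when both words are annotated in the same dataset. The data layout and the function name are hypothetical.

```python
def pair_is_cognate(a, b, expert, lexstat):
    """Decide cognacy for words `a` and `b`, preferring expert judgments.

    expert:  word ID -> (dataset, cognate-set ID), for annotated words only
    lexstat: word ID -> automatically inferred cognate-set ID
    """
    ea, eb = expert.get(a), expert.get(b)
    # Trust the expert only if both words are annotated in the same dataset.
    if ea is not None and eb is not None and ea[0] == eb[0]:
        return ea[1] == eb[1]
    # Otherwise fall back on the automatic clustering.
    return lexstat[a] == lexstat[b]


# Toy data: words 1 and 2 are annotated in the same dataset, word 3 is not.
expert = {1: ("datasetA", 10), 2: ("datasetA", 11)}
lexstat = {1: 5, 2: 5, 3: 5}
```

With this toy data, words 1 and 2 are not cognate (the expert overrides LexStat), while words 1 and 3 are (only LexStat has an opinion).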

Questions:

Notes:

@LinguList Do you have a solution ?

LinguList commented 3 years ago

We're currently working on a new package, called cltoolkit, which reads in a CLDF dataset, runs all kinds of consistency checks regarding lexibank, and also allows converting to a lingpy wordlist, merging data from different CLDF datasets and optionally filtering it.

This is probably the way to work.

You work with languages on something like a genus level, right? If so lexstat won't have a problem in scaling up.

We're still refactoring cltoolkit, but should be done in one or two weeks. Should we then share the relevant code with you? I assume we'll then also rather quickly make the code public, so it can be used as a normal dependency. Since cltoolkit checks on loading whether segments conform to CLTS, it is probably also useful for the code on sound correspondence detection.

XachaB commented 3 years ago

Thanks ! The package you mention sounds like what I need, and I am very interested in the code.

I am a little bit worried that the sound correspondence project keeps proving Hofstadter's law right: each time I think I have something ready, it ends up requiring a few more weeks of work. Do you think the "one or two weeks" is an optimistic estimate (and it might be a month or two instead), or will there definitely be usable code in a week or two ?

If it's the former, maybe in the meantime I should just write a quick thing that does the work (in the worst case scenario, writing and reading from a bunch of temporary files), so that I can make progress.

LinguList commented 3 years ago

@xrotwang is now checking the code to make sure that fundamental problems do not occur, and we'll use the package as a backbone for our lexibank study, which we plan to submit soon.

You can check already now, but it is private still, which is why I'd recommend to wait.

But to run a LexStat cognate analysis over more than one CLDF dataset, interim code can also be as simple as:

from lingpy import Wordlist

# Pairs of (CLDF column, LingPy column name).
namespace = (("id", "lexibank_id"), ("language_id", "doculect"), ("concept_concepticon_gloss", "concept"), ("segments", "tokens"), ("language_glottocode", "glottocode"))
# LingPy's dictionary format: key 0 holds the header, integer keys hold the rows.
D = {0: [x[1] for x in namespace]}
idx = 1
for path2cldf in paths2cldf:
    wl = Wordlist.from_cldf(path2cldf, columns=[x[0] for x in namespace], namespace=namespace)
    for idx_ in wl:
        # Skip rows that are not linked to a concept.
        if wl[idx_, "concept"]:
            D[idx] = [wl[idx_, col[1]] for col in namespace]
            idx += 1
wl = Wordlist(D)
print(wl.height, wl.width)

LinguList commented 3 years ago

Just tested this with:

paths2cldf = ["allenbai/cldf/cldf-metadata.json", "wangbai/cldf/cldf-metadata.json"]

XachaB commented 3 years ago

Thanks for these tips ! If I need a LexStat instance rather than a Wordlist, can I do LexStat.from_cldf instead ?

LinguList commented 3 years ago

I recommend doing that afterwards: you say lex = LexStat(D). LexStat does internal conversions, so it is faster to load the data this way. You can of course also use the code to group by families and the like: the dictionary representation is the internal representation of a LingPy Wordlist, so you fill it as I have shown, and you can then use it to instantiate LexStat, Wordlist, or any other class derived from Wordlist.

XachaB commented 3 years ago

Ah great, I did not see that LexStat could be initialized from a dictionary of rows ! That is perfect. Looking at the source code it seemed to always need only a file path.

I think this is enough so that I can start writing something: load datasets using Wordlist.from_cldf, do any filtering and transformations I need, construct dictionaries for each genus, then initialize LexStat with the dicts, etc.
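A rough sketch of the grouping step in that plan: rows are split by genus and each group is packed into the dictionary layout shown earlier in the thread (key 0 for the header, integer keys for the rows). The row format, the `genus_of` lookup, and the helper names are hypothetical. `detect_cognates` uses LexStat's `get_scorer` and `cluster` methods with an illustrative threshold, and is not exercised here since it needs lingpy and real data.

```python
from collections import defaultdict

HEADER = ["doculect", "concept", "tokens", "glottocode"]


def wordlist_dicts_by_genus(rows, genus_of):
    """Build one LingPy-style dictionary per genus.

    rows:     iterable of dicts with the keys in HEADER
    genus_of: glottocode -> genus name (hypothetical lookup)
    """
    grouped = defaultdict(list)
    for row in rows:
        genus = genus_of.get(row["glottocode"])
        # Drop rows without a known genus or without a concept.
        if genus is not None and row["concept"]:
            grouped[genus].append([row[col] for col in HEADER])
    result = {}
    for genus, entries in grouped.items():
        D = {0: list(HEADER)}
        for idx, entry in enumerate(entries, start=1):
            D[idx] = entry
        result[genus] = D
    return result


def detect_cognates(D, threshold=0.6):
    """Run LexStat on one dictionary (threshold is an illustrative value)."""
    from lingpy import LexStat  # imported lazily; requires lingpy
    lex = LexStat(D)
    lex.get_scorer()
    lex.cluster(method="lexstat", threshold=threshold, ref="cogid")
    return lex


# Toy rows with made-up glottocodes.
rows = [
    {"doculect": "L1", "concept": "hand", "tokens": "h a n d", "glottocode": "aaaa1111"},
    {"doculect": "L2", "concept": "hand", "tokens": "h a n t", "glottocode": "aaaa2222"},
    {"doculect": "L3", "concept": "hand", "tokens": "m a n o", "glottocode": "bbbb1111"},
    {"doculect": "L3", "concept": "", "tokens": "p i e", "glottocode": "bbbb1111"},
    {"doculect": "L4", "concept": "hand", "tokens": "k a", "glottocode": "cccc1111"},
]
genus_of = {"aaaa1111": "GenusA", "aaaa2222": "GenusA", "bbbb1111": "GenusB"}
by_genus = wordlist_dicts_by_genus(rows, genus_of)
```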

XachaB commented 3 years ago

As a consequence of finding that cognate detection in our previous setup was insufficient to be able to claim anything, I have made a big update to the correspondence code. It is now in the branch SdCorrespWithLexStat.

Changes:

I will very soon run this on all of lexicore on a server (it has gotten too slow to run on my machine), and should be back with some news when I have a first result. Once I'm sure it runs on everything smoothly, I'll make a PR, no need to inspect the code before then.

@LinguList , @SimonGreenhill , @erichround , all my apologies for taking so much time to make progress on this: unfortunately, this is quite a large amount of work, and I can only work on it part of the time.

I am much, much, more familiar with Lingpy now, which is always a nice perk ;)

LinguList commented 3 years ago

Sounds cool, I have wanted to know for a long time how this works, especially when including the sound correspondence patterns to look for the most stable / most frequently recurring patterns :)

SimonGreenhill commented 3 years ago

Awesome! btw, let me know if you want to run it on the jena cluster. Looking forward to some results :)

XachaB commented 3 years ago

Thanks. For now I'm trying to run it on the Surrey cluster, maybe I'll ask if/when I use up all my allocated resources there !

Right now I keep blowing up the memory I request (64G over 10 CPU is still not enough). I'll see if I can either raise the memory request, or lower the number of CPUs, or maybe rewrite the thing to be parallel in a dumber way, that is to say, first export a csv for each family, then run an entirely separate process on each family.
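The "dumber" parallelisation mentioned here could be sketched as follows: write one CSV per family, then start a completely separate interpreter for each file so that memory is released between families. The row format, the file naming, and the script name `analyse_family.py` are assumptions; family names are used as file names unchanged, which would need sanitising for real data.

```python
import csv
import subprocess
import sys
import tempfile
from collections import defaultdict
from pathlib import Path


def export_family_csvs(rows, out_dir):
    """Write one CSV per family and return the paths, sorted by family."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    by_family = defaultdict(list)
    for row in rows:
        by_family[row["family"]].append(row)
    paths = []
    for family, family_rows in sorted(by_family.items()):
        path = out_dir / f"{family}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(family_rows[0]))
            writer.writeheader()
            writer.writerows(family_rows)
        paths.append(path)
    return paths


def run_family(path, script="analyse_family.py"):
    """Analyse one family in its own process (hypothetical script name)."""
    return subprocess.run([sys.executable, script, str(path)], check=True)


# Toy rows; only the export step is exercised here.
rows = [
    {"family": "FamilyA", "doculect": "L1", "concept": "hand"},
    {"family": "FamilyB", "doculect": "L3", "concept": "hand"},
    {"family": "FamilyA", "doculect": "L2", "concept": "hand"},
]
with tempfile.TemporaryDirectory() as tmp:
    paths = export_family_csvs(rows, tmp)
    written = [p.name for p in paths]
    line_counts = [len(p.read_text(encoding="utf-8").splitlines()) for p in paths]
```

On a cluster, each `run_family` call would typically become its own job submission rather than a local subprocess.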

I unfortunately have to time box this to Fridays (...and weekends), so I should only get back to it in a week.

XachaB commented 3 years ago

I am running into a few small issues with Glottolog; maybe one of you, @LinguList @SimonGreenhill, has the answer ?

1. Trusting language information in the absence of glottocode ?

First, so far I have been ignoring entirely languages given with no glottocodes:

https://github.com/lexibank/lexitools/blob/d0bdd1a4b2ff7b7fe6b64ecd490207d7b64ff876/src/lexitools/commands/correspondences.py#L293

But I am starting to understand that this is a really frequent occurrence, and it seems like a shame to throw out so much data.

The first alternative would be to keep those languages whenever there is a "Family" column in the language table. However, I thought the point of having glottocodes in cldf was to be certain that these types of info are properly standardized, and by using whatever is given as "Family", I open the door to non-meaningful variation. I can always normalize case, but we know that it is still possible that this would lead to duplication of families etc.

A third possibility would be to produce a list of everywhere with missing glottocodes and put some annotator efforts (if you still have time for people doing this) towards filling them in. However, I suspect that in many cases the languages might not be mappable to glottocodes. In that case, what is the best solution ?

What is your point of view on this ?

2. What is this "BookKeeping" family in glottolog ?

Glottolog has a family called "BookKeeping":

https://glottolog.org/resource/languoid/id/book1242

I imagine this is some meta-info for glottolog maintainers ? But why is it given as a family ? For example, https://github.com/lexibank/chindialectsurvey has several varieties classified as "BookKeeping" (in fact Sino-Tibetan), and this false family leads my script to attempt to find cognates in a set of unrelated languages (while keeping them apart from the rest of their actual families).

Any idea how to deal with this ? Is the info hidden elsewhere ? Should I fall back on whatever the dataset specifies (see problem in 1.) in that case ?

xrotwang commented 3 years ago

Ad 1) I'd much prefer putting in some effort to fill the gaps - i.e. assign (or even mint) missing glottocodes. I would want the intuition "no Glottocode => unreliable language info" to be valid.

Ad 2) "Bookkeeping" is a drawer for putative languages in Glottolog that turned out not to be "real". Since Glottolog has a policy of never expiring language-level Glottocodes, these need to be put somewhere. Now, the way Glottolog (technically) groups languages is via file-system directories, and this mechanism is used for both pseudo-families and families. This is somewhat unfortunate, but might be alleviated with more docs and some help from pyglottolog, see https://pyglottolog.readthedocs.io/en/latest/languoids.html#pyglottolog.languoids.Languoid.category

chrzyki commented 3 years ago

Hi @XachaB,

thanks for your work on these tools and analyses.

Regarding 1:

I'd be interested to learn how many cases like this (i.e. no Glottocodes) come up for your particular configuration. Also, I'm not sure I fully understand how a case where a Family exists but a Glottocode does not could happen. Maybe I'm misunderstanding something here. At any rate, I could certainly help with fixing Glottocode issues.

Regarding 2:

Bookkeeping is also explained in some detail here. (Just saw Robert's comment, so I'm keeping this short)

Regarding cluster:

If you'd like I could also help with setting this up here. Plenty of RAM to work with.

XachaB commented 3 years ago

Thanks for your quick answers !

1a. I can easily generate a table of all languages & datasets which are missing glottocodes, will do this Friday if I can. Maybe this can be a first step towards fixing them.

1b. @chrzyki Regarding how I might be able to recover Family where there is no Glottocode: I mean that the language table sometimes specifies a Family for each language, so I could get this information without querying glottolog. Ex, many rows without glottocodes but with a family here:

https://github.com/lexibank/chindialectsurvey/blob/master/cldf/languages.csv

Of course, this is really not as good as getting the family from glottolog, which ensures that everything is standardized.

2) So, does that mean I should simply ignore all of the data which is in "bookkeeping" ? I understand that these languoids are not seen as real by Glottolog editors (and maybe the wider linguistic community), but we do have lexibank datasets with these languoids. For example, datasets such as joophonosemantic, gaotb, chindialectsurvey, servamalagasy have languages classified as BookKeeping.

Re: server, thanks for the offer, I'll definitely follow up on it if I need !

xrotwang commented 3 years ago

Yes, I think data for "bookkeeping" languages should be ignored. Ideally, it should be possible (and - again ideally - not too hard) to figure out matching non-bookkeeping glottocodes, because often there are associated ISO change requests which recommend remedies.

XachaB commented 3 years ago

Noted, then I'll make a list of both languages without glottocodes & languages with bookkeeping codes, as both require manual annotations.

I have to say, I find it exciting that writing analyses at this larger scale helps improve the individual datasets. :)

SimonGreenhill commented 3 years ago

Agree with everything above, happy to hunt for glottocodes too.

Re bookkeeping -- I wonder how many of these can be 'fixed' (i.e. they are languages we have data from so they're not spurious)

XachaB commented 3 years ago

Here I am with a list of datasets with missing glottocodes, or glottocodes with some issues. For the full list see the attached file.

In total this amounts to 424 languages which currently are ignored by the lexitools/correspondences tool.

Here it is split by problem:

Bookkeeping languoids

| dataset | glottocode | language_ID | language_Name | ISO639P3code |
| --- | --- | --- | --- | --- |
| chindialectsurvey | wela1234 | RawngtuWeilong-A | Rawngtu Weilong | weu |
| chindialectsurvey | wela1234 | RawngtuRamtim-A | Rawngtu Ramtim | weu |
| gaotb | yuan1242 | MojiangYi | Yi (Mojiang) | yym |
| gaotb | naxi1246 | LijangNaxi | Naxi (Lijiang) | nbf |
| gaotb | naxi1246 | YongningNaxi | Naxi (Yongning) | nbf |
| johanssonsoundsymbolic | lenc1244 | Lencasalvador | Lenca-Salvador | len |
| johanssonsoundsymbolic | sana1281 | Sanapanaangaite | Sanapaná (Angaité) | sap |
| joophonosemantic | chua1256 | ChuanqiandianClusterMiao | Chuanqiandian Cluster Miao | cqd |
| servamalagasy | sout3125 | BetsimisarakaMarolambo | Betsimisaraka | bjq |

The languoid is not known by pyglottolog

| dataset | glottocode | language_ID | language_Name | ISO639P3code |
| --- | --- | --- | --- | --- |
| dunnaslian | sema1250 | Semnam_Malau | Semnam Malau | ssm |
| dunnaslian | teim1246 | Temiar_Perak | Temiar Perak | tea |
| dunnaslian | monn1258 | Mon | Mon | mnw |
| kesslersignificance | nucl1201 | Turkish | Turkish | |
| polyglottaafricana | maka1261 | MakhuwaMeetto | Makhuwa-Meetto | |
| saenkoromance | vall1248 | valladerromansh | Vallader_Romansh | |
| saenkoromance | cagl1238 | campidanese | Campidanese | |
| servamalagasy | meri1291 | MerinaMaevatanana | Merina | |
| sidwellbahnaric | kass1248 | kasseng | Kasseng | |
| transnewguineaorg | cent2257 | proto-central-sogeram | Proto-Central-Sogeram | |
| zgraggenmadang | sali1249 | maia-saki | maia-saki | |
| zgraggenmadang | para1207 | parawen | parawen | |
zgraggenmadang para1207 parawen parawen  

see issue https://github.com/lexibank/lexibank-analysed/issues/35 -- for some it might be a question of re-generating after accepting my merge requests.

The languoid family is None

dataset glottocode language_ID language_Name ISO639P3code
aaleykusunda kusu1250 KusundaGM Gyani Maiya kgg
aaleykusunda kusu1250 KusundaK Kamala kgg
aaleykusunda kusu1250 Kusunda Kusunda kgg
abrahammonpa hrus1242 HrusoAkaJamiri Hruso Aka Jamiri hru
chaconcolumbian cams1241 Kamsa kamsá kbh
chaconcolumbian puin1248 Puinave puinave pui
chaconcolumbian paez1247 Paez páez pbb
chacontukanoan tuca1253 Prototucanoan Proto-Tucanoan  
hantganbangime bang1363 Bangime Bangime dba
hubercolumbian cams1241 Kamsa Kamsá kbh
hubercolumbian puin1248 Puinave Puinave pui
hubercolumbian paez1247 Paez Páez pbb
johanssonsoundsymbolic abun1252 Abun Abun kgr
johanssonsoundsymbolic alse1251 Alsea Alsea aes
johanssonsoundsymbolic anda1286 Andaqui Andaqui ana
johanssonsoundsymbolic atak1252 Atakapa Atakapa aqp
johanssonsoundsymbolic bang1363 Bangime Bangime dba
johanssonsoundsymbolic basq1248 Basque Basque eus
johanssonsoundsymbolic bert1248 Berta Berta wti
johanssonsoundsymbolic buru1296 Burushaski Burushaski bsk
johanssonsoundsymbolic cand1248 CandoshiShapra Candoshi-Shapra cbu
johanssonsoundsymbolic cayu1262 Cayuvava Cayuvava cyb
johanssonsoundsymbolic cofa1242 Cofan Cofán con
johanssonsoundsymbolic cuit1236 Cuitlatec Cuitlatec cuy
johanssonsoundsymbolic esse1238 Esselen Esselen esq
johanssonsoundsymbolic fasu1242 Fasu Fasu faa
johanssonsoundsymbolic gaga1251 Gagadu Gagadu gbu
johanssonsoundsymbolic puel1244 Gununakune Gününa Küne pue
johanssonsoundsymbolic hadz1240 Hadza Hadza hts
johanssonsoundsymbolic hrus1242 Hruso Hruso hru
johanssonsoundsymbolic iton1250 Itonama Itonama ito
johanssonsoundsymbolic kano1245 Kanoe Kanoê kxo
johanssonsoundsymbolic karo1304 Karok Karok kyh
johanssonsoundsymbolic klam1254 Klamath Klamath kla
johanssonsoundsymbolic kunz1244 Kunza Kunza kuz
johanssonsoundsymbolic kuot1243 Kuot Kuot kto
johanssonsoundsymbolic kwaz1243 Kwaza Kwaza xwa
johanssonsoundsymbolic lavu1241 Lavukaleve Lavukaleve lvk
johanssonsoundsymbolic lule1238 Lule Lule ule
johanssonsoundsymbolic maib1239 Maybrat Maybrat ayz
johanssonsoundsymbolic mose1249 Moseten Mosetén cas
johanssonsoundsymbolic movi1243 Movima Movima mzp
johanssonsoundsymbolic muni1258 Muniche Muniche myr
johanssonsoundsymbolic nara1262 Nara Nara nrb
johanssonsoundsymbolic natc1249 Natchez Natchez ncz
johanssonsoundsymbolic bira1253 Ongota Ongota bxe
johanssonsoundsymbolic paez1247 Paez Páez pbb
johanssonsoundsymbolic pele1245 PeleAta Pele-Ata ata
johanssonsoundsymbolic pira1253 Piraha Pirahã myp
johanssonsoundsymbolic puin1248 Puinave Puinave pui
johanssonsoundsymbolic pume1238 Pume Pumé yae
johanssonsoundsymbolic sali1253 Salinan Salinan sln
johanssonsoundsymbolic sand1273 Sandawe Sandawe sad
johanssonsoundsymbolic savo1255 Savosavo Savosavo svs
johanssonsoundsymbolic seri1257 Seri Seri sei
johanssonsoundsymbolic shom1245 ShomPeng Shom Peng sii
johanssonsoundsymbolic sulk1246 Sulka Sulka sua
johanssonsoundsymbolic sume1241 Sumerian Sumerian sux
johanssonsoundsymbolic take1257 Takelma Takelma tkm
johanssonsoundsymbolic taus1253 Taushiro Taushiro trr
johanssonsoundsymbolic timu1245 Timucua Timucua tjm
johanssonsoundsymbolic tiwi1244 Tiwi Tiwi tiw
johanssonsoundsymbolic trum1247 Trumai Trumai tpy
johanssonsoundsymbolic tuni1252 Tunica Tunica tun
johanssonsoundsymbolic urar1246 Urarina Urarina ura
johanssonsoundsymbolic wage1238 Wageman Wageman waq
johanssonsoundsymbolic waor1240 Waorani Waorani auc
johanssonsoundsymbolic wara1303 Warao Warao wba
johanssonsoundsymbolic wash1253 Washo Washo was
johanssonsoundsymbolic yama1264 Yamana Yámana yag
johanssonsoundsymbolic yana1271 Yana Yana ynn
johanssonsoundsymbolic yele1255 Yele Yele yle
johanssonsoundsymbolic yura1255 Yuracare Yuracaré yuz
johanssonsoundsymbolic zuni1245 Zuni Zuni zun
joophonosemantic basq1248 Basque Basque eus
joophonosemantic buru1296 Burushaski Burushaski bsk
joophonosemantic paez1247 Paez Páez pbb
joophonosemantic sand1273 Sandawe Sandawe sad
joophonosemantic wara1303 Warao Warao wba
joophonosemantic maib1239 MaiBrat Mai Brat ayz
northeuralex buru1296 bsk Burushaski bsk
northeuralex basq1248 eus Basque eus
pharaocoracholaztecan utoa1244 ProtoUtoAztecan PUA  
transnewguineaorg abun1252 abun Abun kgr
transnewguineaorg abun1252 abun-jembun Abun (Jembun Dialect) kgr
transnewguineaorg abun1252 abun-senopi Abun (Senopi Dialect) kgr
transnewguineaorg boga1247 bogaya Bogaya boq
transnewguineaorg burm1264 burmeso Burmeso bzu
transnewguineaorg dama1272 damal Damal uhn
transnewguineaorg demm1245 dem Dem dem
transnewguineaorg dibi1240 dibiyaso Dibiyaso dby
transnewguineaorg duna1248 duna Duna duc
transnewguineaorg else1239 elseng Elseng mrf
transnewguineaorg fasu1242 fasu Fasu faa
transnewguineaorg kaki1249 kaki-ae Kaki Ae tbd
transnewguineaorg kapo1250 kapauri Kapauri khp
transnewguineaorg kehu1238 keuw Keuw khh
transnewguineaorg kibi1239 kibiri Kibiri prm
transnewguineaorg kolp1236 kol Kol kol
transnewguineaorg kuot1243 kuot Kuot kto
transnewguineaorg lavu1241 lavukaleve Lavukaleve lvk
transnewguineaorg maib1239 mai-brat Mai Brat ayz
transnewguineaorg mawe1251 mawes Mawes mgk
transnewguineaorg maib1239 maybrat Maybrat ayz
transnewguineaorg touo1238 mbaniata Mbaniata tqu
transnewguineaorg bilu1245 mbaniata-lokuru Mbaniata (Lokuru Dialect) blb
transnewguineaorg bilu1245 mbilua Mbilua blb
transnewguineaorg bilu1245 mbilua-ndovele Mbilua (Ndovele Dialect) blb
transnewguineaorg molo1262 molof Molof msl
transnewguineaorg morb1239 mor Mor moq
transnewguineaorg moro1289 morori Morori mok
transnewguineaorg mpur1239 mpur Mpur akc
transnewguineaorg mpur1239 mpur-arfu Mpur (Arfu Dialect) akc
transnewguineaorg mpur1239 mpur-kebar Mpur (Kebar Dialect) akc
transnewguineaorg yale1246 nagatiman Nagatiman nce
transnewguineaorg fasu1242 namumi Fasu (Namumi Dialect) faa
transnewguineaorg odia1239 odiai Odiai bhf
transnewguineaorg papi1255 papi Papi ppe
transnewguineaorg pawa1255 pawaia Pawaia pwa
transnewguineaorg nucl1580 proto-eleman Proto-Eleman  
transnewguineaorg koia1260 proto-koiarian Proto-Koiarian  
transnewguineaorg kwal1257 proto-kwalean Proto-Kwalean  
transnewguineaorg lake1255 proto-lakes-plain Proto-Lakes-Plain  
transnewguineaorg lowe1437 proto-lower-sepik Proto-Lower-Sepik  
transnewguineaorg manu1261 proto-manubaran Proto-Manubaran  
transnewguineaorg nduu1242 proto-ndu Proto-Ndu  
transnewguineaorg nucl1709 proto-trans-new-guinea Proto-Trans-New-Guinea  
transnewguineaorg pura1257 purari Purari iar
transnewguineaorg pyuu1245 pyu Pyu pby
transnewguineaorg saus1247 sause Sause sao
transnewguineaorg savo1255 savosavo Savosavo svs
transnewguineaorg tabo1241 tabo Tabo knv
transnewguineaorg tana1288 tanahmerah Tanahmerah tcm
transnewguineaorg usku1243 usku Usku ulf
transnewguineaorg wiru1244 wiru Wiru wiu
transnewguineaorg yetf1238 yetfa Yetfa yet
utoaztecan coah1252 Coahuilteco Coahuilteco xcw
utoaztecan coto1248 Cotaname Cotaname xcn
utoaztecan kara1289 Karankawa Karankawa zkk
utoaztecan kere1287 ProtoKeresan Proto-Keresan  
utoaztecan zuni1245 Zuni Zuni zun

The language table does not give any glottocode

dataset glottocode language_ID language_Name ISO639P3code
backstromnorthernpakistan   Shimshal Shimshal  
backstromnorthernpakistan   Chapursan Chapursan  
backstromnorthernpakistan   Gupis Gupis  
backstromnorthernpakistan   Gahorabad Gahorabad  
backstromnorthernpakistan   DashkinAstor Dashkin (Astor)  
backstromnorthernpakistan   KachuraJel Kachura (Jel)  
backstromnorthernpakistan   Gultari Gultari  
bdpa   Chimborazo Chimborazo  
bdpa   Tena Tena  
bdpa   Inkawasi Inkawasi  
bdpa   Cajamarca Cajamarca  
bdpa   Corongo Corongo  
bdpa   Caraz Caraz  
bdpa   Chavin Chavín  
bdpa   Huancayo Huancayo  
bdpa   Huancavelica Huancavelica  
bdpa   Cuzco Cuzco  
bdpa   Puno Puno  
bdpa   Taquile Taquile  
bdpa   Apolobamba Apolobamba  
bdpa   Cochabamba Cochabamba  
bdpa   Sucre Sucre  
bdpa   Kawki Kawki  
bdpa   Jaqaru Jaqaru  
bdpa   Huancane Huancané  
bdpa   Tiwanaku Tiwanaku  
bdpa   Oruro Oruro  
bdpa   Dashi Dàshí  
bdpa   Gongxing Gōngxìng  
bdpa   Jinxing Jīnxīng  
bdpa   Mazhelong Mǎzhělóng  
bdpa   AmericanEnglish American English  
bdpa   CanadianEnglish Canadian English  
bdpa   CentralGermanCologne Central German (Cologne)  
bdpa   CentralGermanHonigberg Central German (Honigberg)  
bdpa   CentralGermanLuxembourg Central German (Luxembourg)  
bdpa   CentralGermanMurrhardt Central German (Murrhardt)  
bdpa   Danish Danish  
bdpa   DutchAntwerp Dutch (Antwerp)  
bdpa   BelgianDutch Belgian Dutch  
bdpa   DutchLimburg Dutch (Limburg)  
bdpa   DutchOstend Dutch (Ostend)  
bdpa   Dutch Dutch  
bdpa   NewZealandEnglishAuckland New Zealand English (Auckland)  
bdpa   EnglishBuckie English (Buckie)  
bdpa   IndianEnglishDelhi Indian English (Delhi)  
bdpa   NigerianEnglishIgbo Nigerian English (Igbo)  
bdpa   SouthAfricanEnglishJohannisburg South African English (Johannisburg)  
bdpa   EnglishLindisfarne English (Lindisfarne)  
bdpa   EnglishLiverpool English (Liverpool  
bdpa   EnglishLondon English (London  
bdpa   EnglishNorthCarolina English (North Carolina)  
bdpa   AustralianEnglishPerth Australian English (Perth)  
bdpa   EnglishSingapore English (Singapore)  
bdpa   English English  
bdpa   EnglishTyrone English (Tyrone)  
bdpa   Faroese Faroese  
bdpa   German German  
bdpa   HighGermanNorthAlsace High German (North Alsace)  
bdpa   HighGermanBiel High German (Biel)  
bdpa   HighGermanBodensee High German (Bodensee)  
bdpa   HighGermanGraubuenden High German (Graubuenden)  
bdpa   HighGermanHerrlisheim High German (Herrlisheim)  
bdpa   HighGermanOrtisei High German (Ortisei)  
bdpa   HighGermanTuebingen High German (Tuebingen)  
bdpa   HighGermanWalser High German (Walser)  
bdpa   Icelandic Icelandic  
bdpa   LowGermanAchterhoek Low German (Achterhoek)  
bdpa   LowGermanBargstedt Low German (Bargstedt)  
bdpa   NorwegianStavanger Norwegian (Stavanger)  
bdpa   Scottish Scottish  
bdpa   SwedishSkane Swedish (Skane)  
bdpa   SwedishStockholm Swedish (Stockholm)  
bdpa   WestFrisianGrou West Frisian (Grou)  
bdpa   YiddishNewYork Yiddish (New York)  
bdpa   ProtoGermanic Proto-Germanic  
bdpa   NorthMansi North Mansi  
bdpa   MiddleLozvaMansi Middle Lozva Mansi  
bdpa   LowerLozvaMansi Lower Lozva Mansi  
bdpa   KondaMansi Konda Mansi  
bdpa   TavdaMansi Tavda Mansi  
bdpa   UpperDemjankaKhanti Upper Demjanka Khanti  
bdpa   KondaKhanti Konda Khanti  
bdpa   NizjamKhanti Nizjam Khanti  
bdpa   SherkaliKhanti Sherkali Khanti  
bdpa   VakhKhanti Vakh Khanti  
bdpa   VerkhneKalimskKhanti Verkhne Kalimsk Khanti  
bdpa   VasjuganKhanti Vasjugan Khanti  
bdpa   VartovskojeKhanti Vartovskoje Khanti  
bdpa   LikrisovskojeKhanti Likrisovskoje Khanti  
bdpa   MalyjJuganKhanti Malyj Jugan Khanti  
bdpa   TremjuganKhanti Tremjugan Khanti  
bdpa   JuganKhanti Jugan Khanti  
bdpa   KazimKhanti Kazim Khanti  
bdpa   SinjaKhanti Sinja Khanti  
bdpa   ObdorskKhanti Obdorsk Khanti  
bdpa   PelimkaMansi Pelimka Mansi  
bdpa   Italian Italian  
bdpa   French French  
bdpa   Occitan Occitan  
bdpa   Ligurian Ligurian  
bdpa   LombardWest Lombard (West)  
bdpa   LombardEast Lombard (East)  
bdpa   Ladino Ladino  
bdpa   Venetian Venetian  
bowernpny   Gairi Gairi  
bowernpny   JaruMcC Jaru-McC  
bowernpny   Karree Karree  
bowernpny   KukuYalanjiCurr KukuYalanjiCurr  
bowernpny   Kungadutyi Kungadutyi  
bowernpny   MangalaMcK MangalaMcK  
bowernpny   MangalaNW MangalaNW  
bowernpny   MaryRiverandBunyaBunyaCountry Mary River and Bunya Bunya Country  
bowernpny   MountFreelingDiyari Mount Freeling Diyari  
bowernpny   MudburraMcC Mudburra-McC  
bowernpny   NggoiMwoi Ng'goi Mwoi  
bowernpny   WalmajarriBilliluna WalmajarriBilliluna  
bowernpny   WalmajarriHR WalmajarriHR  
bowernpny   WalmajarriNW WalmajarriNW  
bowernpny   WangkumaraMcDWur WangkumaraMcDWur  
chenhmongmien   WesternQiandong Qiandong, West  
chindialectsurvey   TaungthaWethet-T-1 Taungtha (Wethet) rtc
chindialectsurvey   ThaiphumRengkheng-T-7 Thaiphum (Rengkheng) cth
chindialectsurvey   DoituHetsawlay-U-11 Doitu (Hetsawlay) csj
chindialectsurvey   LaituKhuasung-U-12 Laitu (Khuasung) clj
chindialectsurvey   LaisawThuHtayKung-A Laisaw Thu Htay Kung clj
chindialectsurvey   SonglaiHettui8KarchaungHettui-A Songlai-Hettui 8Karchaung (Hettui) csj
chindialectsurvey   SonglaiMaungUmSong1MaungUmSong-A Songlai-Maung Um (Song) 1Maung Um (Song) csj
chindialectsurvey   LaituAhongdong-A Laitu Ahongdong clj
chindialectsurvey   KaangKruk-A Kaang Kruk ckn
chingelong   Gelong Gelong  
deepadungpalaung   ChuDongGua Chu Dong Gua  
deepadungpalaung   ChaYeQing Cha Ye Qing  
deepadungpalaung   NamHsan Namhsan  
deepadungpalaung   KhunHawt Khun Hawt  
deepadungpalaung   HtanHsan Htan Hsan  
deepadungpalaung   PangKham Pangkham  
deepadungpalaung   ManLoi Man Loi  
deepadungpalaung   NyaungGone Nyaung Gone  
deepadungpalaung   BanPaw Ban Paw  
deepadungpalaung   NoeLae Noe Lae  
deepadungpalaung   PongNuea Pong Nuea  
duonglachi   BanPhungLaChi La Chí Bản Phùng  
duonglachi   NungDinLaChi Nùng Dín  
felekesemitic   Gogot Gogot  
felekesemitic   Oromo Oromo  
gaotb   MenbaCuona Menba (Cuona)  
gaotb   MenbaMotuo Menba (Motuo)  
gaotb   YiMile Yi (Mile)  
gaotb   BikaHani Hani (Bika)  
gaotb   HayaHani Hani (Haya)  
gaotb   HaobaiHani Hani (Haobai)  
gerarditupi   Tenharim Tenharim  
gerarditupi   WayampiJ Wayampí J  
gerarditupi   GuaraniAntigo Guarani Antigo  
hsiuhmongmien   NaMeoTuyenQuang Na Meo (Tuyen Quang)  
hsiuhmongmien   Zhenmin Zhenmin  
hsiuhmongmien   Guncen Guncen  
hsiuhmongmien   Datu Datu  
hsiuhmongmien   Yangpai Yangpai  
hsiuhmongmien   Xiangao Xiang’ao  
hsiuhmongmien   WesternQiandong Heba  
hsiuhmongmien   Baixing Baixing  
kleinewillinghoeferbikwinjen   Joole Joole  
leejaponic   MiddleJapanese Middle Japanese  
leejaponic   Nara Nara  
bremerberta   BelejeGonfoye Beleje Gonfoye  
leekoreanic   Gangwon Gangwon  
peirosaustroasiatic   Bahnar Bahnar bdq
peirosaustroasiatic   Hadang Hadang  
peirosaustroasiatic   Hre Hre hre
peirosaustroasiatic   Je Je  
peirosaustroasiatic   Kadong Kadong  
peirosaustroasiatic   Ma1 Ma1  
peirosaustroasiatic   Ma2 Ma2  
peirosaustroasiatic   Panong Panong  
peirosaustroasiatic   Veh Veh  
peirosaustroasiatic   Bru Bru bru
peirosaustroasiatic   BruVK BruVK xhv
peirosaustroasiatic   Dakkang Dakkang  
peirosaustroasiatic   Kantu Kantu  
peirosaustroasiatic   Mak Mak  
peirosaustroasiatic   Neu Neu  
peirosaustroasiatic   Ong Ong  
peirosaustroasiatic   Taoih Taoih  
peirosaustroasiatic   Iduh Iduh  
peirosaustroasiatic   Khmu Khmu kjg
peirosaustroasiatic   KxinhMul KxinhMul  
peirosaustroasiatic   Pray Pray pry
peirosaustroasiatic   Paliu Paliu  
peirosaustroasiatic   Gantang Gantang  
peirosaustroasiatic   Guanshuang Guanshuang  
peirosaustroasiatic   Khamet Khamet  
peirosaustroasiatic   Khme Khme  
peirosaustroasiatic   Mane Man'e  
peirosaustroasiatic   Mangan Mang'an  
peirosaustroasiatic   Pangpin Pangpin  
peirosaustroasiatic   Plang Plang  
peirosaustroasiatic   Shuangdiang Shuangdiang  
peirosaustroasiatic   Wa Wa  
peirosaustroasiatic   Yongde Yongde  
peirosaustroasiatic   Arem Arem  
peirosaustroasiatic   Cuoi Cuoi  
peirosaustroasiatic   KhaPhong KhaPhong  
peirosaustroasiatic   Liha Liha  
peirosaustroasiatic   MuongKoi MuongKoi  
peirosaustroasiatic   PhongV Phong(V)  
peirosaustroasiatic   Tuum Tuum  
peirosaustroasiatic   ThoMon Tho Mon  
savelyevturkic   CodexCumanicus Cuman  
servamalagasy   AntandroyAmbovombe Antandroy  
servamalagasy   MikeaAmpoakafo Mikea  
servamalagasy   BetsimisarakaFenoarivo-Est Betsimisaraka  
servamalagasy   SakalavaMaintirano Sakalava  
servamalagasy   SakalavaMahajanga Sakalava  
servamalagasy   AntaimoroManakara Antaimoro  
servamalagasy   AntambohoakaMananjary Antambohoaka  
servamalagasy   AntaisakaVangaindrano Antaisaka  
servamalagasy   BetsileoAmbositra Betsileo  
servamalagasy   BetsileoAmbalavao Betsileo  
servamalagasy   AntanalanaItampolo Antanalana  
servamalagasy   AntanosyBezaha Antanosy  
servamalagasy   TanalaIfanadiana Tanala  
servamalagasy   AntanalanaManorofify Antanalana  
servamalagasy   AntandroyToliara Antandroy  
servamalagasy   AntanalanaAnakao Antanalana  
servamalagasy   NosyBorahaAmbodifotatra Nosy Boraha  
servamalagasy   AntanosyBelamoty Antanosy  
servamalagasy   AntandroyTsihombe Antandroy  
suntb   AmdoTibetanBlabrang Tibetan (Amdo:Bla-brang)  
suntb   AmdoTibetanZeku Tibetan (Amdo:Zeku)  
transnewguineaorg   magi-musak Magɨ  
transnewguineaorg   proto-eleman-koriki Proto-Eleman-Koriki  
transnewguineaorg   proto-isumrud Proto-Isumrud  
transnewguineaorg   proto-north-adelbert Proto-North-Adelbert  
transnewguineaorg   proto-pihom Proto-Pihom  
transnewguineaorg   proto-sub-rai Proto-Sub-Rai  
wangbai   Dashi Dashi  
wangbai   Jinxing Jinxing  
wangbai   Mazhelong Mazhelong  
wangbai   Gongxing Gongxing  
wangbai   ProtoBai Proto-Bai  
yanglalo   ProtoLalo Proto-Lalo  
yangyi   Ghomozo Ghomozo  
yangyi   EDiaocao E-Diaocao  
yangyi   EHoushan E-Houshan  
yangyi   ETaoshu E-Taoshu  
yangyi   SEGaoping SE-Gaoping  
yangyi   Nise Nise  
yangyi   Noso Noso  
yangyi   LopeAwuChen2010 Lope (Awu)  
yangyi   LopeAwuYYFC1983 Lope (Awu)2  
yangyi   Lidim Lidim (Tianba)  
zhivlovobugrian   lowerlozvamansi Lower Lozva Mansi  
zhoubizic   ProtoBizic Proto-Bizic  
abvdoceanic   Riwo Riwo  
abvdoceanic   TesmbolUsus Tesmbol (Usus)  
abvdoceanic   SivitiBeterbuJericho Siviti (Beterbu, Jericho)  
abvdoceanic   atarxobuGunwar ßatarxobu (Gunwar)  
abvdoceanic   Najit Najit  
abvdoceanic   AlavasWowoWowo1 Alavas-Wowo (Wowo 1)  
abvdoceanic   MandriFarun16291 Mandri (Farun) 162-91  

20211112-13h09m_sdcorr_languages_errors.csv
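The four problem classes in the list above could be assigned with a check along these lines. This is only a sketch: `family_of` is a hypothetical pre-computed lookup from glottocode to family glottocode (for instance built from pyglottolog), not an actual pyglottolog call, and the labels are made up. The Bookkeeping ID `book1242` is the one from the Glottolog URL quoted earlier in the thread.

```python
BOOKKEEPING = "book1242"  # Glottolog ID of the "Bookkeeping" pseudo-family


def language_problem(glottocode, family_of):
    """Return a label for why a language would be skipped, or None if usable.

    family_of: glottocode -> family glottocode, or None when Glottolog gives
    no family; a glottocode missing from the lookup counts as unknown.
    """
    if not glottocode:
        return "no glottocode"
    if glottocode not in family_of:
        return "unknown to Glottolog"
    family = family_of[glottocode]
    if family is None:
        return "no family"
    if family == BOOKKEEPING:
        return "bookkeeping"
    return None


# Toy lookup; a real one would be derived from a Glottolog checkout.
family_of = {
    "wela1234": "book1242",  # listed under Bookkeeping above
    "kusu1250": None,        # listed with family None above
    "stan1295": "indo1319",  # Standard German, Indo-European
}
labels = [
    language_problem(code, family_of)
    for code in ["", "sema1250", "kusu1250", "wela1234", "stan1295"]
]
```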

LinguList commented 3 years ago

Most of the cases with missing glottocodes are not easy to fix, as we have dialects here, mainly in BDPA, etc. These were often deliberately ignored and would also not be recommended for tests.

SimonGreenhill commented 3 years ago

Yeah, the family == None's look like isolates or proto-languages so should be ignored.

I've updated servamalagasy here, the things in abvdoceanic and transnewguinea have no glottocodes as they're either proto languages glottolog doesn't believe in, or are just not in glottolog (although I figured out Riwo was a variety of Gedaged, so that should now be updated).

I'll look at peirosaustroasiatic shortly and see if I can match some of those up.

SimonGreenhill commented 3 years ago

Note that abvdoceanic will have a big update in a few days which will fix the Riwo issue

XachaB commented 2 years ago

Thanks ! I'm downloading updates regularly so I should get all the nice corrections anytime more come in.

I have two more questions:

  1. In servamalagasy, there are several rows with the same glottocode and name in the language table. Would it be possible to adjust their names to make them distinguishable on the basis of just this pair of values ?
dataset glottocode language_ID language_Name ISO639P3code
servamalagasy anta1255 AntankaranaVohemar Antankarana xmv
servamalagasy bara1369 BaraRanohira Bara bhr
servamalagasy bets1235 BetsileoAmbohimahasoa Betsileo  
servamalagasy maha1309 MahafalyEjeda Mahafaly  
servamalagasy meri1243 MerinaAnalavory Merina  
servamalagasy nort2890 BetsimisarakaMahanoro Betsimisaraka bmm
servamalagasy nort2890 BetsimisarakaAntsiranana Betsimisaraka bmm
servamalagasy nort2890 BetsimisarakaBrickaville Betsimisaraka bmm
servamalagasy nort2890 BetsimisarakaToamasina Betsimisaraka bmm
servamalagasy nort2890 BetsimisarakaMananara Betsimisaraka bmm
servamalagasy nort2890 BetsimisarakaMaroantsetra Betsimisaraka bmm
servamalagasy saka1291 SakalavaMorondava Sakalava skg
servamalagasy saka1291 SakalavaMiandrivazo Sakalava skg
servamalagasy saka1291 SakalavaBesalampy Sakalava skg
servamalagasy saka1291 SakalavaBeloniTsiribihina Sakalava skg
servamalagasy siha1244 SihanakaMoraranoChrome Sihanaka  
servamalagasy siha1244 SihanakaAndilamena Sihanaka  
servamalagasy sout2920 BetsimisarakaSahavato Betsimisaraka bzc
servamalagasy tsim1257 TsimihetyMampikony Tsimihety xmw
servamalagasy tsim1257 TsimihetyAndapa Tsimihety xmw
servamalagasy tsim1257 TsimihetyAntsohihy Tsimihety xmw
servamalagasy vezo1235 VezoMorombe Vezo  
servamalagasy vezo1235 VezoMorondava Vezo  

If not, I can use the area to distinguish them, but this is a problem I encounter only with this single dataset, hence I am checking with you.

  2. I tried guessing when to use partial cognates and when to use full-word cognates based on whether the data had words segmented into morphemes. This turns out not to be ideal. Instead, could we come up with a list (which I am guessing will be short) of language families for which we know in advance that we will need partial cognates ? Doing this manually seems to be more linguistically motivated. List of all families in lexicore attached.

LinguList commented 2 years ago

@XachaB, it is no problem to adjust the names in the lexibank dataset; all one would need to do is modify the name there. But my own take on most datasets that I have worked with so far is that the ID is the better representative of languages for the purpose of plotting their names and storing them, etc., which is why the IDs are not numeric wherever I found time to prepare this. So what I want to say: if you run into problems because your code distinguishes languages by their name, I'd recommend switching to the ID instead, as we also do in cltoolkit for this very reason.
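
To illustrate the point about IDs, here is a minimal sketch in plain Python. The four rows are copied from the servamalagasy table above; the helper `duplicated_keys` and the tuple layout are assumptions made for this example only, not part of cltoolkit or lexitools.

```python
from collections import Counter

# Rows taken from the servamalagasy table quoted in this thread:
# (dataset, glottocode, language_ID, language_Name)
rows = [
    ("servamalagasy", "saka1291", "SakalavaMorondava", "Sakalava"),
    ("servamalagasy", "saka1291", "SakalavaMiandrivazo", "Sakalava"),
    ("servamalagasy", "vezo1235", "VezoMorombe", "Vezo"),
    ("servamalagasy", "vezo1235", "VezoMorondava", "Vezo"),
]


def duplicated_keys(rows, key):
    """Return the sorted keys that identify more than one row (hypothetical helper)."""
    counts = Counter(key(row) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Keying on (glottocode, name) collides for several varieties ...
by_name = duplicated_keys(rows, key=lambda r: (r[1], r[3]))
# ... while keying on (dataset, language_ID) keeps every variety distinct.
by_id = duplicated_keys(rows, key=lambda r: (r[0], r[2]))
```

Including the dataset name in the key is a choice made here so that the same check still works once several datasets are combined.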

LinguList commented 2 years ago

As to the cognate datasets, @XachaB, we can make a list, but be aware that partial cognates are so far mostly only done by myself, so there is a single coder, as nobody else has coded partial cognates so far. My own conviction with respect to sound correspondences is that one always would need some version of partial cognates. But in all lexicore-CogCore datasets, the cognates are typically not partial cognates. Furthermore, if you want to detect which dataset uses partial cognates, you can do so by checking if the segment_slice attribute is defined for the lexeme in CLDF, which we use to render partial cognates.

LinguList commented 2 years ago

Ah, and the final problem is: even if we NEED partial cognates, like for many ST languages, we may not HAVE them. So you'd need a list that tells you which data comes in segmented form. This could in theory also be checked automatically, but you'll encounter diversity, with one dataset for a family being segmented and another one not.

So it is not trivial, if you want to compare ACROSS datasets, what to do here. If you want to compare INSIDE datasets, I can provide all information.

XachaB commented 2 years ago

I was maybe unclear, but indeed, I am not asking if the dataset provides partial cogids (this is easy to see). I am predicting cognates using lingpy, and am wondering when to use Partial and when to use Lexstat.

Of course, I can only use Partial when there are segmented words (which is also easy to check). But using this to guess whether I "should" is sort of a problem, since as you say there may be just a few words, or just a few languages, with segmented words. And it is indeed worse when comparing, as I am doing, across datasets, as two datasets for the same family might not both have the segmentation into morphemes.

I was hoping to improve the situation by:

  1. Making a manual list of families where it makes sense to aim for Partial (eg: ST)*
  2. When there are consistently "+" in the words in these families, use Partial

Do you think there is some hope for this strategy ?


*: I understand your general conviction that partial cognates make more sense, but surely, there must be languages where doing whole word cognates is an acceptable approximation, and some where we just can't do without partial cognates ?
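
The two-step strategy above could be sketched roughly as follows. Everything named here is an assumption for illustration: the family set `PARTIAL_FAMILIES`, the helper names, and the 0.5 cut-off standing in for "consistently" are placeholders, and forms are assumed to be lists of segments with "+" as the morpheme boundary.

```python
# Hypothetical manual list of families where Partial is the target (step 1).
PARTIAL_FAMILIES = {"Sino-Tibetan"}


def morpheme_segmented_share(forms):
    """Share of forms whose segment list contains a "+" morpheme boundary."""
    if not forms:
        return 0.0
    return sum("+" in segments for segments in forms) / len(forms)


def use_partial(family, forms, min_share=0.5):
    """Step 2: use Partial only for listed families with enough segmented forms.

    min_share is an arbitrary placeholder threshold, not a tested value.
    """
    return family in PARTIAL_FAMILIES and morpheme_segmented_share(forms) >= min_share


# Invented toy forms: two of three carry a morpheme boundary.
toy_forms = [
    ["k", "a", "+", "t", "a"],
    ["m", "a"],
    ["p", "i", "+", "s", "u"],
]
```

With these toy forms, `use_partial("Sino-Tibetan", toy_forms)` is true, while any family missing from the manual list falls back to whole-word cognates regardless of segmentation.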

LinguList commented 2 years ago

In fact, Partial in theory yields the same results if you use it on non-segmented wordlists. So you could just say: I use Partial in all cases.

LinguList commented 2 years ago

So this would then solve your problem pragmatically.

LinguList commented 2 years ago

And if you restrict experiments to certain salient contexts, where you have limited gaps in the data, since gaps are either due to sound change, or due to morphemes missing, you could maybe even get along well with lexstat itself, even on segmented datasets.

XachaB commented 2 years ago

That's indeed a pragmatic answer !

The problem will remain of having diversity, with one dataset being segmented for the same family, and one not. In that case, we will get a terrible output if two cognates are on one side segmented and on the other unsegmented.

Should I take that as an argument for stopping cross-dataset comparisons ? I hope not, but I don't really see any obvious solutions.

XachaB commented 2 years ago

Though note that this problem will lead to under-detection, rather than over-detection. Missing some cognates (bad recall) is less of a problem than having bad precision in the specific case of sound correspondences.

LinguList commented 2 years ago

In my opinion, cross-dataset comparisons require quite some preprocessing, which makes them notoriously difficult to handle (e.g. in https://doi.org/10.12688/openreseurope.13843.2 I spent a lot of time on the data and made several concept coverage checks to get the right balance, and I still have lots of missing data). Cross-dataset comparison would require deriving a list of some 300 concepts which have some 80% coverage in all datasets we select, and a mutual coverage of at least 150 words per language pair, without exceptions. So it would be a dataset we derive from the other datasets. In such a dataset, we could just delete all segmentations. But if we go this way, we should start already and make a dedicated lexibank dataset that derives the data and also adds some conversions to the original word forms, thus similar to lexibank-analysed, but with forms. Maybe this is even the best way to go? In this way, we can also kick out low-coverage languages, etc.
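
The two coverage criteria mentioned here (per-concept coverage and mutual coverage per language pair) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names and the toy data are invented, and the input is assumed to be a mapping from language ID to the set of concepts attested for it.

```python
from itertools import combinations


def mutual_coverage(concepts_by_language):
    """Map each language pair to the number of concepts attested in both."""
    return {
        (a, b): len(concepts_by_language[a] & concepts_by_language[b])
        for a, b in combinations(sorted(concepts_by_language), 2)
    }


def concept_coverage(concepts_by_language):
    """Map each concept to the share of languages that attest it."""
    n_languages = len(concepts_by_language)
    counts = {}
    for concepts in concepts_by_language.values():
        for concept in concepts:
            counts[concept] = counts.get(concept, 0) + 1
    return {concept: k / n_languages for concept, k in counts.items()}


# Invented toy data: language ID -> attested concepts.
toy = {
    "A": {"hand", "eye", "water"},
    "B": {"hand", "eye"},
    "C": {"hand", "water", "fire"},
}
pairs = mutual_coverage(toy)
coverage = concept_coverage(toy)
# A language pair passes only if its shared-concept count reaches the chosen
# minimum (150 in the proposal above); the toy data is far too small for that.
min_mutual = min(pairs.values())
```

On real data one would keep concepts whose coverage reaches about 0.8 and then drop languages until every remaining pair meets the mutual-coverage minimum.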

LinguList commented 2 years ago

The more I think about it, the more I think we should do exactly that: make a dedicated NEW dataset using the lexibank-analysed procedure which would give us the best of the best what we have (some ~ 20 language families, high coverage among them, etc.). From there, one would then plug in your code.

XachaB commented 2 years ago

So far my code loads and processes separate datasets. There would be quite a few changes if we moved to a special, smaller compound dataset, with my code then running on it. Let me know in advance if you decide to do that !

Re: entirely removing segmentations, that too is an option even without needing to make a separate dataset, of course.

My preprocessing currently already ignores a lot of data (isolates, proto families, glottolog issues, loan words, etc). If you were to generate a list of concepts, I could easily dynamically pass that to further limit the set of data points I am working on (without needing to create any dedicated dataset).

For now, I have a condition where if the minimum mutual coverage in a family (across datasets) is below 100, I use the SCA method instead of lexstat+infomap. I could also just drop the family in that case, though of course that would even further reduce the amount of usable data.

XachaB commented 2 years ago

> And if you restrict experiments to certain salient contexts, where you have limited gaps in the data, since gaps are either due to sound change, or due to morphemes missing, you could maybe even get along well with lexstat itself, even on segmented datasets.

I hadn't seen this suggestion. Can you clarify what these "salient contexts" would be ? In any case, deletions being because of morphology is another big problem I am encountering.

LinguList commented 2 years ago

LingPy allows one to carry out an automatic syllabification and to derive some basic contexts, like pre-vocalic, post-vocalic, etc. In addition, one can restrict an analysis to word-initial position. Furthermore, profiling alignments by checking how many consecutive gaps occur allows one to keep only those parts of an alignment where a sufficiently large number of columns is filled with sounds, specifically consecutively. Thus, while

would not really surprise me,

would show the loss of a whole syllable and is thus unlikely to result from regular sound change, at least not if you look at shallow time depths.
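
A minimal sketch of the gap-profiling idea in plain Python, independent of LingPy. The function names, the `max_run=1` limit, and the toy alignments are invented for illustration and do not reproduce the examples referred to in this comment; aligned words are assumed to be lists of segments with "-" as the gap symbol.

```python
def longest_gap_run(aligned, gap="-"):
    """Length of the longest run of consecutive gaps in one aligned word."""
    longest = current = 0
    for segment in aligned:
        current = current + 1 if segment == gap else 0
        longest = max(longest, current)
    return longest


def keep_alignment(alignment, max_run=1):
    """Keep an alignment only if no row has more than max_run consecutive gaps.

    max_run=1 is a placeholder; a suitable limit would need testing.
    """
    return all(longest_gap_run(row) <= max_run for row in alignment)


# Invented toy alignments.
single_gap = [["t", "a", "k", "a"], ["t", "a", "-", "a"]]
double_gap = [["t", "a", "k", "a"], ["t", "a", "-", "-"]]
```

Here `single_gap` is kept and `double_gap` is discarded, which mirrors the distinction between an isolated gap and the loss of a longer stretch such as a whole syllable.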

LinguList commented 2 years ago

As to the preprocessing: I see one danger if the code does things en passant and then processes and outputs sound correspondences. The advantage of using the admittedly new idea we pushed in lexibank-analysed is that these steps are made explicit. This helps to debug and to deal with errors directly, already when constructing the CLDF dataset from other CLDF datasets.

All code that has been written could be easily added to specific commands in such a repository, and one could make use of cltoolkit's Wordlist class, which was designed to allow for an easy integration of cldf datasets from different sources.

BTW: I am not sure how reliable the exclusion of loan words is, if it is only annotated sporadically. One should assume that correspondence patterns of low attestation would also allow us to simply exclude those cases later on?