Open XachaB opened 3 years ago
We're currently working on a new package, called cltoolkit, which reads in a CLDF dataset, runs all kinds of consistency checks with regard to lexibank, and can also convert the data to a LingPy wordlist, merging data from different CLDF datasets, with the option to filter.
This is probably the way to work.
You work with languages on something like a genus level, right? If so lexstat won't have a problem in scaling up.
We're still refactoring cltoolkit, but should be done in one or two weeks. Should we then share the relevant code with you? I assume we'll then also make the code public rather quickly, so it can be used as a normal dependency. Since cltoolkit checks on loading that segments conform to CLTS, it is probably also useful for the code on sound correspondence detection.
Thanks ! The package you mention sounds like what I need, and I am very interested in the code.
I am a little bit worried that the sound correspondence project keeps proving Hofstadter's law right: each time I think I have something ready, it ends up requiring a few more weeks of work. Do you think the "one or two weeks" is an optimistic estimate (and it might be a month or two instead), or will there definitely be usable code in a week or two ?
If it's the former, maybe in the meantime I should just write a quick thing that does the work (in the worst case scenario, writing and reading from a bunch of temporary files), so that I can make progress.
@xrotwang is now checking the code to make sure that fundamental problems do not occur, and we'll use the package as a backbone for our lexibank study, which we plan to submit soon.
You can check already now, but it is private still, which is why I'd recommend to wait.
But to run a LexStat analysis for cognate sets across more than one CLDF dataset, some intermediate code is as simple as:

    from lingpy import Wordlist

    idx = 1
    # Pairs of (CLDF column name, LingPy column name)
    namespace = (("id", "lexibank_id"), ("language_id", "doculect"), ("concept_concepticon_gloss", "concept"), ("segments", "tokens"), ("language_glottocode", "glottocode"))
    # Row 0 holds the header of the merged wordlist
    D = {0: [x[1] for x in namespace]}
    for path2cldf in paths2cldf:
        wl = Wordlist.from_cldf(path2cldf, columns=[x[0] for x in namespace], namespace=namespace)
        for idx_ in wl:
            # Keep only rows that are linked to a concept
            if wl[idx_, "concept"]:
                D[idx] = [wl[idx_, col[1]] for col in namespace]
                idx += 1
    wl = Wordlist(D)
    print(wl.height, wl.width)

Just tested this with:

    paths2cldf = ["allenbai/cldf/cldf-metadata.json", "wangbai/cldf/cldf-metadata.json"]

Thanks for these tips ! If I need a LexStat instance rather than a Wordlist, can I do `LexStat.from_cldf` instead ?
I recommend to use that afterwards: you say lex = LexStat(D), since lexstat does internal conversions so it would be faster to load in this way. You can use the code also to group by families, and the like, of course: the Dictionary representation is the internal representation of a LingPy.Wordlist, so you fill this as I have shown, and you can then initiate it with LexStat, Wordlist, and any other class derived from Wordlist.
Ah great, I did not see that LexStat could be initialized from a dictionary of rows ! That is perfect. Looking at the source code it seemed to always need only a file path.
I think this is enough so that I can start writing something: load datasets using `Wordlist.from_cldf`, do any filtering and transformations I need, construct a dictionary for each genus, then initialize LexStat with the dicts, etc.
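The grouping step of that plan can be sketched without LingPy, using only the dictionary layout shown earlier in the thread (key 0 holds the column names, every other integer key holds one row). The helper name `split_by_column`, the `family` column, and the toy rows are made up for illustration; each resulting dictionary would then be handed to `LexStat`, as suggested above.

```python
from collections import defaultdict


def split_by_column(D, column):
    """Split a wordlist dictionary into one dictionary per value of `column`.

    `D` follows the layout used earlier in the thread: key 0 is the header
    row, all other keys map to data rows. This is a hypothetical helper,
    not part of LingPy.
    """
    header = D[0]
    col = header.index(column)
    groups = defaultdict(lambda: {0: list(header)})
    for key in sorted(k for k in D if k != 0):
        row = D[key]
        group = groups[row[col]]
        # Re-number rows from 1 within each group.
        group[len(group)] = list(row)
    return dict(groups)


# Toy data (invented); a real run would fill D from Wordlist.from_cldf.
D = {
    0: ["doculect", "concept", "tokens", "family"],
    1: ["LangA", "hand", "a b", "FamX"],
    2: ["LangB", "hand", "a p", "FamX"],
    3: ["LangC", "hand", "k o", "FamY"],
}
per_family = split_by_column(D, "family")
for family, rows in per_family.items():
    print(family, len(rows) - 1)
```

Re-numbering the rows per group keeps every sub-dictionary self-contained, at the cost of losing the original row keys; if those are needed later, they could be stored in an extra column instead.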
As a consequence of finding that cognate detection in our previous setup was insufficient to be able to claim anything, I have made a big update to the correspondence code. It is now in the branch SdCorrespWithLexStat.
Changes:
Running `cldfbench download cldfbench_lexibank_analysed.py` before running the sound correspondence script is necessary (and more convenient than the previous method).

I will very soon run this on all of lexicore on a server (it has gotten too slow to run on my machine), and should be back with some news when I have a first result. Once I'm sure it runs on everything smoothly, I'll make a PR, no need to inspect the code before then.
@LinguList , @SimonGreenhill , @erichround , all my apologies for taking so much time to make progress on this: unfortunately, this is quite a large amount of work, and I can only work on it part of the time.
I am much, much, more familiar with Lingpy now, which is always a nice perk ;)
Sounds cool, as I have wanted to know for a long time how this works, especially when including the sound correspondence patterns to look for the most stable / most frequently recurring patterns :)
Awesome! btw, let me know if you want to run it on the jena cluster. Looking forward to some results :)
Thanks. For now I'm trying to run it on the Surrey cluster, maybe I'll ask if/when I use up all my allocated resources there !
Right now I keep blowing up the memory I request (64G over 10 CPU is still not enough). I'll see if I can either raise the memory request, or lower the number of CPUs, or maybe rewrite the thing to be parallel in a dumber way, that is to say, first export a csv for each family, then run an entirely separate process on each family.
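The "dumber" parallelisation could look roughly like this: write one CSV per family, then launch an independent Python process per file, so that all memory is released when a family is done. This is only a sketch; the column names, the helpers `export_per_family` and `run_per_family`, and the placeholder worker command are assumptions, not the actual lexitools code.

```python
import csv
import subprocess
import sys
import tempfile
from collections import defaultdict
from pathlib import Path


def export_per_family(rows, out_dir):
    """Write one CSV file per family and return the paths (hypothetical helper)."""
    by_family = defaultdict(list)
    for row in rows:
        by_family[row["family"]].append(row)
    paths = []
    for family, family_rows in sorted(by_family.items()):
        # Assumes family names are safe to use as file names.
        path = Path(out_dir) / f"{family}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(family_rows[0]))
            writer.writeheader()
            writer.writerows(family_rows)
        paths.append(path)
    return paths


def run_per_family(paths, worker_cmd):
    """Run `worker_cmd + [path]` as a separate process for each file, one at a time."""
    results = {}
    for path in paths:
        done = subprocess.run(worker_cmd + [str(path)], capture_output=True, text=True)
        results[path.stem] = (done.returncode, done.stdout.strip())
    return results


# Placeholder worker: counts the data rows of one CSV. The real worker would
# be the correspondence script restricted to a single family.
WORKER = [
    sys.executable,
    "-c",
    "import csv, sys; print(sum(1 for _ in csv.DictReader(open(sys.argv[1], encoding='utf-8'))))",
]

rows = [
    {"family": "FamX", "doculect": "LangA", "concept": "hand"},
    {"family": "FamX", "doculect": "LangB", "concept": "hand"},
    {"family": "FamY", "doculect": "LangC", "concept": "hand"},
]
with tempfile.TemporaryDirectory() as tmp:
    results = run_per_family(export_per_family(rows, tmp), WORKER)
print(results)
```

The sketch runs the families one after another; on a cluster, each command could instead be submitted as its own job with its own memory request.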
I unfortunately have to time box this to Fridays (...and weekends), so I should only get back to it in a week.
I have some little issues happening with Glottolog, maybe one of you @LinguList @SimonGreenhill have the answer ?
1. First, so far I have been ignoring entirely languages given with no glottocodes:
But I am starting to understand that this is a really frequent occurrence, and it seems like a shame to throw out so much data.
The first alternative would be to keep those languages whenever there is a "Family" column in the language table. However, I thought the point of having glottocodes in cldf was to be certain that these types of info are properly standardized, and by using whatever is given as "Family", I open the door to non-meaningful variation. I can always normalize case, but we know that it is still possible that this would lead to duplication of families etc.
A third possibility would be to produce a list of everywhere with missing glottocodes and put some annotator efforts (if you still have time for people doing this) towards filling them in. However, I suspect that in many cases the languages might not be mappable to glottocodes. In that case, what is the best solution ?
What is your point of view on this ?
2. Glottolog has a family called "BookKeeping":
https://glottolog.org/resource/languoid/id/book1242
I imagine this is some meta-info for glottolog maintainers ? But why is it given as a family ? For example, https://github.com/lexibank/chindialectsurvey has several varieties classified as "BookKeeping" (in fact Sino-Tibetan), and this false family leads my script to attempt to find cognates in a set of unrelated languages (while keeping them apart from the rest of their families).
Any idea how to deal with this ? Is the info hidden elsewhere ? Should I fall back on whatever the dataset specifies (see problem in 1.) in that case ?
Ad 1) I'd much prefer putting in some effort to fill the gaps - i.e. assign (or even mint) missing glottocodes. I would want the intuition "no Glottocode => unreliable language info" to be valid.
Ad 2) "Bookkeeping" is a drawer for putative languages in Glottolog that turned out not to be "real". Since Glottolog has a policy of never expiring language-level Glottocodes, these need to be put somewhere. Now, the way Glottolog (technically) groups languages is via file-system directories, and this mechanism is used for both pseudo-families and families. This is somewhat unfortunate, but might be alleviated with more docs and some help from pyglottolog, see https://pyglottolog.readthedocs.io/en/latest/languoids.html#pyglottolog.languoids.Languoid.category
Hi @XachaB,
thanks for your work on these tools and analyses.
Regarding 1:
I'd be interested to learn how many cases like this (i.e. no Glottocodes) come up for your particular configuration? Also I'm not sure I fully understand how a case of `Family exists` but `Glottocode not exists` could happen? Maybe I'm misunderstanding something here. At any rate, I could certainly help with fixing Glottocode issues.
Regarding 2:
Bookkeeping is also explained in some detail here. (Just saw Robert's comment, so I'm keeping this short)
Regarding cluster:
If you'd like I could also help with setting this up here. Plenty of RAM to work with.
Thanks for your quick answers !
1a. I can easily generate a table of all languages & datasets which are missing glottocodes, will do this Friday if I can. Maybe this can be a first step towards fixing them.
1b. @chrzyki Regarding how I might be able to recover Family where there is no Glottocode: I mean that the language table sometimes specifies a Family for each language, so I could get this information without querying glottolog. Ex, many rows without glottocodes but with a family here:
https://github.com/lexibank/chindialectsurvey/blob/master/cldf/languages.csv
Of course, this is really not as good as getting the family from glottolog, which ensures that everything is standardized.
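One way to picture the trade-off in 1b is a lookup that prefers a Glottolog-derived family and only falls back on the dataset's own Family column, with whitespace and case normalised so that trivial spelling variants collapse. `GLOTTOLOG_FAMILY` is a stand-in for a real Glottolog query, the codes and names in it are placeholders, and the column names are assumptions about a typical lexibank language table.

```python
# Stand-in for a real Glottolog lookup (glottocode -> family name).
# The code and family name below are placeholders.
GLOTTOLOG_FAMILY = {"aaaa1234": "FamX"}


def normalise(name):
    """Collapse case and whitespace differences in a free-text family name."""
    return " ".join(name.split()).casefold()


def resolve_family(language):
    """Return (family, source) for one row of a language table.

    `language` is a dict with optional "Glottocode" and "Family" keys.
    """
    glottocode = (language.get("Glottocode") or "").strip()
    if glottocode in GLOTTOLOG_FAMILY:
        return GLOTTOLOG_FAMILY[glottocode], "glottolog"
    family = (language.get("Family") or "").strip()
    if family:
        return normalise(family), "dataset"
    return None, "unknown"


languages = [
    {"ID": "LangA", "Glottocode": "aaaa1234", "Family": "famx"},
    {"ID": "LangB", "Glottocode": "", "Family": "Sino-Tibetan"},
    {"ID": "LangC", "Glottocode": "", "Family": " sino-tibetan "},
    {"ID": "LangD", "Glottocode": "", "Family": ""},
]
resolved = {lang["ID"]: resolve_family(lang) for lang in languages}
print(resolved)
```

Returning the source alongside the family makes it possible to keep dataset-supplied families apart from standardized ones later on. Normalising case and whitespace does not catch genuinely different names for the same family, which is the duplication risk raised earlier.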
2) So, does that mean I should simply ignore all of the data which is in "bookkeeping" ? I understand that these languoids are not seen as real by Glottolog editors (and maybe the wider linguistic community), but we do have lexibank datasets with these languoids. For example, datasets such as joophonosemantic, gaotb, chindialectsurvey, servamalagasy have languages classified as BookKeeping.
Re: server, thanks for the offer, I'll definitely follow up on it if I need !
Yes, I think data for "bookkeeping" languages should be ignored. Ideally, it should be possible (and - again ideally - not too hard) to figure out matching non-bookkeeping glottocodes, because often there are associated ISO change requests which recommend remedies.
Noted, then I'll make a list of both languages without glottocodes & languages with bookkeeping codes, as both require manual annotations.
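Such a list could be produced along these lines. The sketch reads a language table in CSV form and sorts each row into "missing glottocode", "bookkeeping", or "ok". The `FAMILY_OF` mapping stands in for a real Glottolog lookup, `classify` is a hypothetical helper, and the codes in the sample are placeholders.

```python
import csv
import io

# Stand-in for a Glottolog lookup (glottocode -> top-level family name).
FAMILY_OF = {"aaaa1234": "FamX", "bbbb1234": "Bookkeeping"}


def classify(rows, family_of):
    """Group language-table rows by the kind of manual fix they need."""
    report = {"missing_glottocode": [], "bookkeeping": [], "ok": []}
    for row in rows:
        glottocode = (row.get("Glottocode") or "").strip()
        if not glottocode:
            report["missing_glottocode"].append(row["ID"])
        elif family_of.get(glottocode, "").casefold() == "bookkeeping":
            report["bookkeeping"].append(row["ID"])
        else:
            report["ok"].append(row["ID"])
    return report


# Inline sample standing in for a dataset's cldf/languages.csv.
SAMPLE = """ID,Name,Glottocode
LangA,Language A,aaaa1234
LangB,Language B,bbbb1234
LangC,Language C,
"""
report = classify(csv.DictReader(io.StringIO(SAMPLE)), FAMILY_OF)
print(report)
```

Running the same function over every dataset and concatenating the two problem lists would give one table per problem type.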
I have to say, I find it exciting that writing analyses at this larger scale helps improve the individual datasets. :)
Agree with everything above, happy to hunt for glottocodes too.
Re bookkeeping -- I wonder how many of these can be 'fixed' (i.e. they are languages we have data from so they're not spurious)
Here I am with a list of datasets with missing glottocodes, or glottocodes with some issues. For the full list see the attached file.
In total this amounts to 424 languages which currently are ignored by the lexitools/correspondences tool.
Here it is split by problem:
dataset | glottocode | language_ID | language_Name | ISO639P3code |
---|---|---|---|---|
chindialectsurvey | wela1234 | RawngtuWeilong-A | Rawngtu Weilong | weu |
chindialectsurvey | wela1234 | RawngtuRamtim-A | Rawngtu Ramtim | weu |
gaotb | yuan1242 | MojiangYi | Yi (Mojiang) | yym |
gaotb | naxi1246 | LijangNaxi | Naxi (Lijiang) | nbf |
gaotb | naxi1246 | YongningNaxi | Naxi (Yongning) | nbf |
johanssonsoundsymbolic | lenc1244 | Lencasalvador | Lenca-Salvador | len |
johanssonsoundsymbolic | sana1281 | Sanapanaangaite | Sanapaná (Angaité) | sap |
joophonosemantic | chua1256 | ChuanqiandianClusterMiao | Chuanqiandian Cluster Miao | cqd |
servamalagasy | sout3125 | BetsimisarakaMarolambo | Betsimisaraka | bjq |
dataset | glottocode | language_ID | language_Name | ISO639P3code |
---|---|---|---|---|
dunnaslian | sema1250 | Semnam_Malau | Semnam Malau | ssm |
dunnaslian | teim1246 | Temiar_Perak | Temiar Perak | tea |
dunnaslian | monn1258 | Mon | Mon | mnw |
kesslersignificance | nucl1201 | Turkish | Turkish | |
polyglottaafricana | maka1261 | MakhuwaMeetto | Makhuwa-Meetto | |
saenkoromance | vall1248 | valladerromansh | Vallader_Romansh | |
saenkoromance | cagl1238 | campidanese | Campidanese | |
servamalagasy | meri1291 | MerinaMaevatanana | Merina | |
sidwellbahnaric | kass1248 | kasseng | Kasseng | |
transnewguineaorg | cent2257 | proto-central-sogeram | Proto-Central-Sogeram | |
zgraggenmadang | sali1249 | maia-saki | maia-saki | |
zgraggenmadang | para1207 | parawen | parawen |
see issue https://github.com/lexibank/lexibank-analysed/issues/35 -- for some it might be a question of re-generating after accepting my merge requests.
None
dataset | glottocode | language_ID | language_Name | ISO639P3code |
---|---|---|---|---|
aaleykusunda | kusu1250 | KusundaGM | Gyani Maiya | kgg |
aaleykusunda | kusu1250 | KusundaK | Kamala | kgg |
aaleykusunda | kusu1250 | Kusunda | Kusunda | kgg |
abrahammonpa | hrus1242 | HrusoAkaJamiri | Hruso Aka Jamiri | hru |
chaconcolumbian | cams1241 | Kamsa | kamsá | kbh |
chaconcolumbian | puin1248 | Puinave | puinave | pui |
chaconcolumbian | paez1247 | Paez | páez | pbb |
chacontukanoan | tuca1253 | Prototucanoan | Proto-Tucanoan | |
hantganbangime | bang1363 | Bangime | Bangime | dba |
hubercolumbian | cams1241 | Kamsa | Kamsá | kbh |
hubercolumbian | puin1248 | Puinave | Puinave | pui |
hubercolumbian | paez1247 | Paez | Páez | pbb |
johanssonsoundsymbolic | abun1252 | Abun | Abun | kgr |
johanssonsoundsymbolic | alse1251 | Alsea | Alsea | aes |
johanssonsoundsymbolic | anda1286 | Andaqui | Andaqui | ana |
johanssonsoundsymbolic | atak1252 | Atakapa | Atakapa | aqp |
johanssonsoundsymbolic | bang1363 | Bangime | Bangime | dba |
johanssonsoundsymbolic | basq1248 | Basque | Basque | eus |
johanssonsoundsymbolic | bert1248 | Berta | Berta | wti |
johanssonsoundsymbolic | buru1296 | Burushaski | Burushaski | bsk |
johanssonsoundsymbolic | cand1248 | CandoshiShapra | Candoshi-Shapra | cbu |
johanssonsoundsymbolic | cayu1262 | Cayuvava | Cayuvava | cyb |
johanssonsoundsymbolic | cofa1242 | Cofan | Cofán | con |
johanssonsoundsymbolic | cuit1236 | Cuitlatec | Cuitlatec | cuy |
johanssonsoundsymbolic | esse1238 | Esselen | Esselen | esq |
johanssonsoundsymbolic | fasu1242 | Fasu | Fasu | faa |
johanssonsoundsymbolic | gaga1251 | Gagadu | Gagadu | gbu |
johanssonsoundsymbolic | puel1244 | Gununakune | Gününa Küne | pue |
johanssonsoundsymbolic | hadz1240 | Hadza | Hadza | hts |
johanssonsoundsymbolic | hrus1242 | Hruso | Hruso | hru |
johanssonsoundsymbolic | iton1250 | Itonama | Itonama | ito |
johanssonsoundsymbolic | kano1245 | Kanoe | Kanoê | kxo |
johanssonsoundsymbolic | karo1304 | Karok | Karok | kyh |
johanssonsoundsymbolic | klam1254 | Klamath | Klamath | kla |
johanssonsoundsymbolic | kunz1244 | Kunza | Kunza | kuz |
johanssonsoundsymbolic | kuot1243 | Kuot | Kuot | kto |
johanssonsoundsymbolic | kwaz1243 | Kwaza | Kwaza | xwa |
johanssonsoundsymbolic | lavu1241 | Lavukaleve | Lavukaleve | lvk |
johanssonsoundsymbolic | lule1238 | Lule | Lule | ule |
johanssonsoundsymbolic | maib1239 | Maybrat | Maybrat | ayz |
johanssonsoundsymbolic | mose1249 | Moseten | Mosetén | cas |
johanssonsoundsymbolic | movi1243 | Movima | Movima | mzp |
johanssonsoundsymbolic | muni1258 | Muniche | Muniche | myr |
johanssonsoundsymbolic | nara1262 | Nara | Nara | nrb |
johanssonsoundsymbolic | natc1249 | Natchez | Natchez | ncz |
johanssonsoundsymbolic | bira1253 | Ongota | Ongota | bxe |
johanssonsoundsymbolic | paez1247 | Paez | Páez | pbb |
johanssonsoundsymbolic | pele1245 | PeleAta | Pele-Ata | ata |
johanssonsoundsymbolic | pira1253 | Piraha | Pirahã | myp |
johanssonsoundsymbolic | puin1248 | Puinave | Puinave | pui |
johanssonsoundsymbolic | pume1238 | Pume | Pumé | yae |
johanssonsoundsymbolic | sali1253 | Salinan | Salinan | sln |
johanssonsoundsymbolic | sand1273 | Sandawe | Sandawe | sad |
johanssonsoundsymbolic | savo1255 | Savosavo | Savosavo | svs |
johanssonsoundsymbolic | seri1257 | Seri | Seri | sei |
johanssonsoundsymbolic | shom1245 | ShomPeng | Shom Peng | sii |
johanssonsoundsymbolic | sulk1246 | Sulka | Sulka | sua |
johanssonsoundsymbolic | sume1241 | Sumerian | Sumerian | sux |
johanssonsoundsymbolic | take1257 | Takelma | Takelma | tkm |
johanssonsoundsymbolic | taus1253 | Taushiro | Taushiro | trr |
johanssonsoundsymbolic | timu1245 | Timucua | Timucua | tjm |
johanssonsoundsymbolic | tiwi1244 | Tiwi | Tiwi | tiw |
johanssonsoundsymbolic | trum1247 | Trumai | Trumai | tpy |
johanssonsoundsymbolic | tuni1252 | Tunica | Tunica | tun |
johanssonsoundsymbolic | urar1246 | Urarina | Urarina | ura |
johanssonsoundsymbolic | wage1238 | Wageman | Wageman | waq |
johanssonsoundsymbolic | waor1240 | Waorani | Waorani | auc |
johanssonsoundsymbolic | wara1303 | Warao | Warao | wba |
johanssonsoundsymbolic | wash1253 | Washo | Washo | was |
johanssonsoundsymbolic | yama1264 | Yamana | Yámana | yag |
johanssonsoundsymbolic | yana1271 | Yana | Yana | ynn |
johanssonsoundsymbolic | yele1255 | Yele | Yele | yle |
johanssonsoundsymbolic | yura1255 | Yuracare | Yuracaré | yuz |
johanssonsoundsymbolic | zuni1245 | Zuni | Zuni | zun |
joophonosemantic | basq1248 | Basque | Basque | eus |
joophonosemantic | buru1296 | Burushaski | Burushaski | bsk |
joophonosemantic | paez1247 | Paez | Páez | pbb |
joophonosemantic | sand1273 | Sandawe | Sandawe | sad |
joophonosemantic | wara1303 | Warao | Warao | wba |
joophonosemantic | maib1239 | MaiBrat | Mai Brat | ayz |
northeuralex | buru1296 | bsk | Burushaski | bsk |
northeuralex | basq1248 | eus | Basque | eus |
pharaocoracholaztecan | utoa1244 | ProtoUtoAztecan | PUA | |
transnewguineaorg | abun1252 | abun | Abun | kgr |
transnewguineaorg | abun1252 | abun-jembun | Abun (Jembun Dialect) | kgr |
transnewguineaorg | abun1252 | abun-senopi | Abun (Senopi Dialect) | kgr |
transnewguineaorg | boga1247 | bogaya | Bogaya | boq |
transnewguineaorg | burm1264 | burmeso | Burmeso | bzu |
transnewguineaorg | dama1272 | damal | Damal | uhn |
transnewguineaorg | demm1245 | dem | Dem | dem |
transnewguineaorg | dibi1240 | dibiyaso | Dibiyaso | dby |
transnewguineaorg | duna1248 | duna | Duna | duc |
transnewguineaorg | else1239 | elseng | Elseng | mrf |
transnewguineaorg | fasu1242 | fasu | Fasu | faa |
transnewguineaorg | kaki1249 | kaki-ae | Kaki Ae | tbd |
transnewguineaorg | kapo1250 | kapauri | Kapauri | khp |
transnewguineaorg | kehu1238 | keuw | Keuw | khh |
transnewguineaorg | kibi1239 | kibiri | Kibiri | prm |
transnewguineaorg | kolp1236 | kol | Kol | kol |
transnewguineaorg | kuot1243 | kuot | Kuot | kto |
transnewguineaorg | lavu1241 | lavukaleve | Lavukaleve | lvk |
transnewguineaorg | maib1239 | mai-brat | Mai Brat | ayz |
transnewguineaorg | mawe1251 | mawes | Mawes | mgk |
transnewguineaorg | maib1239 | maybrat | Maybrat | ayz |
transnewguineaorg | touo1238 | mbaniata | Mbaniata | tqu |
transnewguineaorg | bilu1245 | mbaniata-lokuru | Mbaniata (Lokuru Dialect) | blb |
transnewguineaorg | bilu1245 | mbilua | Mbilua | blb |
transnewguineaorg | bilu1245 | mbilua-ndovele | Mbilua (Ndovele Dialect) | blb |
transnewguineaorg | molo1262 | molof | Molof | msl |
transnewguineaorg | morb1239 | mor | Mor | moq |
transnewguineaorg | moro1289 | morori | Morori | mok |
transnewguineaorg | mpur1239 | mpur | Mpur | akc |
transnewguineaorg | mpur1239 | mpur-arfu | Mpur (Arfu Dialect) | akc |
transnewguineaorg | mpur1239 | mpur-kebar | Mpur (Kebar Dialect) | akc |
transnewguineaorg | yale1246 | nagatiman | Nagatiman | nce |
transnewguineaorg | fasu1242 | namumi | Fasu (Namumi Dialect) | faa |
transnewguineaorg | odia1239 | odiai | Odiai | bhf |
transnewguineaorg | papi1255 | papi | Papi | ppe |
transnewguineaorg | pawa1255 | pawaia | Pawaia | pwa |
transnewguineaorg | nucl1580 | proto-eleman | Proto-Eleman | |
transnewguineaorg | koia1260 | proto-koiarian | Proto-Koiarian | |
transnewguineaorg | kwal1257 | proto-kwalean | Proto-Kwalean | |
transnewguineaorg | lake1255 | proto-lakes-plain | Proto-Lakes-Plain | |
transnewguineaorg | lowe1437 | proto-lower-sepik | Proto-Lower-Sepik | |
transnewguineaorg | manu1261 | proto-manubaran | Proto-Manubaran | |
transnewguineaorg | nduu1242 | proto-ndu | Proto-Ndu | |
transnewguineaorg | nucl1709 | proto-trans-new-guinea | Proto-Trans-New-Guinea | |
transnewguineaorg | pura1257 | purari | Purari | iar |
transnewguineaorg | pyuu1245 | pyu | Pyu | pby |
transnewguineaorg | saus1247 | sause | Sause | sao |
transnewguineaorg | savo1255 | savosavo | Savosavo | svs |
transnewguineaorg | tabo1241 | tabo | Tabo | knv |
transnewguineaorg | tana1288 | tanahmerah | Tanahmerah | tcm |
transnewguineaorg | usku1243 | usku | Usku | ulf |
transnewguineaorg | wiru1244 | wiru | Wiru | wiu |
transnewguineaorg | yetf1238 | yetfa | Yetfa | yet |
utoaztecan | coah1252 | Coahuilteco | Coahuilteco | xcw |
utoaztecan | coto1248 | Cotaname | Cotaname | xcn |
utoaztecan | kara1289 | Karankawa | Karankawa | zkk |
utoaztecan | kere1287 | ProtoKeresan | Proto-Keresan | |
utoaztecan | zuni1245 | Zuni | Zuni | zun |
dataset | language_ID | language_Name | ISO639P3code | glottocode |
---|---|---|---|---|
backstromnorthernpakistan | Shimshal | Shimshal | ||
backstromnorthernpakistan | Chapursan | Chapursan | ||
backstromnorthernpakistan | Gupis | Gupis | ||
backstromnorthernpakistan | Gahorabad | Gahorabad | ||
backstromnorthernpakistan | DashkinAstor | Dashkin (Astor) | ||
backstromnorthernpakistan | KachuraJel | Kachura (Jel) | ||
backstromnorthernpakistan | Gultari | Gultari | ||
bdpa | Chimborazo | Chimborazo | ||
bdpa | Tena | Tena | ||
bdpa | Inkawasi | Inkawasi | ||
bdpa | Cajamarca | Cajamarca | ||
bdpa | Corongo | Corongo | ||
bdpa | Caraz | Caraz | ||
bdpa | Chavin | Chavín | ||
bdpa | Huancayo | Huancayo | ||
bdpa | Huancavelica | Huancavelica | ||
bdpa | Cuzco | Cuzco | ||
bdpa | Puno | Puno | ||
bdpa | Taquile | Taquile | ||
bdpa | Apolobamba | Apolobamba | ||
bdpa | Cochabamba | Cochabamba | ||
bdpa | Sucre | Sucre | ||
bdpa | Kawki | Kawki | ||
bdpa | Jaqaru | Jaqaru | ||
bdpa | Huancane | Huancané | ||
bdpa | Tiwanaku | Tiwanaku | ||
bdpa | Oruro | Oruro | ||
bdpa | Dashi | Dàshí | ||
bdpa | Gongxing | Gōngxìng | ||
bdpa | Jinxing | Jīnxīng | ||
bdpa | Mazhelong | Mǎzhělóng | ||
bdpa | AmericanEnglish | American English | ||
bdpa | CanadianEnglish | Canadian English | ||
bdpa | CentralGermanCologne | Central German (Cologne) | ||
bdpa | CentralGermanHonigberg | Central German (Honigberg) | ||
bdpa | CentralGermanLuxembourg | Central German (Luxembourg) | ||
bdpa | CentralGermanMurrhardt | Central German (Murrhardt) | ||
bdpa | Danish | Danish | ||
bdpa | DutchAntwerp | Dutch (Antwerp) | ||
bdpa | BelgianDutch | Belgian Dutch | ||
bdpa | DutchLimburg | Dutch (Limburg) | ||
bdpa | DutchOstend | Dutch (Ostend) | ||
bdpa | Dutch | Dutch | ||
bdpa | NewZealandEnglishAuckland | New Zealand English (Auckland) | ||
bdpa | EnglishBuckie | English (Buckie) | ||
bdpa | IndianEnglishDelhi | Indian English (Delhi) | ||
bdpa | NigerianEnglishIgbo | Nigerian English (Igbo) | ||
bdpa | SouthAfricanEnglishJohannisburg | South African English (Johannisburg) | ||
bdpa | EnglishLindisfarne | English (Lindisfarne) | ||
bdpa | EnglishLiverpool | English (Liverpool | ||
bdpa | EnglishLondon | English (London | ||
bdpa | EnglishNorthCarolina | English (North Carolina) | ||
bdpa | AustralianEnglishPerth | Australian English (Perth) | ||
bdpa | EnglishSingapore | English (Singapore) | ||
bdpa | English | English | ||
bdpa | EnglishTyrone | English (Tyrone) | ||
bdpa | Faroese | Faroese | ||
bdpa | German | German | ||
bdpa | HighGermanNorthAlsace | High German (North Alsace) | ||
bdpa | HighGermanBiel | High German (Biel) | ||
bdpa | HighGermanBodensee | High German (Bodensee) | ||
bdpa | HighGermanGraubuenden | High German (Graubuenden) | ||
bdpa | HighGermanHerrlisheim | High German (Herrlisheim) | ||
bdpa | HighGermanOrtisei | High German (Ortisei) | ||
bdpa | HighGermanTuebingen | High German (Tuebingen) | ||
bdpa | HighGermanWalser | High German (Walser) | ||
bdpa | Icelandic | Icelandic | ||
bdpa | LowGermanAchterhoek | Low German (Achterhoek) | ||
bdpa | LowGermanBargstedt | Low German (Bargstedt) | ||
bdpa | NorwegianStavanger | Norwegian (Stavanger) | ||
bdpa | Scottish | Scottish | ||
bdpa | SwedishSkane | Swedish (Skane) | ||
bdpa | SwedishStockholm | Swedish (Stockholm) | ||
bdpa | WestFrisianGrou | West Frisian (Grou) | ||
bdpa | YiddishNewYork | Yiddish (New York) | ||
bdpa | ProtoGermanic | Proto-Germanic | ||
bdpa | NorthMansi | North Mansi | ||
bdpa | MiddleLozvaMansi | Middle Lozva Mansi | ||
bdpa | LowerLozvaMansi | Lower Lozva Mansi | ||
bdpa | KondaMansi | Konda Mansi | ||
bdpa | TavdaMansi | Tavda Mansi | ||
bdpa | UpperDemjankaKhanti | Upper Demjanka Khanti | ||
bdpa | KondaKhanti | Konda Khanti | ||
bdpa | NizjamKhanti | Nizjam Khanti | ||
bdpa | SherkaliKhanti | Sherkali Khanti | ||
bdpa | VakhKhanti | Vakh Khanti | ||
bdpa | VerkhneKalimskKhanti | Verkhne Kalimsk Khanti | ||
bdpa | VasjuganKhanti | Vasjugan Khanti | ||
bdpa | VartovskojeKhanti | Vartovskoje Khanti | ||
bdpa | LikrisovskojeKhanti | Likrisovskoje Khanti | ||
bdpa | MalyjJuganKhanti | Malyj Jugan Khanti | ||
bdpa | TremjuganKhanti | Tremjugan Khanti | ||
bdpa | JuganKhanti | Jugan Khanti | ||
bdpa | KazimKhanti | Kazim Khanti | ||
bdpa | SinjaKhanti | Sinja Khanti | ||
bdpa | ObdorskKhanti | Obdorsk Khanti | ||
bdpa | PelimkaMansi | Pelimka Mansi | ||
bdpa | Italian | Italian | ||
bdpa | French | French | ||
bdpa | Occitan | Occitan | ||
bdpa | Ligurian | Ligurian | ||
bdpa | LombardWest | Lombard (West) | ||
bdpa | LombardEast | Lombard (East) | ||
bdpa | Ladino | Ladino | ||
bdpa | Venetian | Venetian | ||
bowernpny | Gairi | Gairi | ||
bowernpny | JaruMcC | Jaru-McC | ||
bowernpny | Karree | Karree | ||
bowernpny | KukuYalanjiCurr | KukuYalanjiCurr | ||
bowernpny | Kungadutyi | Kungadutyi | ||
bowernpny | MangalaMcK | MangalaMcK | ||
bowernpny | MangalaNW | MangalaNW | ||
bowernpny | MaryRiverandBunyaBunyaCountry | Mary River and Bunya Bunya Country | ||
bowernpny | MountFreelingDiyari | Mount Freeling Diyari | ||
bowernpny | MudburraMcC | Mudburra-McC | ||
bowernpny | NggoiMwoi | Ng'goi Mwoi | ||
bowernpny | WalmajarriBilliluna | WalmajarriBilliluna | ||
bowernpny | WalmajarriHR | WalmajarriHR | ||
bowernpny | WalmajarriNW | WalmajarriNW | ||
bowernpny | WangkumaraMcDWur | WangkumaraMcDWur | ||
chenhmongmien | WesternQiandong | Qiandong, West | ||
chindialectsurvey | TaungthaWethet-T-1 | Taungtha (Wethet) | rtc | |
chindialectsurvey | ThaiphumRengkheng-T-7 | Thaiphum (Rengkheng) | cth | |
chindialectsurvey | DoituHetsawlay-U-11 | Doitu (Hetsawlay) | csj | |
chindialectsurvey | LaituKhuasung-U-12 | Laitu (Khuasung) | clj | |
chindialectsurvey | LaisawThuHtayKung-A | Laisaw Thu Htay Kung | clj | |
chindialectsurvey | SonglaiHettui8KarchaungHettui-A | Songlai-Hettui 8Karchaung (Hettui) | csj | |
chindialectsurvey | SonglaiMaungUmSong1MaungUmSong-A | Songlai-Maung Um (Song) 1Maung Um (Song) | csj | |
chindialectsurvey | LaituAhongdong-A | Laitu Ahongdong | clj | |
chindialectsurvey | KaangKruk-A | Kaang Kruk | ckn | |
chingelong | Gelong | Gelong | ||
deepadungpalaung | ChuDongGua | Chu Dong Gua | ||
deepadungpalaung | ChaYeQing | Cha Ye Qing | ||
deepadungpalaung | NamHsan | Namhsan | ||
deepadungpalaung | KhunHawt | Khun Hawt | ||
deepadungpalaung | HtanHsan | Htan Hsan | ||
deepadungpalaung | PangKham | Pangkham | ||
deepadungpalaung | ManLoi | Man Loi | ||
deepadungpalaung | NyaungGone | Nyaung Gone | ||
deepadungpalaung | BanPaw | Ban Paw | ||
deepadungpalaung | NoeLae | Noe Lae | ||
deepadungpalaung | PongNuea | Pong Nuea | ||
duonglachi | BanPhungLaChi | La Chí Bản Phùng | ||
duonglachi | NungDinLaChi | Nùng Dín | ||
felekesemitic | Gogot | Gogot | ||
felekesemitic | Oromo | Oromo | ||
gaotb | MenbaCuona | Menba (Cuona) | ||
gaotb | MenbaMotuo | Menba (Motuo) | ||
gaotb | YiMile | Yi (Mile) | ||
gaotb | BikaHani | Hani (Bika) | ||
gaotb | HayaHani | Hani (Haya) | ||
gaotb | HaobaiHani | Hani (Haobai) | ||
gerarditupi | Tenharim | Tenharim | ||
gerarditupi | WayampiJ | Wayampí J | ||
gerarditupi | GuaraniAntigo | Guarani Antigo | ||
hsiuhmongmien | NaMeoTuyenQuang | Na Meo (Tuyen Quang) | ||
hsiuhmongmien | Zhenmin | Zhenmin | ||
hsiuhmongmien | Guncen | Guncen | ||
hsiuhmongmien | Datu | Datu | ||
hsiuhmongmien | Yangpai | Yangpai | ||
hsiuhmongmien | Xiangao | Xiang’ao | ||
hsiuhmongmien | WesternQiandong | Heba | ||
hsiuhmongmien | Baixing | Baixing | ||
kleinewillinghoeferbikwinjen | Joole | Joole | ||
leejaponic | MiddleJapanese | Middle Japanese | ||
leejaponic | Nara | Nara | ||
bremerberta | BelejeGonfoye | Beleje Gonfoye | ||
leekoreanic | Gangwon | Gangwon | ||
peirosaustroasiatic | Bahnar | Bahnar | bdq | |
peirosaustroasiatic | Hadang | Hadang | ||
peirosaustroasiatic | Hre | Hre | hre | |
peirosaustroasiatic | Je | Je | ||
peirosaustroasiatic | Kadong | Kadong | ||
peirosaustroasiatic | Ma1 | Ma1 | ||
peirosaustroasiatic | Ma2 | Ma2 | ||
peirosaustroasiatic | Panong | Panong | ||
peirosaustroasiatic | Veh | Veh | ||
peirosaustroasiatic | Bru | Bru | bru | |
peirosaustroasiatic | BruVK | BruVK | xhv | |
peirosaustroasiatic | Dakkang | Dakkang | ||
peirosaustroasiatic | Kantu | Kantu | ||
peirosaustroasiatic | Mak | Mak | ||
peirosaustroasiatic | Neu | Neu | ||
peirosaustroasiatic | Ong | Ong | ||
peirosaustroasiatic | Taoih | Taoih | ||
peirosaustroasiatic | Iduh | Iduh | ||
peirosaustroasiatic | Khmu | Khmu | kjg | |
peirosaustroasiatic | KxinhMul | KxinhMul | ||
peirosaustroasiatic | Pray | Pray | pry | |
peirosaustroasiatic | Paliu | Paliu | ||
peirosaustroasiatic | Gantang | Gantang | ||
peirosaustroasiatic | Guanshuang | Guanshuang | ||
peirosaustroasiatic | Khamet | Khamet | ||
peirosaustroasiatic | Khme | Khme | ||
peirosaustroasiatic | Mane | Man'e | ||
peirosaustroasiatic | Mangan | Mang'an | ||
peirosaustroasiatic | Pangpin | Pangpin | ||
peirosaustroasiatic | Plang | Plang | ||
peirosaustroasiatic | Shuangdiang | Shuangdiang | ||
peirosaustroasiatic | Wa | Wa | ||
peirosaustroasiatic | Yongde | Yongde | ||
peirosaustroasiatic | Arem | Arem | ||
peirosaustroasiatic | Cuoi | Cuoi | ||
peirosaustroasiatic | KhaPhong | KhaPhong | ||
peirosaustroasiatic | Liha | Liha | ||
peirosaustroasiatic | MuongKoi | MuongKoi | ||
peirosaustroasiatic | PhongV | Phong(V) | ||
peirosaustroasiatic | Tuum | Tuum | ||
peirosaustroasiatic | ThoMon | Tho Mon | ||
savelyevturkic | CodexCumanicus | Cuman | ||
servamalagasy | AntandroyAmbovombe | Antandroy | ||
servamalagasy | MikeaAmpoakafo | Mikea | ||
servamalagasy | BetsimisarakaFenoarivo-Est | Betsimisaraka | ||
servamalagasy | SakalavaMaintirano | Sakalava | ||
servamalagasy | SakalavaMahajanga | Sakalava | ||
servamalagasy | AntaimoroManakara | Antaimoro | ||
servamalagasy | AntambohoakaMananjary | Antambohoaka | ||
servamalagasy | AntaisakaVangaindrano | Antaisaka | ||
servamalagasy | BetsileoAmbositra | Betsileo | ||
servamalagasy | BetsileoAmbalavao | Betsileo | ||
servamalagasy | AntanalanaItampolo | Antanalana | ||
servamalagasy | AntanosyBezaha | Antanosy | ||
servamalagasy | TanalaIfanadiana | Tanala | ||
servamalagasy | AntanalanaManorofify | Antanalana | ||
servamalagasy | AntandroyToliara | Antandroy | ||
servamalagasy | AntanalanaAnakao | Antanalana | ||
servamalagasy | NosyBorahaAmbodifotatra | Nosy Boraha | ||
servamalagasy | AntanosyBelamoty | Antanosy | ||
servamalagasy | AntandroyTsihombe | Antandroy | ||
suntb | AmdoTibetanBlabrang | Tibetan (Amdo:Bla-brang) | ||
suntb | AmdoTibetanZeku | Tibetan (Amdo:Zeku) | ||
transnewguineaorg | magi-musak | Magɨ | ||
transnewguineaorg | proto-eleman-koriki | Proto-Eleman-Koriki | ||
transnewguineaorg | proto-isumrud | Proto-Isumrud | ||
transnewguineaorg | proto-north-adelbert | Proto-North-Adelbert | ||
transnewguineaorg | proto-pihom | Proto-Pihom | ||
transnewguineaorg | proto-sub-rai | Proto-Sub-Rai | ||
wangbai | Dashi | Dashi | ||
wangbai | Jinxing | Jinxing | ||
wangbai | Mazhelong | Mazhelong | ||
wangbai | Gongxing | Gongxing | ||
wangbai | ProtoBai | Proto-Bai | ||
yanglalo | ProtoLalo | Proto-Lalo | ||
yangyi | Ghomozo | Ghomozo | ||
yangyi | EDiaocao | E-Diaocao | ||
yangyi | EHoushan | E-Houshan | ||
yangyi | ETaoshu | E-Taoshu | ||
yangyi | SEGaoping | SE-Gaoping | ||
yangyi | Nise | Nise | ||
yangyi | Noso | Noso | ||
yangyi | LopeAwuChen2010 | Lope (Awu) | ||
yangyi | LopeAwuYYFC1983 | Lope (Awu)2 | ||
yangyi | Lidim | Lidim (Tianba) | ||
zhivlovobugrian | lowerlozvamansi | Lower Lozva Mansi | ||
zhoubizic | ProtoBizic | Proto-Bizic | ||
abvdoceanic | Riwo | Riwo | ||
abvdoceanic | TesmbolUsus | Tesmbol (Usus) | ||
abvdoceanic | SivitiBeterbuJericho | Siviti (Beterbu, Jericho) | ||
abvdoceanic | atarxobuGunwar | ßatarxobu (Gunwar) | ||
abvdoceanic | Najit | Najit | ||
abvdoceanic | AlavasWowoWowo1 | Alavas-Wowo (Wowo 1) | ||
abvdoceanic | MandriFarun16291 | Mandri (Farun) 162-91 |
Most of the cases with missing glottocodes are not easy to fix, as we have dialects here, mainly in BDPA, etc. These were often deliberately ignored and would also not be recommended for tests.
Yeah, the family == None's look like isolates or proto-languages so should be ignored.
I've updated servamalagasy here; the things in abvdoceanic and transnewguinea have no glottocodes as they're either proto-languages Glottolog doesn't believe in, or are just not in Glottolog (although I figured out Riwo was a variety of Gedaged, so that should now be updated).
I'll look at peirosaustroasiatic shortly and see if I can match some of those up.
Note that abvdoceanic will have a big update in a few days, which will fix the Riwo issue.
Thanks ! I'm downloading updates regularly so I should get all the nice corrections anytime more come in.
I have two more questions:
In servamalagasy, there are several rows with the same glottocode and name in the language table. Would it be possible to adjust their names to make them distinguishable on the basis of just this pair of information?
dataset | glottocode | language_ID | language_Name | ISO639P3code |
---|---|---|---|---|
servamalagasy | anta1255 | AntankaranaVohemar | Antankarana | xmv |
servamalagasy | bara1369 | BaraRanohira | Bara | bhr |
servamalagasy | bets1235 | BetsileoAmbohimahasoa | Betsileo | |
servamalagasy | maha1309 | MahafalyEjeda | Mahafaly | |
servamalagasy | meri1243 | MerinaAnalavory | Merina | |
servamalagasy | nort2890 | BetsimisarakaMahanoro | Betsimisaraka | bmm |
servamalagasy | nort2890 | BetsimisarakaAntsiranana | Betsimisaraka | bmm |
servamalagasy | nort2890 | BetsimisarakaBrickaville | Betsimisaraka | bmm |
servamalagasy | nort2890 | BetsimisarakaToamasina | Betsimisaraka | bmm |
servamalagasy | nort2890 | BetsimisarakaMananara | Betsimisaraka | bmm |
servamalagasy | nort2890 | BetsimisarakaMaroantsetra | Betsimisaraka | bmm |
servamalagasy | saka1291 | SakalavaMorondava | Sakalava | skg |
servamalagasy | saka1291 | SakalavaMiandrivazo | Sakalava | skg |
servamalagasy | saka1291 | SakalavaBesalampy | Sakalava | skg |
servamalagasy | saka1291 | SakalavaBeloniTsiribihina | Sakalava | skg |
servamalagasy | siha1244 | SihanakaMoraranoChrome | Sihanaka | |
servamalagasy | siha1244 | SihanakaAndilamena | Sihanaka | |
servamalagasy | sout2920 | BetsimisarakaSahavato | Betsimisaraka | bzc |
servamalagasy | tsim1257 | TsimihetyMampikony | Tsimihety | xmw |
servamalagasy | tsim1257 | TsimihetyAndapa | Tsimihety | xmw |
servamalagasy | tsim1257 | TsimihetyAntsohihy | Tsimihety | xmw |
servamalagasy | vezo1235 | VezoMorombe | Vezo | |
servamalagasy | vezo1235 | VezoMorondava | Vezo |
If not, I can use the area to distinguish them, but this is a problem I encounter only with this single dataset, hence I am checking with you.
@XachaB, it is no problem to adjust the names; all one would need to do is to modify the name in the lexibank dataset. But my own take on most datasets that I have worked with so far is that the ID is the better representative of languages for the purpose of plotting their names, storing them, etc., which is why the IDs are not numeric wherever I found time to prepare this. So what I want to say: if you run into problems because your code distinguishes languages by their name, I'd recommend switching to the ID instead, as we also do in cltoolkit for this very reason.
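To illustrate the point about IDs, here is a minimal sketch using a few rows copied from the table above. The helper name `ambiguous_keys` is made up for this example and is not part of cltoolkit or lexibank.

```python
from collections import Counter

# A few rows from the servamalagasy language table above:
# (glottocode, language_ID, language_Name)
rows = [
    ("saka1291", "SakalavaMorondava", "Sakalava"),
    ("saka1291", "SakalavaMiandrivazo", "Sakalava"),
    ("siha1244", "SihanakaAndilamena", "Sihanaka"),
    ("vezo1235", "VezoMorombe", "Vezo"),
    ("vezo1235", "VezoMorondava", "Vezo"),
]


def ambiguous_keys(rows, key):
    """Return the keys that are shared by more than one row (hypothetical helper)."""
    counts = Counter(key(row) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Keying on (glottocode, name) collapses distinct varieties ...
by_name = ambiguous_keys(rows, key=lambda r: (r[0], r[2]))
# ... while keying on the language ID keeps them apart.
by_id = ambiguous_keys(rows, key=lambda r: r[1])

print(by_name)
print(by_id)
```

In this sample, the (glottocode, name) pair is ambiguous for the Sakalava and Vezo varieties, while the ID is unique for every row.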
As to the cognate datasets, @XachaB, we can make a list, but be aware that partial cognates are so far mostly only done by myself, so there is a single coder, as nobody else has coded partial cognates so far. My own conviction with respect to sound correspondences is that one always would need some version of partial cognates. But in all lexicore-CogCore datasets, the cognates are typically not partial cognates. Furthermore, if you want to detect which dataset uses partial cognates, you can do so by checking if the segment_slice attribute is defined for the lexeme in CLDF, which we use to render partial cognates.
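A rough sketch of such a check, assuming the cognate judgements have already been read into dictionaries (for instance with `csv.DictReader`) and that the column is called `Segment_Slice`. Both the column name and the helper name are assumptions for illustration, and the rows are invented toy data.

```python
def uses_partial_cognates(cognate_rows, column="Segment_Slice"):
    """True if at least one judgement carries a non-empty segment slice.

    `column` is an assumed column name; adjust it to the dataset at hand.
    """
    return any((row.get(column) or "").strip() for row in cognate_rows)


# Invented toy rows, not taken from any real dataset.
whole_word = [
    {"Form_ID": "f1", "Cognateset_ID": "c1", "Segment_Slice": ""},
    {"Form_ID": "f2", "Cognateset_ID": "c1"},
]
partial = [
    {"Form_ID": "f1", "Cognateset_ID": "c1", "Segment_Slice": "1:3"},
    {"Form_ID": "f1", "Cognateset_ID": "c2", "Segment_Slice": "4:6"},
]

print(uses_partial_cognates(whole_word))
print(uses_partial_cognates(partial))
```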
Ah, and the final problem is: even if we NEED partial cognates, like for many ST languages, we may not HAVE them. So you'd need a list that tells you which data come in segmented form. In theory this could also be checked automatically, but you'll encounter diversity, with one dataset for a family being segmented and another one not.
So it is not trivial, if you want to compare ACROSS datasets, what to do here. If you want to compare INSIDE datasets, I can provide all information.
I was maybe unclear, but indeed, I am not asking if the dataset provides partial cogids (this is easy to see). I am predicting cognates using lingpy, and am wondering when to use Partial and when to use Lexstat.
Of course, I can only use Partial when there are segmented words (which is also easy to check). But using this to guess whether I "should" is sort of a problem, since as you say there may be just a few words, or just a few languages, with segmented words. And it is indeed worse when comparing, as I am doing, across datasets, as two datasets for the same family might not both have the segmentation into morphemes.
I was hoping to improve the situation by:
Do you think there is some hope for this strategy ?
*: I understand your general conviction that partial cognates make more sense, but surely, there must be languages where doing whole word cognates is an acceptable approximation, and some where we just can't do without partial cognates ?
In fact, Partial in theory yields the same results if you use it on non-segmented wordlists. So you could just say: I use Partial in all cases.
So this would then solve your problem pragmatically.
And if you restrict experiments to certain salient contexts, where you have limited gaps in the data, since gaps are either due to sound change, or due to morphemes missing, you could maybe even get along well with lexstat itself, even on segmented datasets.
That's indeed a pragmatic answer !
The problem will remain of having diversity, with one dataset being segmented for the same family, and one not. In that case, we will get a terrible output if two cognates are on one side segmented and on the other unsegmented.
Should I take that as an argument for stopping cross-dataset comparisons ? I hope not, but I don't really see any obvious solutions.
Though note that this problem will lead to under-detection, rather than over-detection. Missing some cognates (bad recall) is less of a problem than having bad precision in the specific case of sound correspondences.
In my opinion, cross-dataset comparisons require quite some preprocessing, which makes them notoriously difficult to handle (e.g. in https://doi.org/10.12688/openreseurope.13843.2 I spent a lot of time working with the data and made several concept coverage checks to get the right balance, and I still have lots of missing data). Cross-dataset comparison would require deriving a list of some 300 concepts which have some 80% coverage in all datasets we select, and a mutual coverage of at least 150 words per language pair, without exceptions. So it would be a dataset we derive from the other datasets. In such a dataset, we could just delete all segmentations. But if we go this way, we should start right away and make a dedicated lexibank dataset that derives the data and also adds some conversions to the original word forms, that is, similar to lexibank-analysed, but with forms. Maybe this is even the best way to go? In this way, we can also kick out low-coverage languages, etc.
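The two coverage criteria mentioned here (per-concept coverage and mutual coverage per language pair) could be checked roughly as follows. This is a plain-Python sketch on invented toy data, not the procedure used in lexibank-analysed; the function names are made up, and the thresholds of 80% and 150 from the text would replace the toy values.

```python
from itertools import combinations


def concept_coverage(wordlist):
    """Fraction of languages that attest each concept.

    `wordlist` maps a language ID to the set of concepts it has a form for.
    """
    n = len(wordlist)
    concepts = set().union(*wordlist.values())
    return {c: sum(c in forms for forms in wordlist.values()) / n for c in concepts}


def mutual_coverage(wordlist, concepts=None):
    """Number of shared concepts for every language pair."""
    result = {}
    for a, b in combinations(sorted(wordlist), 2):
        shared = wordlist[a] & wordlist[b]
        if concepts is not None:
            shared &= concepts  # restrict to the selected concept list
        result[a, b] = len(shared)
    return result


# Invented toy data: three languages, three concepts.
wordlist = {
    "A": {"hand", "foot", "eye"},
    "B": {"hand", "eye"},
    "C": {"hand", "foot"},
}

coverage = concept_coverage(wordlist)
selected = {c for c, share in coverage.items() if share >= 0.8}
pairs = mutual_coverage(wordlist)
worst = min(pairs.values())

print(sorted(selected), pairs, worst)
```

Languages whose pairs fall below the mutual coverage threshold would then be candidates for removal before the derived dataset is built.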
The more I think about it, the more I think we should do exactly that: make a dedicated NEW dataset using the lexibank-analysed procedure, which would give us the best of what we have (some ~20 language families, high coverage among them, etc.). From there, one would then plug in your code.
So far my code loads and processes separate datasets. There would be quite a few changes if we moved to a special, smaller compound dataset on which I would then run my code. Let me know in advance if you decide to do that !
Re: entirely removing segmentations, that too is an option even without needing to make a separate dataset, of course.
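As a sketch of what removing segmentations on the fly could look like: the function below drops boundary markers from a token list. It assumes that morpheme boundaries are marked by a "+" token, which should be checked against the data; the function name and the example word are invented.

```python
def strip_boundaries(tokens, markers=("+",)):
    """Return the segments without morpheme boundary markers.

    `markers` is an assumption about how boundaries are encoded.
    """
    return [t for t in tokens if t not in markers]


# Invented example word, segmented into two morphemes.
segmented = ["k", "a", "+", "t", "o"]
flat = strip_boundaries(segmented)

print(flat)
```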
My preprocessing currently already ignores a lot of data (isolates, proto families, glottolog issues, loan words, etc). If you were to generate a list of concepts, I could easily dynamically pass that to further limit the set of data points I am working on (without needing to create any dedicated dataset).
For now, I have a condition where if the minimum mutual coverage in a family (across datasets) is below 100, I use the SCA method instead of lexstat+infomap. I could also just drop the family in that case, though of course that would even further reduce the amount of usable data.
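In code, the condition amounts to something like the sketch below. The function name and the `drop_sparse` switch are made up for this illustration, the returned strings are only labels, and the coverage numbers are invented.

```python
def choose_method(pair_coverage, threshold=100, drop_sparse=False):
    """Pick a cognate detection method label for one family.

    `pair_coverage` maps (language_a, language_b) to the number of
    shared concepts. Returns "lexstat", "sca", or None (drop the family).
    """
    lowest = min(pair_coverage.values())
    if lowest >= threshold:
        return "lexstat"  # enough data for lexstat + infomap
    return None if drop_sparse else "sca"


# Invented coverage numbers for two hypothetical families.
dense = {("A", "B"): 180, ("A", "C"): 120, ("B", "C"): 150}
sparse = {("A", "B"): 180, ("A", "C"): 95, ("B", "C"): 150}

print(choose_method(dense))
print(choose_method(sparse))
print(choose_method(sparse, drop_sparse=True))
```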
> And if you restrict experiments to certain salient contexts, where you have limited gaps in the data, since gaps are either due to sound change, or due to morphemes missing, you could maybe even get along well with lexstat itself, even on segmented datasets.
I hadn't seen this suggestion. Can you clarify what these "salient contexts" would be ? In any case, deletions being because of morphology is another big problem I am encountering.
LingPy allows to make an automatic syllabification and to derive some basic contexts, like pre-vocalic, post-vocalic, etc. In addition, one can reduce an analysis to word-initials. In addition, profiling alignments by checking how many consecutive gaps occur allows one to only derive those parts of an alignment where a sufficiently large number of columns is filled with sounds, specifically consecutively. Thus, while
would not really surprise me,
would show the loss of a whole syllable and is thus unlikely to result from regular sound change, at least not if you look at shallow time depths.
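The gap profiling idea can be sketched in plain Python as below. This does not use LingPy's own alignment classes; the function names, the gap symbol "-", and the toy alignment rows are all assumptions made for illustration.

```python
def longest_gap_run(aligned, gap="-"):
    """Length of the longest run of consecutive gaps in one aligned sequence."""
    longest = current = 0
    for segment in aligned:
        current = current + 1 if segment == gap else 0
        longest = max(longest, current)
    return longest


def plausible_rows(alignment, max_run=1):
    """Keep only the rows whose gaps never span more than `max_run` columns."""
    return [row for row in alignment if longest_gap_run(row) <= max_run]


# Invented toy alignment with four columns.
alignment = [
    ["k", "a", "t", "a"],
    ["k", "a", "-", "a"],  # a single gap: could be regular loss of one sound
    ["k", "a", "-", "-"],  # two consecutive gaps: possibly a missing syllable
]

kept = plausible_rows(alignment, max_run=1)
print(kept)
```

With `max_run=1`, the third row is filtered out, so only alignments with isolated gaps would feed into the correspondence analysis.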
As to the preprocessing: I see one danger if the code does things en passant and then processes and outputs sound correspondences. The advantage of using the admittedly new idea we pushed in lexibank-analysed is that these steps are made explicit. This helps to debug and to deal with errors directly, already when constructing the CLDF dataset from other CLDF datasets.
All code that has been written could be easily added to specific commands in such a repository, and one could make use of cltoolkit's Wordlist class, which was designed to allow for an easy integration of cldf datasets from different sources.
BTW: I am not sure how reliable the exclusion of loan words is, if it is only annotated sporadically. One should assume that correspondence patterns of low attestation would also allow us to simply exclude those cases later on?
Hi,
Background: I am writing the methods section for the sound correspondences paper. I evaluated the simple "cognate detection" method we used against the expert annotations in Lexicore datasets. The results are very poor (precision 59%, recall 34%). Since lingpy already has a class to do cognate detection properly (LexStat), I want to refactor the lexitools correspondences code to use it instead. Otherwise, I can't trust the results.
Problem:
I read the doc: LexStat expects a qlc filename in input.
However, I am reading several CLDF datasets, combining them per genus, and searching for cognates in each. I would need to pass lists of rows from different CLDF datasets (or some object wrapping dataset rows).
I will also need to disregard LexStat output whenever there was actually some expert annotation, since for each genus, some data will come from a single dataset with cognacy judgments, and some data will have been aggregated across datasets. While this seems simple (for each pair, check if it was annotated), I would like to be sure whether there isn't a trick that would make this setup a bad idea.
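Concretely, the per-pair check I have in mind would look something like this sketch. The data structures and the function name are hypothetical, the form IDs are invented, and expert cognate IDs are assumed to be comparable only within the same dataset.

```python
def is_cognate(a, b, expert, automatic):
    """Decide cognacy for two form IDs, preferring expert annotation.

    `expert` maps form ID -> (dataset, cognate set ID) for annotated forms.
    `automatic` maps form ID -> automatically inferred cognate set ID.
    """
    ea, eb = expert.get(a), expert.get(b)
    # Expert judgements only apply when both forms come from the same dataset.
    if ea is not None and eb is not None and ea[0] == eb[0]:
        return ea[1] == eb[1]
    return automatic[a] == automatic[b]


# Invented toy data: f1 and f2 are annotated in dataset "d1", f3 is not annotated.
expert = {"f1": ("d1", 7), "f2": ("d1", 8)}
automatic = {"f1": 1, "f2": 1, "f3": 1}

print(is_cognate("f1", "f2", expert, automatic))  # expert annotation wins
print(is_cognate("f1", "f3", expert, automatic))  # falls back to automatic
```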
Questions:
Notes:
@LinguList Do you have a solution ?