lexibank / diacl

CLDF dataset derived from Carling's "Diachronic Atlas of Comparative Linguistics" from 2017
Creative Commons Attribution 4.0 International

Problems in Serbian #11

Open LinguList opened 2 years ago

LinguList commented 2 years ago

There are some mismappings, as they have like 6 words for DEER in the data. We were informed by somebody who wrote to Joshua Jackson, who then wrote to me:

“Cow would be what is translated as Deer.
Krava = Cow
Vo = Ox
Bik = Bull
Jelen = Deer
Jelena is one of the common names in Serbia (likely related to Helen rather than Jelen, though)
Right beneath is KRV, which is BLOOD and definitely not the meat. MESO is meat. 
Above that KORA = it is a bark, but leather is KOŽA
Am I missing something important? 
Jegulja is an EEL, not a snake, ZMIJA is a snake. 
Konj is a male horse, kobila is a mare. 
Jare is NOT a lamb. Jagnje (janje) is a lamb, jare is a baby goat, not a sheep. 
Jagoda is a strawberry, not a grape. Grožđe is a term for the grapes, grozd is singular.”

I suggest we manually correct these cases via Lexemes. I would also inform the DIACL editors about this.

Or, @chrzyki, @xrotwang, is it possible that the error (something swapped here) is on the side of the pylexibank script?

LinguList commented 2 years ago

BTW: checking with German, we have the same problems for DEER.

https://clics.clld.org/languages/diacl-41700
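A quick way to find other over-mapped concepts is to count how many forms each language has per concept and flag the outliers. The following is only a sketch: the `(language, concept, form)` tuple layout, the threshold, and the exact assignment of the sample forms to concepts are assumptions for illustration, loosely based on the report above, not the actual DIACL data.

```python
from collections import Counter

# Illustrative rows only: (language, concept, form).
# The assignment of forms to concepts is assumed, not read from DIACL.
rows = [
    ("Serbian", "DEER", "krava"),
    ("Serbian", "DEER", "vo"),
    ("Serbian", "DEER", "bik"),
    ("Serbian", "DEER", "jelen"),
    ("Serbian", "SNAKE", "zmija"),
    ("Serbian", "SNAKE", "jegulja"),
]


def overloaded_concepts(rows, threshold=3):
    """Return (language, concept) pairs with at least `threshold` forms."""
    counts = Counter((lang, concept) for lang, concept, _ in rows)
    return {key: n for key, n in counts.items() if n >= threshold}


flagged = overloaded_concepts(rows)
print(flagged)
```

Pairs flagged this way are candidates for manual review; a high count alone does not prove a mismapping, since true synonyms also raise it.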

LinguList commented 2 years ago

If one checks diacl, it becomes clear that they have mapped a huge number of partly related terms to one master concept.

https://diacl.ht.lu.se/WordList/Index

This problem is also present, though less severely, in the Swadesh collection.

The problem is that DIACL did do a kind of concept mapping, but to their own internal concept lists, which are often much broader than what we would accept in Concepticon. Since all words in the database have meaning strings, one could circumvent this by making a master list of all meaning glosses we find in the data and mapping those instead.
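The master list of glosses could be collected roughly as follows. This is a minimal sketch under stated assumptions: the column names `Concept`, `Form`, and `Gloss` and the sample rows are hypothetical and would need to be adapted to the actual raw DIACL export.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; column names and rows are assumptions for illustration.
SAMPLE = """Language,Concept,Form,Gloss
Serbian,DEER,krava,cow
Serbian,DEER,jelen,deer
Serbian,DEER,bik,Bull
Serbian,SNAKE,jegulja,eel
"""


def gloss_master_list(handle):
    """Map each normalized meaning gloss to the internal concepts it occurs under."""
    glosses = defaultdict(set)
    for row in csv.DictReader(handle):
        gloss = row["Gloss"].strip().lower()
        glosses[gloss].add(row["Concept"])
    # Sort for a stable list that can be reviewed and mapped by hand.
    return {g: sorted(c) for g, c in sorted(glosses.items())}


master = gloss_master_list(io.StringIO(SAMPLE))
for gloss, concepts in master.items():
    print(gloss, concepts)
```

The resulting list is what one would then map to Concepticon, one gloss at a time, instead of relying on the broad internal concepts.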

In the current form, however, it is unclear if the data is well aggregated into CLICS.

chrzyki commented 2 years ago

Good catch and thanks for relaying the issue. Given the relatively specific relations I would hope that there isn't too much of an effect on CLICS-based analyses (i.e. most of the mappings will be very rare), but I fully agree: In this state it's not something that should be used in CLICS & Co. I think your suggestion (i.e. list of all meaning glosses, map) sounds good!

LinguList commented 2 years ago

So for CLICS4, we would either fix this issue by doing a re-mapping, or not include the dataset at all, since this kind of mapping upsets people who know the languages, and we would like to avoid that. DIACL has the meaning glosses, but they use their concepts differently than we do in CLICS, so we should only aggregate from DIACL where we know that the mapping corresponds to our model.
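Aggregating only where the gloss is known to correspond could look like the sketch below. The mapping table, the concept labels, and the gloss "kid" for jare are hypothetical placeholders; a real mapping would be curated against Concepticon.

```python
# Hypothetical hand-curated mapping from meaning gloss to concept label.
GLOSS_TO_CONCEPT = {"deer": "DEER", "cow": "COW", "ox": "OX", "bull": "BULL"}


def remap(entries, mapping):
    """Split (form, gloss) pairs into remapped entries and unmapped leftovers."""
    kept, skipped = [], []
    for form, gloss in entries:
        concept = mapping.get(gloss.strip().lower())
        if concept is None:
            # No trusted mapping: leave the entry out of the aggregation.
            skipped.append((form, gloss))
        else:
            kept.append((form, concept))
    return kept, skipped


entries = [("jelen", "deer"), ("krava", "cow"), ("jare", "kid")]
kept, skipped = remap(entries, GLOSS_TO_CONCEPT)
print(kept)
print(skipped)
```

Entries that end up in `skipped` are simply excluded rather than forced onto a nearby concept, which avoids the kind of mismapping reported above.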

FredericBlum commented 1 month ago

In addition to the concept problems, we also have no segmentation because there are no orthography profiles. Since it is unlikely that we can do a full remapping before the LB 2.0 release and no student assistants are available at the moment, I'd vote to retire the dataset from Lexibank. @LinguList @chrzyki Would you agree with this?

FredericBlum commented 1 month ago

I didn't realize that diacl was never part of the LB release in the first place. Thanks @chrzyki for clarifying.

LinguList commented 1 month ago

We skipped it after we found too many problems in CLICS3. They just link any concept to any gloss. So they may end up having a term "butterfly" and link it to "insect" in their internal concepticon (!).