lexibank / lsi

CLDF dataset derived from Grierson's "Linguistic Survey of India" from 1928
https://lsi.clld.org
Creative Commons Attribution 4.0 International

orthography profiles: certain problems show up when checking Chinese #22

Open LinguList opened 4 years ago

LinguList commented 4 years ago

I checked Chinese and found that all sounds are represented as "breathy unvoiced", so the marker that converts them to breathy is wrongly shown as breathy voice.

The problem is: Grierson's alphabet is limited. It re-uses characters that should not really be re-used, and this will of course always lead to problems.

Our task is to find the best compromise, so that the transcriptions still look okay.

LinguList commented 4 years ago

But we can in fact also make a little comparison of the words with some of the dialect data we have and see how different they are. Grierson has sometimes taken data from others without adjusting the orthography, etc.

But we need another thorough check of the orthography profile.

LinguList commented 4 years ago

I have a rather radical new solution. This will require more work, but will allow us to target errors more consistently in the future.

I am currently preparing individual orthography profiles per language.

These are based on our master profile, and in the future we should edit them for each language individually. We can theoretically just leave them as they are, but I'd propose we change them where we know better, e.g., in Mandarin, when we find errors there.

PhyloStar commented 4 years ago

Super. I was about to ask if it is possible to hardwire language information into the lexemes.tsv file.

LinguList commented 4 years ago

I already did this, but differently. I'll push my changes now.

PhyloStar commented 4 years ago

Another example is Toda, which has 6 retroflex consonants, whereas the LSI transcription is not even close.

https://lsi.clld.org/languages/TODA#tipa

Wiki article here: https://en.wikipedia.org/wiki/Toda_language

LinguList commented 4 years ago

If you now check, you find that we have a folder etc/orthography with the extracted profile (only parts USED) for each language. This is easy to edit, and if one finds a language that we think is important, we can just update it. We can also ask future users to help, etc.
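To illustrate the idea of keeping only the parts of the master profile that a language actually uses, here is a minimal sketch. It is not the code in raw/orthography.py: the function names, the `Grapheme` and `IPA` column names, and the toy data are assumptions made for the example, and the greedy longest-match segmentation is a simplification.

```python
def segment(form, graphemes):
    """Greedy longest-match segmentation; unknown characters become single segments."""
    longest = max(map(len, graphemes), default=1)
    out, i = [], 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in graphemes or size == 1:
                out.append(chunk)
                i += size
                break
    return out


def extract_used_profile(master, forms):
    """Return the rows of the master profile whose grapheme occurs in the forms."""
    graphemes = {row["Grapheme"] for row in master}
    used = set()
    for form in forms:
        used.update(segment(form, graphemes))
    # Keep the master order so the per-language file stays easy to compare.
    return [row for row in master if row["Grapheme"] in used]


# Toy master profile and toy forms (hypothetical data, not taken from the LSI).
master = [
    {"Grapheme": "k'", "IPA": "kʰ"},
    {"Grapheme": "k", "IPA": "k"},
    {"Grapheme": "a", "IPA": "a"},
    {"Grapheme": "ṭ", "IPA": "ʈ"},
]
profile = extract_used_profile(master, ["k'a", "ka"])
print([row["Grapheme"] for row in profile])
```

The row for "ṭ" is dropped because neither toy form contains it, which is what keeps the per-language files small and easy to edit by hand.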

LinguList commented 4 years ago

I used a script raw/orthography.py to do this. But I don't recommend using it, as it overwrites the whole orthography/ folder. If you are interested, just have a look at it.

I also swap tones by adding them to the lexemes.tsv now.

LinguList commented 4 years ago

And I changed the representation to aspiration for all cases that had the apostrophe. It is better to use "aspirated" than "breathy", as this is the more traditional way of handling it.
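As a rough sketch of what such a change amounts to in a profile, the snippet below rewrites every grapheme that ends in an apostrophe so that its IPA value is the value of the base grapheme plus the aspiration mark. The column names, the helper name `aspirate_apostrophes`, and the assumed earlier breathy value are hypothetical; the actual change in the repository may have been made differently.

```python
ASPIRATION = "\u02b0"  # MODIFIER LETTER SMALL H


def aspirate_apostrophes(rows, mark="'"):
    """Map graphemes ending in `mark` to the IPA of their base plus aspiration."""
    base_ipa = {row["Grapheme"]: row["IPA"] for row in rows}
    fixed = []
    for row in rows:
        grapheme = row["Grapheme"]
        base = grapheme[:-len(mark)]
        if grapheme.endswith(mark) and base in base_ipa:
            # Copy the row instead of mutating the input.
            row = {**row, "IPA": base_ipa[base] + ASPIRATION}
        fixed.append(row)
    return fixed


rows = [
    {"Grapheme": "k", "IPA": "k"},
    {"Grapheme": "k'", "IPA": "k\u02b1"},  # breathy value before the change (assumed)
    {"Grapheme": "a", "IPA": "a"},
]
result = aspirate_apostrophes(rows)
print(result[1]["IPA"])
```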

PhyloStar commented 4 years ago

Individual file method looks so tractable. :man_dancing:

Aspiration: Super.

Looking into the tones.

PhyloStar commented 4 years ago

Is the segmentation using the individual language orthography profiles?

LinguList commented 4 years ago

Yes. If you run the whole process, it will first check whether there is an individual profile for the language and use it. (Make sure to download the most recent concepticon with the grierson concept list, which I also modified.) You can test it by modifying something in the orthography/ folder and seeing if it works.
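The lookup described here can be pictured roughly as follows. This is only a sketch under assumptions: the helper `choose_profile`, the one-file-per-language naming (`<ID>.tsv`), and the master profile path are hypothetical and not taken from the repository.

```python
import tempfile
from pathlib import Path


def choose_profile(language_id, profile_dir, master_profile):
    """Prefer the per-language profile; fall back to the master profile."""
    candidate = Path(profile_dir) / f"{language_id}.tsv"
    return candidate if candidate.is_file() else Path(master_profile)


# Demo in a temporary folder standing in for the orthography folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "TODA.tsv").write_text("Grapheme\tIPA\n", encoding="utf-8")
    toda = choose_profile("TODA", tmp, "etc/orthography.tsv")
    other = choose_profile("UNKNOWN", tmp, "etc/orthography.tsv")

print(toda.name)
print(other)
```

With this kind of fallback, editing or adding a single per-language file is enough to change how that language is segmented, without touching the others.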

LinguList commented 4 years ago

I just adjusted the Chinese profile.

LinguList commented 4 years ago

But Grierson's data is not very consistent there, so we cannot do too much. That is okay: we can say that we can in principle do this nicely for every language, but that for now there may be inconsistencies.

PhyloStar commented 4 years ago

Okay. I will go through the list of languages that Grierson lists and will change the orthography files in the evening and tomorrow. I will push the fresh files later.

I pulled the concepticon data and could run the makecldf command without errors.

LinguList commented 4 years ago

Very good. I think these changes may be a bit tedious, but we can only win by being more explicit here. And our workflow was still good: do it for all languages the first time, then you have fewer things to correct later.

PhyloStar commented 3 years ago

I fixed a number of language profiles here. I reran the cognate detection, the phoneme tables, and the correlations of LSI with the ASJP and PHOIBLE databases. I looked at the NeighborNet for Dravidian and it looks close to what we know about the family.