lexibank / lsi

CLDF dataset derived from Grierson's "Linguistic Survey of India" from 1928
https://lsi.clld.org
Creative Commons Attribution 4.0 International

Orthography profile creation #4

PhyloStar closed this issue 4 years ago

PhyloStar commented 4 years ago

What steps are followed to create the orthography profile? Is it some kind of regex matching, or is it completely manual?

How should the profile look so that it can be processed by the lexibank package?

LinguList commented 4 years ago

I just created an initial orthography profile; you can find it in etc/orthography.tsv.

Note that it has more than 1000 lines now, which is not surprising given the amount of data.

What you also need to know: we represent context in a simple form. ^t means "t at the beginning of a word" and t$ means "t at the end of a word". Keep this in mind when correcting the profile: ^ and $ are not strange symbols from the data, they tell you where in the word that sound occurs.

Furthermore, we use NULL to indicate that a segment should be deleted.
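To make the context markers and NULL concrete, here is a minimal Python sketch of a greedy longest-match conversion. It is only an illustration under my own assumptions, not the code lexibank or lingpy actually runs: the function name segment, the tiny profile, and the use of <?> for unmatched characters are all made up for the example.

```python
# Illustrative sketch only: NOT the lexibank/lingpy implementation.
NULL = "NULL"  # marker from the thread: the matched grapheme is deleted


def segment(word, profile):
    """Convert `word` to a list of segments with a grapheme -> IPA mapping.

    `profile` keys may carry context: "^t" is word-initial t, "t$" is
    word-final t.
    """
    marked = "^" + word + "$"  # make the word boundaries matchable
    longest = max(len(grapheme) for grapheme in profile)
    segments, i = [], 0
    while i < len(marked):
        # try the longest candidate first, then shorter ones
        for size in range(min(longest, len(marked) - i), 0, -1):
            chunk = marked[i:i + size]
            if chunk in profile:
                if profile[chunk] != NULL:
                    segments.append(profile[chunk])
                i += size
                break
        else:
            # no entry matched: a bare boundary marker is skipped silently,
            # any other character is flagged (hypothetical choice of marker)
            if 0 < i < len(marked) - 1:
                segments.append("<?>")
            i += 1
    return segments


# Hypothetical mini profile: t is only known at word edges, "-" is deleted.
profile = {"^t": "t", "t$": "t", "a": "a", "-": NULL}

print(segment("ta-t", profile))  # ['t', 'a', 't']
print(segment("atta", profile))  # ['a', '<?>', '<?>', 'a']
```

The second call shows why context matters: with only ^t and t$ in the profile, a word-medial t is not covered and still needs its own row.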

Please have a look first, and maybe work on the first 100 lines. I can then check and help by reviewing it.

PhyloStar commented 4 years ago

Does <?> mean we need to map the symbol?

LinguList commented 4 years ago

Exactly, you have to decide what it is, since the algorithm could not determine it.
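A quick way to see how much is left to decide is to list the rows that still contain <?>. The following is a small sketch, assuming the profile is tab-separated with columns named Grapheme and IPA; the column names and the helper unresolved are assumptions for illustration, so adjust them to the actual header of etc/orthography.tsv.

```python
import csv
import io


def unresolved(profile_text, ipa_column="IPA", marker="<?>"):
    """Return (line number, grapheme) for every row whose IPA field is undecided."""
    reader = csv.DictReader(io.StringIO(profile_text), delimiter="\t")
    hits = []
    # the header is line 1, so the first data row is line 2
    for line_number, row in enumerate(reader, start=2):
        if marker in (row.get(ipa_column) or ""):
            hits.append((line_number, row["Grapheme"]))
    return hits


# Made-up profile content; in practice read the text of etc/orthography.tsv.
sample = "Grapheme\tIPA\n^t\tt\nx\t<?>\n-\tNULL\n"
print(unresolved(sample))  # [(3, 'x')]
```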

PhyloStar commented 4 years ago

https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv#L23

This shows a missing item in the original data. It should be deleted.

PhyloStar commented 4 years ago

https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv#L24

A period below a consonant indicates that it is a retroflex. Should I delete these kinds of entries?

LinguList commented 4 years ago

Then you write NULL in the IPA field.

LinguList commented 4 years ago

Yes, but you have to do more: you have to look for consonants plus period in the data and add each of them in a new row, so that we get the right conversion. I assume lingpy didn't know this period was a diacritic.
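Finding those consonant-plus-period combinations can be partly automated. Here is a rough sketch, assuming the period is encoded as COMBINING DOT BELOW (U+0323), possibly hidden inside a precomposed character; that encoding, the helper name dotted_consonants, and the simple vowel list are assumptions, so check them against the real forms first.

```python
import unicodedata
from collections import Counter

DOT_BELOW = "\u0323"     # COMBINING DOT BELOW (assumed encoding of the period)
VOWELS = set("aeiou")    # crude vowel list, only to keep consonants


def dotted_consonants(forms):
    """Count every consonant + dot-below sequence found in `forms`."""
    counts = Counter()
    for form in forms:
        # NFD splits precomposed letters into base letter + diacritic
        decomposed = unicodedata.normalize("NFD", form)
        for base, mark in zip(decomposed, decomposed[1:]):
            if mark == DOT_BELOW and base.isalpha() and base.lower() not in VOWELS:
                counts[base + mark] += 1
    return counts


# Made-up forms; in practice take them from the "Form" column of cldf/forms.csv.
forms = ["\u1e6da", "pa\u1e6d", "\u1e0do"]
for grapheme, frequency in dotted_consonants(forms).most_common():
    # each candidate becomes a new profile row whose IPA value you fill in
    print(grapheme, frequency, sep="\t")
```

Each grapheme printed this way is a candidate for a new row in the profile; the IPA value still has to be decided by hand.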

PhyloStar commented 4 years ago

Perfect! I can do this. I will do these changes and push a version.

PhyloStar commented 4 years ago

https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv#L518

The language name "Gyåmi" is in the example. I think this relates to a tab error that was fixed.

PhyloStar commented 4 years ago

Is this the command you are using to generate the orthography profile?

cldfbench lexibank.init_profile lsi -f

LinguList commented 4 years ago

No. I used lingpy to create the initial profile from the forms in cldf. But I also do not really understand what command you are looking for...

To check if the profile does a good job, type

cldfbench lexibank.makecldf lsi

or type

cldfbench lexibank.check_profile lsi

this will point you to potential errors and the like.

PhyloStar commented 4 years ago

I am trying to understand the workflow here.

LinguList commented 4 years ago

Yes, this is what you should do: check whether the profile is getting "better", that is, whether the feedback shows fewer errors.

If there are notorious words you cannot capture, you can also add a file etc/lexemes.tsv, with the following columns:

LEXEME,REPLACEMENT,COMMENT

But for the conversion to work, the LEXEME column must contain the form exactly as it was given in the digitization.

The workflow essentially takes the entry from "form" in cldf/forms.csv and treats it with the orthography profile to convert it to segments accepted by LingPy (and phoible, etc.). So you should once in a while check what the "form" in "cldf/forms.csv" actually looks like.
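The replacement step can be pictured like this. It is a small sketch under assumptions, not the lexibank code: I assume the file is tab-separated with the three columns named above, that a replacement applies only when the whole form matches LEXEME exactly, and the helper names load_replacements and preprocess are made up.

```python
import csv
import io


def load_replacements(lexemes_text):
    """Read LEXEME -> REPLACEMENT pairs from the text of a lexemes file."""
    reader = csv.DictReader(io.StringIO(lexemes_text), delimiter="\t")
    return {row["LEXEME"]: row["REPLACEMENT"] for row in reader}


def preprocess(form, replacements):
    """Swap a form for its replacement if the whole form is listed."""
    return replacements.get(form, form)


# Made-up content standing in for etc/lexemes.tsv.
lexemes_text = (
    "LEXEME\tREPLACEMENT\tCOMMENT\n"
    "foo (bar)\tfoo\tdrop the bracketed gloss\n"
)
replacements = load_replacements(lexemes_text)

print(preprocess("foo (bar)", replacements))  # foo
print(preprocess("baz", replacements))        # baz
```

The cleaned form is what would then be passed on to the orthography profile.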

And yes, treat "weak aspiration" as breathy voiced, this is fine for me, provided there is no "breathy voiced" in the data.

PhyloStar commented 4 years ago

I did two rounds on the orthography profile. When I test it with the cldfbench lexibank.check_profile command (in a virtual environment), I get about 505 errors. I am attaching the error log (error.log). Can you please tell me what the error lines mean?

PhyloStar commented 4 years ago

I think I figured out how to work with cldfbench. I uploaded an orthography mapping. The number of errors is now below 50 segments.

LinguList commented 4 years ago

Excellent, that's what I was hoping for. In fact, cldfbench should be very easy to use; you just need to use virtual environments to make sure the dependencies are all in place ;)

I'll have a look later and help in improving.

Two tips:

$ cldfbench lexibank.makecldf lsi --dev

is faster than without --dev

and

$ cldfbench lexibank.check_profile lsi

is even faster, good for debugging the profile.

PhyloStar commented 4 years ago

Yes. This is really convenient.

I am using this regularly to debug, but I was running makecldf without the --dev option, which takes about 1 minute.