Closed PhyloStar closed 4 years ago
I just created an initial orthoprofile; you can find it in etc/orthography.tsv.
Note that this shows > 1000 lines now, which is not surprising with all the data.
What you also need to know: we represent context in a simple form. ^t means "t at the beginning of a word" and t$ means "t at the end of a word". Keep this in mind when correcting the profile: ^ and $ are not strange symbols, they show you where the sound occurs.
Furthermore, we use NULL to indicate that a segment should be deleted.
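To illustrate the conventions above, here is a minimal Python sketch of how ^, $ and NULL could be interpreted when a profile is applied to a word. It uses greedy longest-match segmentation and is not the code that lexibank or LingPy actually runs; the function name apply_profile, the use of <?> for unmapped characters and the toy profile entries are assumptions made for the example.

```python
# Toy profile (invented values): grapheme, optionally with ^ or $ context, to IPA.
TOY_PROFILE = {"^t": "tʰ", "t$": "ʔ", "t": "t", "a": "a", "-": "NULL"}


def apply_profile(form, profile):
    """Segment `form` by greedy longest match against `profile`.

    A value of "NULL" deletes the matched grapheme from the output.
    """
    # Wrap the word in boundary markers so ^t and t$ match as plain strings.
    text = "^" + form + "$"
    longest = max(len(grapheme) for grapheme in profile)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                if profile[chunk] != "NULL":
                    segments.append(profile[chunk])
                i += size
                break
        else:
            # Bare boundary markers carry no sound; anything else is unmapped.
            if text[i] not in "^$":
                segments.append("<?>")
            i += 1
    return segments


print(apply_profile("ta-tat", TOY_PROFILE))
```

In this sketch the initial and final t are converted differently from the medial one only because the toy profile says so; a real profile would list context entries only where the conversion actually differs.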
Please have a look first, and maybe work on the first 100 lines. I can then check and help by reviewing it.
Does <?> mean we need to map the symbol?
Exactly, you have to decide what it is, as the algorithm is not clear about it.
https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv#L23
This shows a missing item in the original data. It should be deleted.
https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv#L24
A period below a consonant indicates that it is a retroflex. Should I delete these kinds of entries?
then you write NONE in the field for IPA
yes, but you have to do more: you have to look for consonants plus period in the data and add each of them in a new row, so that we have the right conversion. LingPy didn't know this period was a diacritic, I assume.
Perfect! I can do this. I will do these changes and push a version.
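The search for consonant-plus-period sequences could be sketched as follows. This is only a sketch under assumptions: it treats the diacritic as a plain "." typed after the consonant, the retroflex mapping table is a hypothetical starting point that needs checking against the source, and the sample forms are invented rather than taken from the data.

```python
import re
from collections import Counter

# Hypothetical mapping from dotted consonants to IPA retroflexes.
RETROFLEX = {"t.": "ʈ", "d.": "ɖ", "n.": "ɳ", "l.": "ɭ", "s.": "ʂ", "r.": "ɽ"}

# One non-vowel ASCII letter followed by a period.
DOTTED = re.compile(r"[b-df-hj-np-tv-z]\.")


def dotted_consonant_rows(forms):
    """Return (grapheme, proposed IPA, frequency) for each dotted consonant."""
    counts = Counter(
        match for form in forms for match in DOTTED.findall(form.lower())
    )
    # Unknown combinations get <?> so they can be decided by hand.
    return [(g, RETROFLEX.get(g, "<?>"), n) for g, n in counts.most_common()]


# Invented sample forms, for illustration only.
for row in dotted_consonant_rows(["pat.a", "kan.t.a", "d.al"]):
    print("\t".join(str(cell) for cell in row))
```

A word-final period that is ordinary punctuation would also match here, so the resulting rows are candidates to review, not rows to paste into the profile unchecked.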
https://github.com/lexibank/lsi/blob/master/etc/orthography.tsv#L518
The language name "Gyåmi" is in the example. I think this relates to a tab error that was fixed.
Is this the command you are using to generate the orthography profile: cldfbench lexibank.init_profile lsi -f ?
No. I used lingpy to create the initial profile from the forms in cldf. But I also do not really understand what command you are looking for...
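For orientation, drafting an initial profile from a list of forms could look roughly like the sketch below. It is not LingPy's implementation (the thread does not say which LingPy call was used); it simply counts every character with ^ and $ context and leaves the IPA column as <?> for manual correction. The function name, the Grapheme and Frequency column names and the sample forms are assumptions.

```python
from collections import Counter


def draft_profile(forms):
    """Build a tab-separated draft profile with ^/$ context markers."""
    counts = Counter()
    for form in forms:
        last = len(form) - 1
        for i, char in enumerate(form):
            if i == 0:
                grapheme = "^" + char   # word-initial
            elif i == last:
                grapheme = char + "$"   # word-final
            else:
                grapheme = char         # word-medial
            counts[grapheme] += 1
    rows = [("Grapheme", "IPA", "Frequency")]
    # Every IPA cell starts as <?> and has to be filled in by hand.
    rows += [(g, "<?>", str(n)) for g, n in sorted(counts.items())]
    return "\n".join("\t".join(row) for row in rows)


# Invented sample forms, for illustration only.
print(draft_profile(["tat", "ata"]))
```

Because each character is listed once per context, a draft like this grows quickly, which fits the observation that the initial profile has more than 1000 lines.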
To check if the profile does a good job, type
cldfbench lexibank.makecldf lsi
or type
cldfbench lexibank.check_profile lsi
this will point you to potential errors and the like.
I am trying to understand the workflow here.
etc/orthography.tsv and then pushed to the repo. Is this right?
God concept. This does not make sense in Indian languages but the original manuscript does show it anyway.
h preceded by a consonant shows aspiration whereas ' shows weak aspiration. Should we treat it as ʰ vs. ʱ?
Yes, this is what you should do: check if the profile is getting "better", with fewer errors, if the feedback works.
If there are notorious words you cannot capture, you can also add a file etc/lexemes.tsv, with the following columns:
LEXEME,REPLACEMENT,COMMENT
But you need to take the form as it was given in the digitization, to convert it.
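A rough sketch of how such a replacement table could be applied is shown below. It assumes that the file is tab-separated with a header row and that a replacement applies only when the whole form matches LEXEME exactly; both points, like the helper names and the sample row, are assumptions and not a description of what lexibank does internally.

```python
import csv


def load_replacements(lines):
    """Read LEXEME/REPLACEMENT pairs from tab-separated lines with a header."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["LEXEME"]: row["REPLACEMENT"] for row in reader}


def replace_lexeme(form, replacements):
    """Return the replacement for `form`, or `form` itself if none is listed."""
    return replacements.get(form, form)


# Invented table content, for illustration only.
table = ["LEXEME\tREPLACEMENT\tCOMMENT", "xx\tyy\ttoy entry"]
replacements = load_replacements(table)
print(replace_lexeme("xx", replacements), replace_lexeme("zz", replacements))
```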
The workflow essentially takes the entry from "form" in cldf/forms.csv and treats it with the orthography profile to convert it to segments accepted by LingPy (and phoible, etc.). So you should once in a while check what the "form" in "cldf/forms.csv" actually looks like.
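Checking what the "form" values look like takes only a few lines of Python. The sketch below assumes the CSV has a header row with Form and Segments columns, which is the usual CLDF wordlist layout but has not been verified against this dataset; the helper name sample_forms is hypothetical.

```python
import csv
from itertools import islice


def sample_forms(lines, limit=10):
    """Return up to `limit` (Form, Segments) pairs from CSV lines."""
    reader = csv.DictReader(lines)
    return [
        (row["Form"], row.get("Segments", ""))
        for row in islice(reader, limit)
    ]


# Usage, run from the dataset root (path as given in the comment above):
# with open("cldf/forms.csv", encoding="utf-8", newline="") as handle:
#     for form, segments in sample_forms(handle):
#         print(form, "->", segments)
```

Comparing the two columns side by side shows directly which graphemes of the form the profile converts, and how.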
And yes, treat "weak aspiration" as breathy voiced, this is fine for me, provided there is no "breathy voiced" in the data.
I did two rounds of the orthography profile. When I test with the cldfbench check_profile command (in a virtual environment), I get about 505 errors. I am attaching the error log. Can you please tell me what the error lines mean? error.log
I think I figured out how to work with cldfbench. I uploaded an orthography mapping. The number of errors is now below 50 segments.
excellent, that's what I was hoping for, in fact, cldfbench should be very easy to use, you just need to use virtual environments to make sure dependencies are all in place ;)
I'll have a look later and help in improving.
Two tips:
$ cldfbench lexibank.makecldf lsi --dev
is faster than without --dev
and
$ cldfbench lexibank.check_profile lsi
is even faster, good for debugging the profile.
Yes. This is really convenient.
I am using this regularly to debug, but I was running makecldf without the --dev option, which takes about 1 minute.
What are the steps followed for the creation of the orthography profile? Is it some kind of regex matching or completely manual?
How should the profile look so that it can be processed by the lexibank package?