lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Lack Wichi and Mapudungun languages, donor_value, ... #1

Closed fractaldragonflies closed 2 years ago

fractaldragonflies commented 2 years ago

I did a clean install of Sabor and ran the activation steps. Files were replaced with updated versions. Some essential and some optional issues remain.

I haven't actually tried to access the data with any code yet. I am assuming that some light adaptation will be necessary, but hopefully nothing dramatic!

report-example.txt

LinguList commented 2 years ago

Okay. You can easily solve most issues.

Just replace the profiles with the ones you want and keep their current names.

OK, cmd_makecldf overwrites the forms and leaves the segments, graphemes, and profile fields empty for the IDS languages. WOLD languages have segments, but no graphemes or profile either.

But I can take our model from KeyPano as a basis for upgrading to use the Spanish or Portuguese profiles. So I'll work it that way.

LinguList commented 2 years ago

If languages are missing, check their names in etc/languages.tsv and adjust them. I took the language names from a file you sent me, so some are probably wrong or simply missing.

It was a bit more subtle than an incorrect name. Mapudungun and Wichí occurred as both WOLD and IDS languages. The IDS language reference was overwriting the WOLD reference, so when matching on WOLD during the add step, they were never found. I added a test that the prefix is 'wold-', and it seems to get all 7 now.

Hmmm, but edictor/savor.tsv hasn't been updated. OK, I'll follow your earlier email advice: "One more thing: to obtain the wordlist, I now use "pyedictor" (pip install pyedictor). I added a folder edictor/Makefile, where you can type `make wordlist` and the wordlist will be created. Can also be used from within Python scripts!"
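The overwrite described in this comment can be reproduced with a small sketch. Only the 'wold-' prefix test comes from the comment itself; the row layout, the IDs, and the name-keyed dictionaries are assumptions made for illustration and are not the actual lexibank script.

```python
# Hypothetical language rows: the same name appears once per source dataset.
rows = [
    {"ID": "wold-Mapudungun", "Name": "Mapudungun"},
    {"ID": "ids-Mapudungun", "Name": "Mapudungun"},
    {"ID": "wold-Wichi", "Name": "Wichí"},
    {"ID": "ids-Wichi", "Name": "Wichí"},
]

# Naive lookup keyed on name: later (IDS) rows overwrite earlier (WOLD) rows.
naive = {row["Name"]: row["ID"] for row in rows}

# Guarded lookup: keep only rows whose ID carries the 'wold-' prefix.
wold_only = {
    row["Name"]: row["ID"] for row in rows if row["ID"].startswith("wold-")
}

print(naive["Mapudungun"])      # the IDS entry has replaced the WOLD entry
print(wold_only["Mapudungun"])  # the WOLD entry survives
```

With the guarded lookup, matching on WOLD finds both languages again, which is consistent with the "all 7" result reported above.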
LinguList commented 2 years ago

For donor values, can you check my code and the WOLD data to see where the value can be found?

OK, I belatedly read Mattis' suggestion below to look into the Borrowing.csv. There we find source_word (donor_word) and source_languoid (donor_language). This should do it. Thanks.
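A minimal sketch of that lookup, assuming the borrowings table is plain CSV. The column names `source_word` and `source_languoid` are taken from the comment above; `target_form_id`, the sample rows, and the helper `donor_for` are hypothetical, and the real WOLD table may name or capitalize its columns differently.

```python
import csv
import io

# Stand-in for the borrowings table; the rows are made up for illustration.
sample = io.StringIO(
    "target_form_id,source_word,source_languoid\n"
    "form-1,caballo,Spanish\n"
    "form-2,vaca,Spanish\n"
)

# Map each borrowed form to its (donor_language, donor_value) pair.
donors = {
    row["target_form_id"]: (row["source_languoid"], row["source_word"])
    for row in csv.DictReader(sample)
}


def donor_for(form_id):
    """Return (donor_language, donor_value), or empty strings if not listed."""
    return donors.get(form_id, ("", ""))


print(donor_for("form-1"))
print(donor_for("form-99"))
```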

There was a change in the WOLD DB structure, which confused me.

This remains a mystery. Versions of WOLD used with PyBor and a version I installed mid-year 2021 have donor_source, donor_value, and other fields. These are NOT included in the version of WOLD installed with SaBor. The SaBor version of WOLD appears to have truncated such fields. The CLDF JSON representations are different as well. Of note, the CLDF JSON for SaBor seems to match its Forms.csv, BUT the CLDF JSON for the earlier WOLD matches its Forms.csv ONLY up to the donor_language field. Attached is a file showing the different headers and a sample annotation of borrowing. [WOLD-discrepancies.txt](https://github.com/lexibank/sabor/files/8346030/WOLD-discrepancies.txt)
LinguList commented 2 years ago

Borrowed is easy to modify in the code now.

I've coded True for Borrowed_score > 0.9, else False. At some point we might want to experiment with > 0.6 instead, although probabilities of 0.0 and 1.0 are overwhelmingly the most common.
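As a sketch of that rule: the function name `is_borrowed`, its `threshold` parameter, and the sample scores are hypothetical, while the 0.9 and 0.6 cut-offs are the ones mentioned above.

```python
def is_borrowed(score, threshold=0.9):
    """Return True when the borrowed score is strictly above the threshold."""
    return float(score) > threshold


# Made-up scores, only to show where the two cut-offs disagree.
scores = [0.0, 0.75, 1.0]
strict = [is_borrowed(s) for s in scores]
lenient = [is_borrowed(s, threshold=0.6) for s in scores]

print(strict)
print(lenient)
```

Keeping the cut-off as a parameter makes the later experiment with 0.6 a one-argument change rather than a code edit.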

There is a surprising number of cases where Borrowed_score = 0.0 but a Donor_language and Donor_value are given. These would seem to be errors, but I don't know what the correction would be.
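One way to list such cases for review is sketched below. The field names follow the comment above; the rows and the helper `inconsistent` are made up for illustration, and the sketch only flags the rows without deciding what the correction should be.

```python
# Hypothetical form rows with the three fields discussed above.
rows = [
    {"ID": "f1", "Borrowed_score": "0.0", "Donor_language": "Spanish", "Donor_value": "caballo"},
    {"ID": "f2", "Borrowed_score": "1.0", "Donor_language": "Spanish", "Donor_value": "vaca"},
    {"ID": "f3", "Borrowed_score": "0.0", "Donor_language": "", "Donor_value": ""},
]


def inconsistent(rows):
    """Return IDs of rows scored 0.0 that still name a donor language or value."""
    return [
        row["ID"]
        for row in rows
        if float(row["Borrowed_score"]) == 0.0
        and (row["Donor_language"] or row["Donor_value"])
    ]


suspect_ids = inconsistent(rows)
print(suspect_ids)
```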

LinguList commented 2 years ago

You can make changes and post them as a PR.

LinguList commented 2 years ago

@fractaldragonflies, I can also do this later, but if you are impatient (as I would be), please have a look at the CLDF data and at my code for getting the language data, as this may clarify some points of failure (e.g., why some languages don't get included now).

fractaldragonflies commented 2 years ago

I'll give it a go... as you said, most fixes should be straightforward. And I need to learn this! I'm still not clear where the donor_value comes from, but that should be discoverable from the CLDF of our earlier work in Pybor if it is not obvious in Sabor. Thanks!

LinguList commented 2 years ago

It is in the wold/cldf/borrowings.csv table, which I load right at the beginning of the lexibank script! There, you can see that the donor value is present and can be loaded, but it has a different name!

OK, sorry I didn't take this into account above.

fractaldragonflies commented 2 years ago

Issues were resolved by the improvestore branch and pull request.