lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Addon1 #6

Closed LinguList closed 2 years ago

LinguList commented 2 years ago

@fractaldragonflies, let us reboot everything. I propose new code, and specifically a new way to store results: we now write them to the wordlist in columns, and to persist them, we write the wordlist to file. That means we can evaluate with a separate script and don't need to do it in the same script.

I illustrate the new procedure with a very lightweight version of pairwise borrowing detection. It is only a few dozen lines long but does exactly what we want; I have already run numerous test analyses, and it can easily be extended. The "library" part can later be added to lingrex.

To run this, just type:

$ pip install -e . # this ignores the old sabor code and uses the new saborncommands folder to start anew
$ cldfbench sabor.get_borrowings

This yields the first example results.

Then to run all analyses (many thresholds, edit distance vs. SCA), run:

$ cldfbench sabor.get_borrowings --full

I'll later re-name this to "get_borrowings_from_pairs" or similar (so please do not modify the filename now), and add the same for lingrex etc.

To evaluate, just run:

$ cldfbench sabor.evaluate

This yields:

--------------  --------  ------------  -----
                borrowed  not borrowed  total
identified      111       7             118
not identified  1451      11938         13389
total           1562      11945         13507
--------------  --------  ------------  -----

You can modify this script (please do not add too many parameters, let us keep it simple now!) to account for real scores, like accuracy or whatever.
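For reference, the usual scores can be derived directly from the four cells of such a confusion matrix. A minimal sketch in plain Python, assuming the "identified"/"borrowed" cell counts the true positives:

```python
def scores(tp, fp, fn, tn):
    """Standard classification scores from the four confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Cells from the table above: identified/borrowed = 111 (true positives),
# identified/not borrowed = 7, not identified/borrowed = 1451,
# not identified/not borrowed = 11938.
p, r, f, a = scores(111, 7, 1451, 11938)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f} accuracy={a:.3f}")
```

Note that accuracy is dominated by the large "not borrowed" class here, which is why precision and recall are the more informative numbers for this task.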

Another example with SCA:

$ cldfbench sabor.evaluate --file=store/pw-spa-SCA-0.30.tsv

Output is:

--------------  --------  ------------  -----
                borrowed  not borrowed  total
identified      924       1001          1925
not identified  638       10955         11593
total           1562      11956         13518
--------------  --------  ------------  -----

Note that this searches for borrowings only from Spanish, pure classification.

fractaldragonflies commented 2 years ago

I performed the merge, but with commentary. I'll install and play with it shortly.

Here I repeat my commentary from the merge:

So the BIG discrepancies seem to be:

OK, I'm installing the new merge now and will play with this.

LinguList commented 2 years ago

I am sorry if I am causing confusion and anger. And there is a lot of confusion remaining. Let me try to explain what my major problems were and are:

  1. there is a large bunch of code, and I am really not able to follow many of the functions that are placed partly in the SABOR library and partly elsewhere, since many functions are merely there to load one dataset, and there was even a library function that would handle parameters. I consider these the parts that later make it very hard to see what the core method is actually doing. So what is the part that runs the comparison of datasets? Where is the function that compares two strings? How does one handle the thresholds?
  2. the sabor library inside a CLDF dataset clearly disrupted the installation, and it also conflicts with one major principle: libraries go to PyPI, while data with some extra code that analyses a specific dataset go to Zenodo later. If you check the two functions I wrote now to make the pairwise comparison work, you see that they are simple enough to be included in lingrex at a later point, if we learn that they work well. They are completely extendable and can be combined, as they have simple parameters and they follow the major procedure we use in lingpy of writing results, which are also useful for concrete work, to the wordlist file.
  3. testing becomes almost impossible when mixing library code with code that is only there to load a very specific dataset or to load parameters, but it seemed to me that this was exactly what happened in the sabor library: it is not a "library" in the sense of code that I could use in another program, but really a collection of code that is only usable for this specific analysis, so I do not see where it can later be included in larger libraries, since I have such trouble understanding where the actual code lies. I think the code that exemplifies how a pairwise borrowing detection with explicit donors could work shows that this does not need to be that complicated: we have one function that does the iteration, and it uses another function that does the sequence comparison, plus some parameters that tell how the data should be written to the wordlist / file.

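To make that structure concrete: the description above amounts to one function that iterates over the wordlist and one function that compares two strings against a threshold. The following is a self-contained sketch of that two-function design, not the actual sabor code: it uses plain normalized edit distance instead of SCA, and the row and field names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (one rolling row)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                      # deletion
                dp[j - 1] + 1,                  # insertion
                prev + (a[i - 1] != b[j - 1]),  # (mis)match
            )
    return dp[n]


def distance(a, b):
    """Normalized edit distance in [0, 1]."""
    return edit_distance(a, b) / max(len(a), len(b))


def find_borrowings(rows, donor, threshold=0.3):
    """Flag, for every non-donor form, the closest donor form for the
    same concept whenever its distance falls below the threshold."""
    donors = [r for r in rows if r["language"] == donor]
    hits = []
    for row in rows:
        if row["language"] == donor:
            continue
        candidates = [
            (distance(row["form"], d["form"]), d)
            for d in donors if d["concept"] == row["concept"]
        ]
        if candidates:
            score, best = min(candidates, key=lambda c: c[0])
            if score < threshold:
                hits.append((row["language"], row["form"], best["form"], score))
    return hits


rows = [
    {"language": "Spanish", "concept": "cat", "form": "gato"},
    {"language": "LangA", "concept": "cat", "form": "gatu"},
    {"language": "LangB", "concept": "cat", "form": "miz"},
]
print(find_borrowings(rows, "Spanish"))  # LangA's "gatu" is flagged
```

Swapping the comparison function (e.g. for SCA-based distances from lingpy) or the iteration strategy leaves the other part untouched, which is the point of keeping the two concerns separate.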
Note that the confusion matrix was just added by me because I was too lazy to look up which part is the F-score, which part is accuracy, etc. It has nothing to do with the desired output; it is merely a way to show that storing results in a wordlist itself has many advantages, since the wordlist can be readily read by lingpy, while other TSV formats cannot. It has one more major advantage: the result looks like the gold standard, which also contains a donor_language, etc., so one can even open the data in EDICTOR and check the results manually, you see? I have to admit that it did not occur to me before, but obviously this is always the way in which we work in lingpy: we write results to file, and we then check them manually. I always had the feeling that this was NOT possible, so it was a crucial step for me now.
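The "write results to the wordlist" idea can be pictured as plain TSV rows whose extra columns mirror the gold standard. A minimal sketch, with all column names (donor_language, donor_value, the file path) purely illustrative:

```python
import csv
import io

# Wordlist rows with detection results added as extra columns that mirror
# the gold standard; the column names here are illustrative.
rows = [
    {"id": "1", "doculect": "LangA", "concept": "cat", "form": "gatu",
     "donor_language": "Spanish", "donor_value": "gato"},
    {"id": "2", "doculect": "LangB", "concept": "cat", "form": "miz",
     "donor_language": "", "donor_value": ""},
]

buf = io.StringIO()  # in real use: open("store/result.tsv", "w")
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Because the output keeps the header-plus-rows shape of a wordlist, the same file can then be re-read for evaluation or inspected by hand.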

Note, last but not least, that writing a test for the code I wrote is extremely easy, once it is included in an extra library. And the code can already be used for any other lexibank dataset, since it simply takes a lingpy wordlist and a language and then does the searching. For lingrex and the simple multiple comparison, I'll add similar functions of similarly small size.

If you want, I'll do a careful comparison of the code, but please understand that my major problem throughout all this time was that I could not find the core operation, the core algorithm that was actually doing the work, as it was hidden among a lot of other code that was mainly dealing with argument parsing.

fractaldragonflies commented 2 years ago

Trying to get a handle on my tasks then. I've inserted some queries between quoted parts of your comment.

> Then to run all analyses (many thresholds, edit distance vs. SCA), run:
>
> $ cldfbench sabor.get_borrowings --full
>
> I'll later re-name this to "get_borrowings_from_pairs" or similar (so please do not modify the filename now), and add the same for lingrex etc.
>
> To evaluate, just run:
>
> $ cldfbench sabor.evaluate
>
> This yields:
>
> --------------  --------  ------------  -----
>                 borrowed  not borrowed  total
> identified      111       7             118
> not identified  1451      11938         13389
> total           1562      11945         13507
> --------------  --------  ------------  -----
>
> You can modify this script (please do not add too many parameters, let us keep it simple now!) to account for real scores, like accuracy or whatever.
>
> Another example with SCA:
>
> $ cldfbench sabor.evaluate --file=store/pw-spa-SCA-0.30.tsv
>
> Note that this searches for borrowings only from Spanish, pure classification.

In the earlier pairwise report we captured Spanish and Portuguese individually, and the combination of the two as a language family (for shared cognates, that is). I hadn't previously run Spanish alone, but it certainly was/is part of our capability.

fractaldragonflies commented 2 years ago

With respect to @LinguList numbered comments...

  1. The ironic part is that I was pleased with having cleanly partitioned the analysis, evaluation, and report functions, and with having reduced code redundancy. Oh well.
  2. Yes, this was a significant error on my part... not knowing there was a standard for this, and the problem I introduced with the install.
  3. I thought that this library could be the basis of more than a one-off analysis. But no, it is not a general library.
  4. Mentioned earlier. Yes, too many args, and their organization too convoluted.

OK... to work!

LinguList commented 2 years ago

I answer your points above in the same order.

  1. We can, and maybe should, add an ID for classes of borrowed words, which we often call BORID, denoting: word X and word Y are either cognate or one is borrowed from the other. Since the partial approach with a dedicated donor "finds" the source as well (as defined by the expert), we definitely also want to have that info in the file, so it is up to me to add this to the current pairwise function. We have the donor_id, so we don't need the donor_value, as the donor_id references the value, right? But sure, it can be added, as this may make checking easier.
  2. The new function works by accepting one or more donor languages as parameter. If there is one, one can calculate for one language, if there are more, I'd say the same procedure applies: the computer says "Spanish" but it is either nothing or Portuguese, so it is a false positive, etc. I would not say: "well, but the computer found out it is a borrowing" here, since this is a pure chance match following the logic of the approach.
  3. Yes, you are right. I forgot this. The new version just ranks by similarity. I'd say the very nature of the pairwise approach is to find borrowings from just one donor, not more, as considering more would mean comparing across the words from the donor languages. So the way to proceed here would be to relax the criterion and allow for more than one candidate below the threshold, but one should ask oneself if this is desired behavior. For me, the best and cleanest way is to keep this method as one that ideally identifies just a single donor for now. The other methods would then do this job!
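The BORID idea from point 1 above amounts to a transitive grouping of detected links: any two words connected by a borrowing or cognacy judgment end up in the same class. A hypothetical sketch using union-find (not the sabor code; function and variable names are invented):

```python
def assign_borids(pairs):
    """Merge word IDs connected by borrowing/cognacy links into classes
    (union-find) and return a mapping from word ID to a numeric BORID."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    labels, borid = {}, {}
    for x in parent:
        borid[x] = labels.setdefault(find(x), len(labels) + 1)
    return borid


# Links 1-2 and 2-3 chain into one class; 4-5 forms another.
print(assign_borids([(1, 2), (2, 3), (4, 5)]))
```

Such a column would then be written to the wordlist alongside the donor information, so that each class of cognate/borrowed words shares one ID.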

LinguList commented 2 years ago

My proposal is that I will adjust pairwise now (adding a general BORID) and also add the lingrex and multiple functions (in separate files, I think). Same principle: we can later take the functions that do the main work and put them into the lingrex library, where we'd add unit tests for them.

LinguList commented 2 years ago

I'd leave the evaluation to you, as you have a clearer plan on what needs to be done here, but we'd assume a unified format, based on fields (which can be passed as parameters). I will call them "gold" vs. "test" (= the automated approach):

  1. for direct donor predictions, we have a field for the source, which we call source_language, so the params could be gold_donor and test_donor. I don't think any other specific test needs to be made for this kind of prediction: whether the language is the same or different, etc. But we can also evaluate as we do in lingpy: donor_detection(wordlist, gold, test), with three params (the wordlist, a reference to the gold column, a reference to the test column) and no defaults.
  2. for the general "belongs to a set of words that contain borrowings" analysis, we have a field like borid, which we typically call ref, so if we calculate values here, we can just calculate B-cubed scores, using lingpy.
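Both evaluation modes can be sketched in a few lines. The following is only an illustration of the two ideas, not a proposed implementation: donor_detection follows the three-argument spirit described in point 1 but is simplified to plain dicts instead of a lingpy wordlist, and the B-cubed computation just spells out the formula (lingpy ships its own B-cubed routines for real use):

```python
def donor_detection(gold, test):
    """Precision/recall for per-word donor predictions. gold and test map
    word IDs to a donor language, or None for 'not borrowed'."""
    tp = sum(1 for i, d in test.items() if d is not None and gold.get(i) == d)
    predicted = sum(1 for d in test.values() if d is not None)
    actual = sum(1 for d in gold.values() if d is not None)
    return (tp / predicted if predicted else 0.0,
            tp / actual if actual else 0.0)


def bcubed(gold, test):
    """B-cubed precision/recall for two partitions of the same items,
    given as dicts mapping item ID to a class label (e.g. a BORID)."""
    def classes(part):
        by_label = {}
        for item, label in part.items():
            by_label.setdefault(label, set()).add(item)
        return {item: by_label[label] for item, label in part.items()}

    g, t = classes(gold), classes(test)
    items = list(gold)
    precision = sum(len(g[i] & t[i]) / len(t[i]) for i in items) / len(items)
    recall = sum(len(g[i] & t[i]) / len(g[i]) for i in items) / len(items)
    return precision, recall


gold = {1: "Spanish", 2: None, 3: "Spanish"}
test = {1: "Spanish", 2: "Spanish", 3: None}
print(donor_detection(gold, test))  # → (0.5, 0.5)
```

Both functions take only the gold and test references, matching the "no defaults" convention proposed above.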

Any other aspect we want to evaluate?