Closed. LinguList closed this issue 2 years ago.
I performed the merge, but with commentary. I'll install and play with it shortly.
Here I repeat my commentary from the merge:
So the BIG discrepancies seem to be:
OK, I'm installing the new merge now and will play with this.
I am sorry if I am causing confusion and anger. And there is a lot of confusion remaining. Let me try to explain what my major problems were and are:
The `sabor` library inside a CLDF dataset clearly disrupted the installation. It also conflicts with one major principle: libraries go to PyPI, while data with some extra code that analyses a specific dataset goes to Zenodo.

If you check the two functions I wrote now to make the pairwise comparison work, you see that they are simple enough to be included into lingrex at a later point, if we learn that they work well. They are completely extendable or can be combined, as they have simple parameters, and they follow the major procedure we use in lingpy of writing results, which are also useful for concrete work, to the wordlist file.

The `sabor` library: it is not a "library" in the sense of code that I could use in another program, but really a collection of code that is only usable for this specific analysis. So I do not see where it can later be included into larger libraries, since I have such heavy problems understanding where the actual code lies.

I think the code that exemplifies how a pairwise borrowing detection with explicit donors could work shows that this does not need to be that complicated: we have one function that does the iteration, and it uses another function that does the sequence comparison, plus some parameters that tell how the data should be written to the wordlist / file.

Note that the confusion matrix was just added by me because I was too lazy to look up what part is F-score, what part is accuracy, etc. It has nothing to do with the desired output; it is merely a way to show that storing results in a wordlist itself has many advantages, since the wordlist can be readily read by lingpy, while other TSV formats cannot. It has one more major advantage: the result looks like the gold standard, which also contains a `donor_language`, etc., so one can even open the data in EDICTOR and check the results manually, you see? I have to admit that it did not occur to me before, but obviously this is always the way in which we work in lingpy: we write results to file, and we then check them manually. I always had the feeling that this was NOT possible, so it was a crucial step for me now.
Note, last but not least, that writing a test for the code I wrote is extremely easy once it is included in an extra library. And the code can already be used for any other Lexibank dataset, since it simply takes a lingpy wordlist and a language, and then does the searching. For lingrex and the simple multiple comparison, I'll add similar functions of similarly small size.
If you want, I'll do a careful comparison of the code, but please understand that my major problem throughout all this time was that I could not find the core operation, the core algorithm that was actually doing something, as it was hidden among a lot of other code that was mainly dealing with argument parsing.
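To make the two-function design described above concrete (one function iterating over the wordlist, one doing the sequence comparison, plus a couple of parameters), here is a hypothetical pure-Python sketch. It is not the actual sabor or lingpy code: the function names, the dict-style wordlist, and the threshold default are all my assumptions for illustration.

```python
def edit_distance(seq_a, seq_b):
    """Normalized Levenshtein distance between two sound sequences."""
    m, n = len(seq_a), len(seq_b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    return dist[m][n] / max(m, n)


def pairwise_borrowings(wordlist, donor, threshold=0.3):
    """For each non-donor word, find the closest donor word for the same
    concept; if its distance is below the threshold, mark it as a borrowing.
    Returns {row_id: donor_row_id} for all detected borrowings."""
    donor_rows = [(idx, row) for idx, row in wordlist.items()
                  if row["doculect"] == donor]
    result = {}
    for idx, row in wordlist.items():
        if row["doculect"] == donor:
            continue
        best, best_idx = threshold, None
        for d_idx, d_row in donor_rows:
            if d_row["concept"] != row["concept"]:
                continue  # only compare words expressing the same concept
            score = edit_distance(row["tokens"], d_row["tokens"])
            if score < best:
                best, best_idx = score, d_idx
        if best_idx is not None:
            result[idx] = best_idx
    return result


# Toy wordlist (made-up data, dict of row-id -> row):
wl = {
    1: {"doculect": "Spanish", "concept": "dog", "tokens": list("pero")},
    2: {"doculect": "Qom", "concept": "dog", "tokens": list("peru")},
    3: {"doculect": "Qom", "concept": "sun", "tokens": list("nala")},
}
print(pairwise_borrowings(wl, "Spanish"))  # → {2: 1}
```

The detected pairs could then be written back into a wordlist column and saved to file, which is the point made above: the output stays a wordlist that lingpy and EDICTOR can read.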
Trying to get a handle on my tasks then. I've inserted some queries between quoted parts of your comment.
Then to run all analyses (many thresholds, edit distance vs. SCA), run:
$ cldfbench sabor.get_borrowings --full
I'll later re-name this to "get_borrowings_from_pairs" or similar (so please do not modify the filename now), and add the same for lingrex etc.
To evaluate, just run:
$ cldfbench sabor.evaluate
This yields:
    --------------  --------  ------------  -----
                    borrowed  not borrowed  total
    identified           111             7    118
    not identified      1451         11938  13389
    total               1562         11945  13507
    --------------  --------  ------------  -----
You can modify this script (please do not add too many parameters, let us keep it simple now!) to account for real scores, like accuracy or whatever.
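For reference, the "real scores" mentioned can be read straight off a table like the one above. A quick sketch (the counts come from the example output; the variable names are mine):

```python
# Counts from the confusion matrix: rows = identified / not identified,
# columns = gold "borrowed" / "not borrowed".
tp, fp = 111, 7       # identified:     borrowed / not borrowed
fn, tn = 1451, 11938  # not identified: borrowed / not borrowed

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f-score={f_score:.3f}")
# → accuracy=0.892 precision=0.941 recall=0.071 f-score=0.132
```

So on this run the detector is very precise but misses most borrowings, which the raw table already hints at.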
Another example with SCA:
$ cldfbench sabor.evaluate --file=store/pw-spa-SCA-0.30.tsv
Note that this searches for borrowings only from Spanish, pure classification.
In the earlier pairwise report we captured Spanish and Portuguese individually, and the combination of the two as a language family, for shared cognates that is. I hadn't previously run a Spanish-only analysis, but it certainly was/is part of our capability.
With respect to @LinguList numbered comments...
OK... to work!
I answer your points above in order.
A BORID denotes: this word X and this word Y are either cognate, or one is borrowed from the other. Since the partial approach with a dedicated donor "finds" the source as well (defined by the expert), we definitely also want to have that info in the file, so it is up to me to add this to the current pairwise function. We have the `donor_id`, so we don't need the `donor_value`, as the `donor_id` references the value, right? But sure, it can be added, as this may make checking easier.

My proposal is that I will adjust pairwise now (adding a general BORID) and also add the lingrex and multiple functions (in separate files, I think). Same principle: we can later take the functions that do the main work and put them into the lingrex library, where we'd add unit tests for them.
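A minimal sketch of what assigning a shared BORID to detected pairs could look like. This is my own illustration, assuming the dict-style wordlist and `(recipient, donor)` pair list are as hypothesized; it is not existing sabor or lingrex code:

```python
def assign_borids(wordlist, pairs):
    """Give each detected (recipient, donor) word pair a shared BORID,
    so borrowing clusters can be inspected later, e.g. in EDICTOR.
    Pairs sharing a donor end up in the same cluster."""
    next_id = 1
    for idx in wordlist:
        wordlist[idx]["borid"] = 0  # 0 = not involved in any borrowing
    for recipient, donor in pairs:
        # reuse the donor's id if it already belongs to a cluster
        borid = wordlist[donor]["borid"] or next_id
        if borid == next_id:
            next_id += 1
        wordlist[donor]["borid"] = borid
        wordlist[recipient]["borid"] = borid
    return wordlist
```

Written back to the wordlist file, this column looks just like a cognate-set reference, which is what makes later B-cubed evaluation straightforward.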
I'd leave the evaluation to you, as you have a clearer plan here on what needs to be done, but we'd assume a unified format based on fields (that can be passed as parameters). I will call them "gold" vs. "test" (= automated approach): `source_language`, so the params could be `gold_donor` and `test_donor`. I don't think that for this kind of prediction any other specific test needs to be made: whether the language is the same, or different, etc. But we can also evaluate as we do in lingpy: `donor_detection(wordlist, gold, test)`, three params, namely the wordlist, a reference to the gold column, and a reference to the test column, with no defaults.

For the `borid`, which we typically call `ref`, if we calculate values here, we can just calculate B-cubed scores using lingpy. Any other aspect we want to evaluate?
@fractaldragonflies, let us reboot everything. I propose new code, and specifically a very new way to store results: we now write them to the wordlist in columns, and to store results, we write the wordlist to file. That means we can then evaluate with a separate script, and don't need to do it in the same script.

I illustrate the new procedure with a very lightweight version of pairwise borrowings. It is but a few dozen lines long, does exactly what we want, and I have already run numerous test analyses; it can be easily extended, and the "library" part can later be added to lingrex.
To run this, just type:
This yields first examples.
Then to run all analyses (many thresholds, edit distance vs. SCA), run:
I'll later re-name this to "get_borrowings_from_pairs" or similar (so please do not modify the filename now), and add the same for lingrex etc.
To evaluate, just run:
This yields:
You can modify this script (please do not add too many parameters, let us keep it simple now!) to account for real scores, like accuracy or whatever.
Another example with SCA:
Output is:
Note that this searches for borrowings only from Spanish, pure classification.