lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Add cognate borrowing function. Example files. In progress... #17

Closed fractaldragonflies closed 2 years ago

fractaldragonflies commented 2 years ago

Added the initial version of the cognate module to accompany our closest-match, classifier-based approach. It follows the signature of closest match: it has a cognate_based_donor_search function and a related class, plus predict_on_wordlist, but not yet a simple predict. Train works (I intend to improve it). With no arguments, the script tries out fold 00 of the training data. Work in progress.

Here are results of train:

  iter    threshold    F1 score
------  -----------  ----------
     1          0.6    0.710795
     2          0.7    0.776415
     3          0.8    0.706986
     2          0.7    0.776415

Here are results of detail_evaluate command on train and test from fold 00:

Detection results for: store/CL-predict-CV10-fold-00-train.tsv.
Language             tp    tn    fp    fn    precision    recall    F1 score    accuracy
-----------------  ----  ----  ----  ----  -----------  --------  ----------  ----------
ImbaburaQuechua     221   714    43    65        0.837     0.773       0.804       0.896
Mapudungun          124   936    17    50        0.879     0.713       0.787       0.941
Otomi               129  1828    26    51        0.832     0.717       0.770       0.962
Qeqchi               92  1450    10    55        0.902     0.626       0.739       0.960
Wichi                97   955    11    41        0.898     0.703       0.789       0.953
Yaqui               221   972    19    77        0.921     0.742       0.822       0.926
ZinacantanTzotzil   103   945    15    83        0.873     0.554       0.678       0.914
Overall             987  7800   141   422        0.875     0.700       0.778       0.940

Overall detection results for: store/CL-predict-CV10-fold-00-train.tsv
                  borrowed    not borrowed    total
--------------  ----------  --------------  -------
identified             987             141     1128
not identified         422            7800     8222
total                 1409            7941     9350
Detection results for: store/CL-predict-CV10-fold-00-test.tsv.
Language             tp    tn    fp    fn    precision    recall    F1 score    accuracy
-----------------  ----  ----  ----  ----  -----------  --------  ----------  ----------
ImbaburaQuechua      27    77     6     3        0.818     0.900       0.857       0.920
Mapudungun           12    98     1     4        0.923     0.750       0.828       0.957
Otomi                13   184     3     7        0.812     0.650       0.722       0.952
Qeqchi               10   148     3     5        0.769     0.667       0.714       0.952
Wichi                10    99     1     5        0.909     0.667       0.769       0.948
Yaqui                30   101     3    10        0.909     0.750       0.822       0.910
ZinacantanTzotzil     7   103     0    10        1.000     0.412       0.583       0.917
Overall             109   810    17    44        0.865     0.712       0.781       0.938

Overall detection results for: store/CL-predict-CV10-fold-00-test.tsv
                  borrowed    not borrowed    total
--------------  ----------  --------------  -------
identified             109              17      126
not identified          44             810      854
total                  153             827      980
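The per-language and overall scores in the tables above follow directly from the confusion counts. A minimal sketch of the arithmetic, checked against the overall train row for fold 00 (tp=987, tn=7800, fp=141, fn=422):

```python
def detection_scores(tp, tn, fp, fn):
    """Compute precision, recall, F1, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Overall train row from fold 00:
p, r, f1, acc = detection_scores(tp=987, tn=7800, fp=141, fn=422)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
# → 0.875 0.7 0.778 0.94
```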
LinguList commented 2 years ago

But all in all, I think we are advancing now, so thanks a lot and well done. We can soon make real comparisons and check results.

fractaldragonflies commented 2 years ago

Added LingRex to the existing cognate_borrowing module as a multi-threshold function. There are partial functions for cognate_borrowing and multithreshold_borrowing, with the same Cognate_borrowing class serving both, taking the function as an argument. [Perhaps it needs a different name that better encompasses both the cognate and multi-threshold functions?]
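The "one class, function as argument" pattern described above can be sketched with functools.partial. The names here (detect, BorrowingDetector, cognate_based, multi_threshold) are illustrative stand-ins, not the module's actual API:

```python
from functools import partial

# Hypothetical detection function; the real module binds cognate-based
# and multi-threshold variants sharing one signature in the same way.
def detect(wordlist, donors, threshold, method="lexstat"):
    # ... cluster at `threshold` with `method`, return the settings used
    return {"threshold": threshold, "method": method}

# Partials freeze the method choice, so one class can hold either variant.
cognate_based = partial(detect, method="lexstat")
multi_threshold = partial(detect, method="sca")

class BorrowingDetector:
    """One detector class parameterized by the detection function."""
    def __init__(self, wordlist, func, donors):
        self.wordlist, self.func, self.donors = wordlist, func, donors

    def predict(self, threshold):
        return self.func(self.wordlist, self.donors, threshold)

det = BorrowingDetector(None, func=multi_threshold, donors=["Spanish"])
print(det.predict(0.4))  # → {'threshold': 0.4, 'method': 'sca'}
```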

Performance is on par between Cognate and Multi-Threshold, and between SCA and LexStat; the thresholds that achieve the optimum vary. Output wordlists are compatible with sabor.detail_evaluate.

Issue: for an unknown reason, Portuguese still pops up in my wordlists. It is not on the donor list but is in the donor family, so evaluation rightly ignores it; still, it is unclear where these entries come from, since the CLDF file seems OK.

Potential issue: it seems that different words for the same concept are combined by LingRex. I am not sure whether this is just chance, or whether LingRex intends to treat words of the same language and concept as cognates; e.g., todo and cada were assigned the same Bor_id.

I have not moved any functions to lexibank_sabor yet; so far everything is still in a single file.

Still to do for this task: predict for words and concepts.

LinguList commented 2 years ago

I assume that the cases of cada and todo are due to high thresholds.
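That effect is easy to reproduce with flat clustering: raise the threshold, and moderately distant words for the same concept fall into one cluster. A stdlib sketch of single-linkage flat clustering, with made-up distances (not real SCA or LexStat scores):

```python
def flat_clusters(items, dist, threshold):
    """Greedy single-linkage flat clustering: repeatedly merge any two
    clusters containing a pair of items closer than `threshold`."""
    clusters = [[it] for it in items]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(dist[a][b] < threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

# Illustrative pairwise distance between two Spanish words for 'all/every'.
d = {"todo": {"todo": 0.0, "cada": 0.65},
     "cada": {"todo": 0.65, "cada": 0.0}}

print(flat_clusters(["todo", "cada"], d, threshold=0.6))  # kept separate
print(flat_clusters(["todo", "cada"], d, threshold=0.8))  # merged into one
```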

fractaldragonflies commented 2 years ago

> @fractaldragonflies, just saw that the multi-threshold uses the same method in both cases. This, however, is not intended. We use the lexstat method for internal cognates, and sca for outside cognates. So I am afraid this function is better split into two parts.

The method specified is only used for the internal_cognates, so I don't see a conflict. Although maybe it should always be the LexStat method, in which case I could even drop the argument. Currently it lets us specify either SCA or LexStat for the internal_cognates.

> The advantage is also that the expensive get_scorer function only needs to be run one time. Something I mentioned should also be done for the cognate-based donor search.

I am already running the scorer only once, as the code recognizes when the internal cogids have been added. I hope to add predict for words/concepts using your pattern from closest-match, if applicable. So I don't expect to add too much more code to this.

> We can proceed as follows: if you do not add too much more code, I'll modify this later, making the code more efficient in this regard.

Bienvenido... yes please, you're welcome to improve the use of the get_scorer function in the cognate-based module.

"Override" for LingRex: I would like to avoid creating multiple Bor_id columns for the outside cognates. I was able to avoid this with the cognate-based approach, without warnings about overwriting the column, since the cluster function has an "override" argument. Any plans to make "override" available for the external_cognates function in LingRex?

Btw, the stored wordlist output by closest-match also gets pretty busy, with 10 columns for the different training thresholds. I think I could modify the code without much change to avoid this. What do you think?

This stored wordlist is what I would look at to diagnose successes and failures in the detection of borrowed words. [With a spreadsheet, we can filter and group instead of having to code different reports, although maybe the tp, tn, fp, fn status would be good for filtering on.]

fractaldragonflies commented 2 years ago

Ran several experiments on internal and external thresholds with both LexStat and SCA as the internal method. Except for a high (0.7) internal threshold with SCA (poorer), the internal threshold had almost no impact, and there was almost no difference between the LexStat (upgma or infomap) and SCA methods on peak detection performance. Detection performance generally peaked at ext_threshold=0.4, with an F1 score of 0.76/0.77. Note: at lower ext_thresholds there were differences due to int_threshold, and differences between SCA and LexStat as well.

So both thresholds are operational, but performance is not very sensitive to int_threshold.

Given the rationale of using internal_cognates within families, and the fact that only 2 of the 7 languages are from the same family, there may just not be much opportunity for the internal_cognate function to stand out!
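The experiment loop above can be sketched as a grid search over both thresholds and the internal method. Everything here is a stand-in: run_experiment fakes a response surface peaking near ext_threshold 0.4 (as observed), rather than running real detection:

```python
from itertools import product

def run_experiment(int_threshold, ext_threshold, method):
    """Stand-in for one detection run. The real code would cluster
    internal cognates at int_threshold with `method`, then external
    cognates at ext_threshold, and return the resulting F1 score."""
    # Toy surface: peak at ext_threshold == 0.4, weak int_threshold effect.
    return 0.77 - abs(ext_threshold - 0.4) - 0.01 * int_threshold

results = {}
for it, et, m in product([0.5, 0.6, 0.7], [0.3, 0.4, 0.5],
                         ["lexstat", "sca"]):
    results[(it, et, m)] = run_experiment(it, et, m)

best = max(results, key=results.get)
print(best)  # the best setting has ext_threshold == 0.4
```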

fractaldragonflies commented 2 years ago

Added predict for concepts/entries, modeled after closest match. Added family, since it is required by the cognate-based methods. In the predict module I print the results directly. This is more or less complete based on current requirements, so it is available to merge if OK, and for @LinguList to modify as required.

Example code for predict with the multi_threshold function:

    from lingpy import LexStat

    wl = LexStat(wl)
    bor = CognateBasedBorrowingDetection(
        wl, func=mtbds_sca, donors="Spanish", family="language_family")
    bor.train(thresholds=[0.4], verbose=True)
    bor.predict(
        donors={"Spanish": ["m", "a", "n", "o"]},
        targets={"FakeX": ["m", "a", "n", "u", "ʃ", "k", "a"],
                 "FakeY": ["p", "e", "p", "e", "l"],
                 "FakeZ": ["a", "b", "m", "a", "n", "u"],
                 "RealY": ["p", "a", "p", "e", "r"]},
        families={"Spanish": "IndoEuropean",
                  "FakeX": "FamA", "FakeY": "FamB",
                  "FakeZ": "FamA", "RealY": "FamB"})

Finished the analysis training only at 0.40 (near the optimum anyway):

threshold 0.40, f1 score 0.772
* Training Results *
  index    threshold    F1 score
-------  -----------  ----------
      0          0.4    0.772476
Best threshold 0.400, f1 score 0.772

Results of predict:

[INFO] Analyzing words for concept <undefined>.
evaluation at threshold 0.4
FakeX: Spanish, ['m', 'a', 'n', 'u', 'ʃ', 'k', 'a']: ['m', 'a', 'n', 'o']
FakeZ: Spanish, ['a', 'b', 'm', 'a', 'n', 'u']: ['m', 'a', 'n', 'o']
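The mechanics of predict can be illustrated with a stdlib stand-in: score each donor/target pair by a normalized distance over sound segments, flag pairs under the trained threshold, and skip targets in the donor's own family. Note this sketch uses plain Levenshtein distance, not the SCA/LexStat distances the real method relies on, so it will not reproduce the output above exactly:

```python
def levenshtein(a, b):
    """Plain edit distance over two lists of sound segments."""
    n = len(b)
    d = list(range(n + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[n]

def predict(donors, targets, families, threshold):
    """Flag target words within `threshold` normalized distance of a
    donor word, skipping targets in the donor's own family."""
    hits = []
    for dlang, dword in donors.items():
        for tlang, tword in targets.items():
            if families[tlang] == families[dlang]:
                continue
            dist = levenshtein(dword, tword) / max(len(dword), len(tword))
            if dist <= threshold:
                hits.append((tlang, dlang, tword, dword))
    return hits

donors = {"Spanish": list("mano")}
targets = {"FakeZ": list("abmanu"), "RealY": list("papel")}
families = {"Spanish": "IndoEuropean", "FakeZ": "FamA", "RealY": "FamB"}
print(predict(donors, targets, families, threshold=0.5))
```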
LinguList commented 2 years ago

@fractaldragonflies, I am in favor of adding more columns, as this preserves the decisions and makes all the code much easier to handle, as I think one could also see in the first simple-donor class. The overwrite has never been something that was really safe, and you can just output selected columns by passing a wl.output("tsv", columns=[xxx]) argument. So I see many reasons to work with multiple columns.
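The "keep all columns in memory, export only a subset" idea can be sketched with the stdlib csv module; the toy rows and per-threshold bor_* column names are illustrative, not the actual sabor wordlist schema:

```python
import csv
import io

# Toy wordlist rows carrying several per-threshold decision columns.
rows = [
    {"id": 1, "doculect": "Yaqui", "concept": "hand", "form": "mamam",
     "bor_06": 1, "bor_07": 1, "bor_08": 0},
    {"id": 2, "doculect": "Otomi", "concept": "hand", "form": "ʔyɛ",
     "bor_06": 0, "bor_07": 0, "bor_08": 0},
]

def write_tsv(rows, columns, fh):
    """Export only the requested columns; extra keys stay in memory."""
    writer = csv.DictWriter(fh, fieldnames=columns, delimiter="\t",
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

buf = io.StringIO()
write_tsv(rows, ["id", "doculect", "form", "bor_07"], buf)
print(buf.getvalue())
```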

fractaldragonflies commented 2 years ago

@LinguList, should I merge this branch then, to give you leeway to review/revise?

LinguList commented 2 years ago

Yes!