lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Two different scores based on aligned sequences... #28

Open fractaldragonflies opened 2 years ago

fractaldragonflies commented 2 years ago

> There are also two different scores based on aligned sequences, which we could add and test for the classifier. I could also look into this, if you want, add an issue and assign it to me ;)

Not sure what scores these would be.

I already access the LexStat object and use align_pairs to compute SCA and LexStat alignments returning distances, if these are the different scores you mean.

Could you take a look at how I did my partial methods for SCA and LexStat too? I think my including SCA both as a LexStat align_pairs method and separately as a pairwise function may be redundant. The LexStat method is obviously different, but I see no gain from it either… except in more processing time.

LinguList commented 2 years ago

Can you point me to the code?

fractaldragonflies commented 2 years ago

Sorry, I didn't check my email for a while.

In classifier.py, lines 107-118, I define the partial functions.

By default the partial methods are not used, just the functions. The --method argument lets us set SCA, LEX, or both.
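As a rough sketch of what such dispatch could look like, here is a minimal, self-contained example of binding method-specific keyword arguments with functools.partial. The function name, the METHODS mapping, and the toy set-overlap distance are all illustrative stand-ins, not the actual sabor code, which calls into lingpy's alignment machinery:

```python
from functools import partial

def pairwise_distance(seq_a, seq_b, mode="global"):
    """Toy stand-in for an alignment-based distance in [0, 1].

    The real code would delegate to lingpy's pairwise or LexStat
    scoring; here we just use set overlap so the example runs."""
    shared = len(set(seq_a) & set(seq_b))
    total = max(len(set(seq_a) | set(seq_b)), 1)
    score = 1 - shared / total
    return score if mode == "global" else min(1.0, score + 0.05)

# Map --method values to ready-to-call two-argument scoring functions.
METHODS = {
    "sca": partial(pairwise_distance, mode="global"),
    "sca-overlap": partial(pairwise_distance, mode="overlap"),
}

dist = METHODS["sca"]("tomate", "tomatl")  # a distance in [0, 1]
```

The advantage of partials here is that the cross-validation driver can look up any --method value and call the result with just the two sequences, regardless of which keyword arguments the underlying scorer needs.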

** Note: finalizing cognate-based with edit-dist, AND I think I solved the problem of access to LexStat. Will test with cross-validate too.

fractaldragonflies commented 2 years ago

In adding edit-dist to the cognate-based method, I saw a huge drop between train and test, indicative of an error in how I did the test.

Indeed there was an error. It had little effect on the SCA cognate-based results because the default constructor used the SCA method, although there was some degradation because the default mode was 'overlap' while training was with 'global'.

But there was a significant difference with edit-dist, because I trained on edit-dist but tested on SCA using the optimal threshold for edit-dist. Now there is little loss in F1 score on test.

With this finding I was able to return to using LexStat with the cognate-based method as well. I still needed to solve the scorer issue, which I did by just copying from the training scorer. But now LexStat also works for the cognate-based method... and quite well, even if it takes a lot more training time.
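The bug described above (optimizing the threshold on one distance function, then predicting with another) can be made concrete with a small sketch. This is not the sabor code, just a minimal illustration of the protocol: the threshold chosen on training distances must be applied to test distances computed with the same method:

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 on empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(distances, labels, grid):
    """Pick the threshold maximizing F1 on *training* data.

    A pair is predicted positive (borrowing) when its distance
    is at or below the threshold."""
    def score(t):
        tp = sum(d <= t and y for d, y in zip(distances, labels))
        fp = sum(d <= t and not y for d, y in zip(distances, labels))
        fn = sum(d > t and y for d, y in zip(distances, labels))
        return f1(tp, fp, fn)
    return max(grid, key=score)
```

Training with edit-dist but predicting with SCA at the edit-dist-optimal threshold, as in the bug above, breaks this invariant and deflates the test F1 even when both distances are individually well calibrated.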

Here are updated results for cognate-based SCA and new results for edit-dist and LexStat.

CB F1 score: 0.77 (ned) <~ 0.78 (sca) < 0.80 (lex).

I'll look at Closest more carefully too. A cursory review says it's not a problem, but given this, it's worth checking more carefully.

cldfbench sabor.crossvalidate 10 --method cbned
10-fold cross-validation on splits directory using cognate_based_cognate_ned.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
35.0  17.0  880.0   90.0        0.841     0.720  0.776  0.776       0.949         0.66  0
44.0  11.0  895.0   96.0        0.897     0.686  0.777  0.777       0.947         0.66  1
36.0  18.0  903.0   75.0        0.806     0.676  0.735  0.735       0.948         0.66  2
64.0  13.0  797.0  135.0        0.912     0.678  0.778  0.778       0.924         0.66  3
48.0  17.0  830.0  120.0        0.876     0.714  0.787  0.787       0.936         0.66  4
41.0  25.0  834.0   96.0        0.793     0.701  0.744  0.744       0.934         0.66  5
38.0  21.0  852.0  110.0        0.840     0.743  0.789  0.789       0.942         0.66  6
47.0  12.0  865.0   85.0        0.876     0.644  0.742  0.742       0.942         0.66  7
38.0  24.0  944.0   96.0        0.800     0.716  0.756  0.756       0.944         0.66  8
42.0  17.0  878.0  141.0        0.892     0.770  0.827  0.827       0.945         0.66  9
43.3  17.5  867.8  104.4        0.853     0.705  0.771  0.771       0.941         0.66  mean
 8.5   4.8   42.0   21.6        0.043     0.036  0.028  0.028       0.008         0.00  stdev
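For reference, the cbned runs above use a normalized edit distance. Under the usual definition (assumed here) this is the Levenshtein distance divided by the length of the longer sequence, so the 0.66 threshold sits on a 0-1 scale:

```python
def ned(a, b):
    """Normalized edit distance: Levenshtein distance divided by
    the length of the longer sequence, giving a score in [0, 1]."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, n, 1)
```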

cldfbench sabor.crossvalidate 10 --method cbsca  [Uses corrected predict]
10-fold cross-validation on splits directory using cognate_based_cognate_sca.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
32.0  20.0  877.0   93.0        0.823     0.744  0.782  0.782       0.949         0.46  0
45.0  13.0  893.0   95.0        0.880     0.679  0.766  0.766       0.945         0.46  1
36.0  17.0  904.0   75.0        0.815     0.676  0.739  0.739       0.949         0.46  2
61.0  14.0  796.0  138.0        0.908     0.693  0.786  0.786       0.926         0.44  3
44.0  16.0  831.0  124.0        0.886     0.738  0.805  0.805       0.941         0.46  4
35.0  19.0  840.0  102.0        0.843     0.745  0.791  0.791       0.946         0.46  5
33.0  14.0  859.0  115.0        0.891     0.777  0.830  0.830       0.954         0.46  6
50.0  18.0  859.0   82.0        0.820     0.621  0.707  0.707       0.933         0.46  7
34.0  18.0  950.0  100.0        0.847     0.746  0.794  0.794       0.953         0.46  8
42.0  15.0  880.0  141.0        0.904     0.770  0.832  0.832       0.947         0.44  9
41.2  16.4  868.9  106.5        0.862     0.719  0.783  0.783       0.944         0.46  mean
 9.2   2.4   42.6   22.4        0.036     0.050  0.038  0.038       0.009         0.01  stdev

cldfbench sabor.crossvalidate 10 --method cblex
10-fold cross-validation on splits directory using cognate_based_cognate_lex.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
39.0  11.0  886.0   86.0        0.887     0.688  0.775  0.775       0.951         0.74  0
41.0   6.0  900.0   99.0        0.943     0.707  0.808  0.808       0.955         0.74  1
32.0  10.0  911.0   79.0        0.888     0.712  0.790  0.790       0.959         0.74  2
66.0   7.0  803.0  133.0        0.950     0.668  0.785  0.785       0.928         0.74  3
49.0  11.0  836.0  119.0        0.915     0.708  0.799  0.799       0.941         0.74  4
32.0  14.0  845.0  105.0        0.882     0.766  0.820  0.820       0.954         0.74  5
37.0  10.0  863.0  111.0        0.917     0.750  0.825  0.825       0.954         0.74  6
47.0   9.0  868.0   85.0        0.904     0.644  0.752  0.752       0.944         0.74  7
37.0  12.0  956.0   97.0        0.890     0.724  0.798  0.798       0.956         0.74  8
42.0  13.0  882.0  141.0        0.916     0.770  0.837  0.837       0.949         0.74  9
42.2  10.3  875.0  105.5        0.909     0.714  0.799  0.799       0.949         0.74  mean
10.1   2.5   42.8   20.7        0.024     0.041  0.025  0.025       0.009         0.00  stdev

LinguList commented 2 years ago

Interesting. But for the LexStat results, I am not sure if it really makes sense to report them. The extremely high threshold of 0.75 is in strong contrast with the typically best threshold for cognate detection of 0.55. SCA, on the other hand, shows the same threshold of 0.45, which we also inferred to be the best based on comparison across many datasets. Similarly, NED is around 0.7 in our tests, and 0.66 is again consistent here, right? So even if these LexStat scores seem to add something, they rest on a strong contrast between the thresholds we typically use to find cognates and those we need to find borrowings. So I am not entirely sure we can justify this properly.

fractaldragonflies commented 2 years ago

Additional runs on overlap and local, and repeat runs (to show consistency), separated by module: Closest, Cognate, Classifier.

Test F1 scores:

Closest: edit-dist 0.76, sca-global 0.79, sca-overlap 0.78, sca-local 0.78
Cognate: edit-dist 0.77, sca-global 0.78, sca-overlap 0.78, sca-local 0.78, lex-global 0.80 (no CVs for lex overlap or local run yet)
Classifier: svm-full 0.81, svm-fast 0.80, svm-all-funcs 0.81 (did not repeat CVs for other combinations yet)

Here all-funcs means edit-dist, sca-global, sca-overlap, sca-local; svm-full means edit-dist, sca-global, lexstat-sca, lexstat-lex.

No conflicts with the previous runs above, other than for the original cognate-based SCA results already noted.

I'll push these files in a separate branch as well, along with the additions for Cognate edit-dist, the correction to predict, and other tweaks to support better reporting of runs.

10-fold-CV-runs-Classify.txt 10-fold-CV-runs-Closest-CV.txt 10-fold-CV-runs-Cognate-CV.txt Individual-commands-fold-0.txt

LinguList commented 2 years ago

Thanks. I'd only ask to not put things in separate branches. I find it very difficult to organize my code in branches and prefer to use a branch only to advance in one direction and then merge.

fractaldragonflies commented 2 years ago

> Thanks. I'd only ask to not put things in separate branches. I find it very difficult to organize my code in branches and prefer to use a branch only to advance in one direction and then merge.

Sorry, I pushed before seeing your note not to put it in a branch with the other update. In the future, I'll put such changes in main.

fractaldragonflies commented 2 years ago

> Interesting. But for the LexStat results, I am not sure if it really makes sense to report them. The extremely high threshold of 0.75 is in strong contrast with the typically best threshold for cognate detection of 0.55. SCA, on the other hand, shows the same threshold of 0.45, which we also inferred to be the best based on comparison across many datasets. Similarly, NED is around 0.7 in our tests, and 0.66 is again consistent here, right? So even if these LexStat scores seem to add something, they rest on a strong contrast between the thresholds we typically use to find cognates and those we need to find borrowings. So I am not entirely sure we can justify this properly.

OK, I'll leave the LexStat results out of the proposed tables and writing for now. In previous exploration I've done with SCA and LexStat, LexStat has always run higher than SCA for the optimal threshold, but 0.75 is higher than I've seen in my trials too (0.65 was more typical). Note this was LexStat with upgma, global.

LinguList commented 2 years ago

LexStat with UPGMA normally has 0.6. Note that LexStat has a smoothing component, according to which the score incorporates the SCA score to some degree, in order to make sure that it can still yield some interesting results in the absence of good signal. So I assume that with higher thresholds LexStat comes closer to SCA, or reaches some limbo where we cannot really guarantee whether the results are good or bad.
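The smoothing described here can be sketched as a weighted blend of the two scores. The 2:1 ratio below is only an assumed illustration; lingpy's actual ratio and scoring internals may differ:

```python
def combined_score(lexstat_score, sca_score, ratio=(2, 1)):
    """Illustrative smoothing: blend the language-specific LexStat
    score with the SCA score in the given ratio (assumed 2:1 here)."""
    w1, w2 = ratio
    return (w1 * lexstat_score + w2 * sca_score) / (w1 + w2)
```

With a blended score like this, raising the threshold admits pairs whose LexStat component is weak but whose SCA component is strong, which matches the intuition that high-threshold LexStat drifts toward SCA behavior.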

LinguList commented 2 years ago

Note also that it is easier to ignore it in general for now because:

fractaldragonflies commented 2 years ago

Agreed. LexStat's advantage over SCA seems to be, in part, the increased precision attainable as a function of the number of runs. Here, precision is about 50 points (0.05) higher than 'sca' for runs=5,000. What if we optimized on the FB score, where B is the relative importance of recall versus precision? [Not for now, but to consider.]

10-fold cross-validation on splits directory using cognate_based_cognate_lex_global.

 fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
42.3   9.9  875.4  105.4        0.911     0.713  0.799  0.799       0.950         0.74  mean
10.4   3.3   42.0   21.0        0.034     0.045  0.030  0.030       0.008         0.00  stdev

I had already implemented the FB score, where B is configurable but defaults to 1.0, giving the F1 score. We could easily change the optimization criterion to prefer recall over precision with B=2 (recall twice as important).
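The FB (F-beta) score mentioned here follows the standard weighted harmonic mean of precision and recall; a minimal version of the configurable-B scoring could look like:

```python
def fb_score(precision, recall, b=1.0):
    """F-beta: weighted harmonic mean of precision and recall.
    b > 1 weights recall more heavily; b = 1 reduces to F1."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + b**2) * precision * recall / (b**2 * precision + recall)
```

With b=2, recall's weight in the denominator is four times precision's, so optimizing FB instead of F1 would push the selected thresholds toward higher recall.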

LinguList commented 2 years ago

We can consider exploring the LexStat behavior in a future study, definitely.

LinguList commented 2 years ago

For now, though, any distraction from a first paper should be avoided. So sticking to the major plan (which I may not have made clear enough) of ignoring all method="lexstat" results is crucial, I think. I explicitly gave up on dealing with LexStat methods when we decided to ignore the multi-threshold method, as it requires more data, etc., and does not fit the dominant-language scenario very well.