fractaldragonflies opened 2 years ago
Can you point me to the code?
Sorry, I didn't check my email for a while.
In classifier.py, lines 107-118, I define the partial functions.
By default the partial methods are not used, just the functions. The --method argument lets us set SCA, LEX, or both.
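For illustration, here is a minimal sketch of the partial-function pattern, not the actual code in classifier.py: a plain distance function plus `functools.partial` wrappers that fix keyword arguments ahead of time, so each --method choice maps to a uniform `f(a, b)` callable. The names `edit_distance`, `ned`, and `METHODS` are hypothetical.

```python
from functools import partial

def edit_distance(a, b, normalized=True):
    """Levenshtein distance between two sequences, optionally
    normalized by the length of the longer sequence."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                          # deletion
                          curr[j - 1] + 1,                      # insertion
                          prev[j - 1] + (a[i - 1] != b[j - 1])) # substitution
        prev = curr
    dist = prev[n]
    return dist / max(m, n, 1) if normalized else dist

# Partials freeze the keyword arguments, so the classifier can treat
# every configured measure as a two-argument callable.
ned = partial(edit_distance, normalized=True)
raw_ed = partial(edit_distance, normalized=False)

# Hypothetical registry keyed by a --method style argument.
METHODS = {"ned": ned, "ed": raw_ed}
```

The same registry idea extends to SCA and LexStat measures by adding entries whose partials fix the alignment mode and other keywords.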
Note: I am finalizing cognate-based with edit-dist, AND I think I have solved the problem of access to LexStat. I will test with cross-validate too.
In adding edit-dist to the cognate-based method, I saw a huge drop between train and test, indicative of an error in how I ran the test.
Indeed there was an error. It had little effect on the SCA cognate-based results, because the default constructor used the SCA method, although there was some degradation because the default mode was 'overlap' while training was with 'global'.
But there was a significant difference with edit-dist, because I trained on edit-dist and then tested on SCA using the optimal threshold for edit-dist. With the fix, there is little loss in F1 score from train to test.
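The fix amounts to carrying the training-time settings into prediction rather than falling back on constructor defaults. A minimal sketch of that pattern, with hypothetical names (`TrainedSettings`, `predict`) and the assumption that distances below the tuned threshold flag a candidate borrowing:

```python
from dataclasses import dataclass

@dataclass
class TrainedSettings:
    """Settings fixed at training time that predict() must reuse.

    Mixing them up (e.g. tuning a threshold with edit-dist and then
    predicting with a default SCA/'overlap' configuration) silently
    degrades test scores.
    """
    method: str       # e.g. "edit-dist" or "sca"
    mode: str         # alignment mode, e.g. "global" or "overlap"
    threshold: float  # optimal threshold found on the training folds

def predict(distance, settings):
    """Flag a word pair as a candidate borrowing when its distance,
    computed with the *same* method and mode as training, falls
    below the threshold tuned for that method."""
    return distance < settings.threshold

# Hypothetical values matching the cbned runs reported below.
trained = TrainedSettings(method="edit-dist", mode="global", threshold=0.66)
```

The design point is simply that method, mode, and threshold travel together as one object, so test-time code cannot accidentally pair one method's threshold with another method's distances.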
With this finding I was able to return to using LexStat with cognate-based as well. I still needed to solve the scorer issue, which I did by just copying from the training scorer. But now LexStat also works for cognate-based... and quite well, even if it takes a lot more training time.
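Copying the scorer can be sketched as below. `PairScorer` and `Model` are toy stand-ins, not LingPy's actual classes, and the scores are made up; the point is only the reuse pattern: estimate the expensive scorer once on the training data, then assign it to the test-side object instead of re-estimating it there.

```python
class PairScorer:
    """Minimal stand-in for a trained scorer: maps segment pairs to scores."""
    def __init__(self, scores):
        self.scores = dict(scores)

    def __call__(self, a, b):
        # Symmetric lookup with a default penalty for unseen pairs.
        return self.scores.get((a, b), self.scores.get((b, a), -1.0))

class Model:
    """Toy model holding a scorer; real LexStat estimates its scorer
    with an expensive permutation procedure."""
    def __init__(self):
        self.scorer = None

    def train(self, scores):
        self.scorer = PairScorer(scores)

train_model = Model()
train_model.train({("p", "b"): 2.0, ("a", "a"): 5.0})  # illustrative scores

test_model = Model()
# Instead of re-estimating a scorer on the (smaller) test data,
# reuse the scorer learned at training time:
test_model.scorer = train_model.scorer
```

Reusing the training scorer also keeps train and test scoring consistent, which matters for the threshold comparability discussed above.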
Here are updated results for Cognate based SCA and new results for edit-dist and LexStat.
Cognate-based F1 scores: 0.77 (ned) <~ 0.78 (sca) < 0.80 (lex).
I'll look at Closest more carefully too. A cursory review says it's not a problem, but given this finding, it's worth checking more carefully.
cldfbench sabor.crossvalidate 10 --method cbned
10-fold cross-validation on splits directory using cognate_based_cognate_ned.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
35.0 17.0 880.0 90.0 0.841 0.720 0.776 0.776 0.949 0.66 0
44.0 11.0 895.0 96.0 0.897 0.686 0.777 0.777 0.947 0.66 1
36.0 18.0 903.0 75.0 0.806 0.676 0.735 0.735 0.948 0.66 2
64.0 13.0 797.0 135.0 0.912 0.678 0.778 0.778 0.924 0.66 3
48.0 17.0 830.0 120.0 0.876 0.714 0.787 0.787 0.936 0.66 4
41.0 25.0 834.0 96.0 0.793 0.701 0.744 0.744 0.934 0.66 5
38.0 21.0 852.0 110.0 0.840 0.743 0.789 0.789 0.942 0.66 6
47.0 12.0 865.0 85.0 0.876 0.644 0.742 0.742 0.942 0.66 7
38.0 24.0 944.0 96.0 0.800 0.716 0.756 0.756 0.944 0.66 8
42.0 17.0 878.0 141.0 0.892 0.770 0.827 0.827 0.945 0.66 9
43.3 17.5 867.8 104.4 0.853 0.705 0.771 0.771 0.941 0.66 mean
8.5 4.8 42.0 21.6 0.043 0.036 0.028 0.028 0.008 0.00 stdev
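For reference, the derived columns follow directly from the confusion counts. A small helper (hypothetical name `scores`) reproduces the fold-0 row of the cbned table above:

```python
def scores(tp, fp, tn, fn):
    """Compute precision, recall, F1, and accuracy from the
    confusion counts reported per fold."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Fold 0 of the cbned run: fn=35, fp=17, tn=880, tp=90
p, r, f1, acc = scores(tp=90, fp=17, tn=880, fn=35)
# Rounds to precision 0.841, recall 0.720, f1 0.776, accuracy 0.949,
# matching the table row.
```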
cldfbench sabor.crossvalidate 10 --method cbsca [Uses corrected predict]
10-fold cross-validation on splits directory using cognate_based_cognate_sca.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
32.0 20.0 877.0 93.0 0.823 0.744 0.782 0.782 0.949 0.46 0
45.0 13.0 893.0 95.0 0.880 0.679 0.766 0.766 0.945 0.46 1
36.0 17.0 904.0 75.0 0.815 0.676 0.739 0.739 0.949 0.46 2
61.0 14.0 796.0 138.0 0.908 0.693 0.786 0.786 0.926 0.44 3
44.0 16.0 831.0 124.0 0.886 0.738 0.805 0.805 0.941 0.46 4
35.0 19.0 840.0 102.0 0.843 0.745 0.791 0.791 0.946 0.46 5
33.0 14.0 859.0 115.0 0.891 0.777 0.830 0.830 0.954 0.46 6
50.0 18.0 859.0 82.0 0.820 0.621 0.707 0.707 0.933 0.46 7
34.0 18.0 950.0 100.0 0.847 0.746 0.794 0.794 0.953 0.46 8
42.0 15.0 880.0 141.0 0.904 0.770 0.832 0.832 0.947 0.44 9
41.2 16.4 868.9 106.5 0.862 0.719 0.783 0.783 0.944 0.46 mean
9.2 2.4 42.6 22.4 0.036 0.050 0.038 0.038 0.009 0.01 stdev
cldfbench sabor.crossvalidate 10 --method cblex
10-fold cross-validation on splits directory using cognate_based_cognate_lex.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
39.0 11.0 886.0 86.0 0.887 0.688 0.775 0.775 0.951 0.74 0
41.0 6.0 900.0 99.0 0.943 0.707 0.808 0.808 0.955 0.74 1
32.0 10.0 911.0 79.0 0.888 0.712 0.790 0.790 0.959 0.74 2
66.0 7.0 803.0 133.0 0.950 0.668 0.785 0.785 0.928 0.74 3
49.0 11.0 836.0 119.0 0.915 0.708 0.799 0.799 0.941 0.74 4
32.0 14.0 845.0 105.0 0.882 0.766 0.820 0.820 0.954 0.74 5
37.0 10.0 863.0 111.0 0.917 0.750 0.825 0.825 0.954 0.74 6
47.0 9.0 868.0 85.0 0.904 0.644 0.752 0.752 0.944 0.74 7
37.0 12.0 956.0 97.0 0.890 0.724 0.798 0.798 0.956 0.74 8
42.0 13.0 882.0 141.0 0.916 0.770 0.837 0.837 0.949 0.74 9
42.2 10.3 875.0 105.5 0.909 0.714 0.799 0.799 0.949 0.74 mean
10.1 2.5 42.8 20.7 0.024 0.041 0.025 0.025 0.009 0.00 stdev
Interesting. But for the lexstat results, I am not sure if it makes really sense to report them. The extremely high threshold of 0.75 is in strong contrast with the typically best threshold for cognate detection of 0.55. SCA, on the other hand, shows the same threshold of 0.45, which we also inferred to be the best based on comparison with lots of datasets. Similarly, NED is around 0.7 in our tests, and 0.66 is again consistent here, right? So even if these lexstat scores seem to add something, they are based on a strong contrast between the thresholds we typically use to find cognates and those which we need to find borrowings. So I am not entirely sure if we can justify this properly.
Additional runs on overlap and local modes, plus repeat runs (to show consistency), separated by module: Closest, Cognate, Classifier.
Test F1 scores.
Closest: edit-dist 0.76, sca-global 0.79, sca-overlap 0.78, sca-local 0.78.
Cognate: edit-dist 0.77, sca-global 0.78, sca-overlap 0.78, sca-local 0.78, lex-global 0.80 (no CVs for lex overlap or local run yet).
Classifier: svm-full 0.81, svm-fast 0.80, svm-all-funcs 0.81 (did not repeat CVs for other combinations yet).
Here, all-funcs means edit-dist, sca-global, sca-overlap, sca-local; svm-full means edit-dist, sca-global, lexstat-sca, lexstat-lex.
No conflicts with the previous runs above, other than the original cognate-based SCA results already noted.
I'll push these files in a separate branch as well, along with the additions for cognate edit-dist, the correction to predict, and other tweaks to support better reporting of runs.
10-fold-CV-runs-Classify.txt 10-fold-CV-runs-Closest-CV.txt 10-fold-CV-runs-Cognate-CV.txt Individual-commands-fold-0.txt
Thanks. I'd only ask to not put things in separate branches. I find it very difficult to organize my code in branches and prefer to use a branch only to advance in one direction and then merge.
Sorry, I pushed before seeing your note not to put it in a branch with the other update. In the future, I'll put such changes in main.
OK, I'll leave the LexStat results out of the proposed tables and writing for now. In previous exploration I've done with SCA and LexStat, LexStat has always had a higher optimal threshold than SCA, but 0.75 is higher than I've seen in my trials too (0.65 was more typical). Note this was LexStat with upgma, global.
LexStat with UPGMA normally has 0.6. Note that LexStat has a smoothing component, by which the score is composed of the SCA score to some degree, in order to make sure that it can still yield some interesting results in the absence of good signal. So I assume that with higher thresholds LexStat comes closer to SCA, or reaches some limbo where we cannot really tell whether the results are good or bad.
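A rough sketch of what such smoothing might look like as a weighted blend. The function name and the `(2, 1)` weighting are assumptions for illustration, not a claim about LingPy's exact internals:

```python
def smoothed_score(lexstat_score, sca_score, ratio=(2, 1)):
    """Blend a language-pair-specific LexStat score with the
    corresponding language-independent SCA score.

    The weighting (assumed here, not LingPy's documented default)
    makes the combined score lean on LexStat while falling back
    toward SCA when the data-driven signal is weak."""
    w_lex, w_sca = ratio
    return (w_lex * lexstat_score + w_sca * sca_score) / (w_lex + w_sca)
```

Under such a blend the combined score always lies between the two inputs, which is consistent with the intuition above that LexStat results drift toward SCA behavior.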
Note also that it is easier to ignore it in general for now because:
Agreed. LexStat's advantage over sca seems to be, in part, the increased precision attainable as a function of the number of runs. Here, precision is about 0.05 higher than 'sca' for runs=5,000. What if we optimized on the F-beta score, where beta sets the relative importance of recall versus precision? [Not for now, but to consider.]
10-fold cross-validation on splits directory using cognate_based_cognate_lex_global.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
42.3 9.9 875.4 105.4 0.911 0.713 0.799 0.799 0.950 0.74 mean
10.4 3.3 42.0 21.0 0.034 0.045 0.030 0.030 0.008 0.00 stdev
I had already implemented the F-beta score, where beta is configurable but defaults to 1.0, giving the F1 score. We could easily change the optimization criterion to prefer recall over precision with beta=2 (recall twice as important).
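For concreteness, the standard F-beta definition (the function name here is illustrative, not necessarily the one in the codebase):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta=1 gives the usual F1; beta>1 weights
    recall more heavily (beta=2 counts recall twice as much);
    beta<1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

With means like those in the cbned table (precision 0.853, recall 0.705), beta=2 pulls the score down toward recall, so optimizing F2 would push threshold selection toward higher-recall settings.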
We can definitely consider exploring the LexStat behavior in a future study.
For now, though, any distraction from a first paper should be avoided. So sticking to the major plan -- which I may not have made clear enough -- of ignoring all method="lexstat" results is crucial, I think. I explicitly gave up dealing with lexstat methods when we decided to ignore the multi-threshold method, as it requires more data, etc., and does not fit the dominant-language scenario very well.
Not sure what scores these would be.
I do already access the LexStat object and use align_pairs to compute SCA and LexStat alignments, returning distances, if those are the different scores you mean.
Could you take a look at how I did my partial methods for SCA and LexStat too? I think my including SCA both as a LexStat align_pairs method and separately as a pairwise function may be redundant. The LexStat method is obviously different, but I see no gain from the duplicate SCA path either… except more processing time.