What is nice for a potential paper here is that we have a very clear baseline, two competing methods, and one new method.
@LinguList, where do we go from here?
We have four methods that are seemingly feature-complete for donor-focused borrowing detection. We have the previously constructed 10-fold train-test division, and the detail_evaluate module, which computes the donor-focused F1 score and related measures, overall and by language.
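For reference, the scores in the tables below follow the standard confusion-count definitions; a quick sketch of the formulas (not the detail_evaluate code itself):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. fold 0 of the closest-match run below: tp=117, fp=11, fn=36
prf(117, 11, 36)  # -> (0.914, 0.765, 0.833), matching the table
```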
With respect to venues for publishing a potential paper, today is the COLING due date, so that's an obvious no-go.
EMNLP's due date is June 24, so it's at least a possibility.
I'll follow up with Roberto again on a final set of questions I had with respect to the KayPano annotations I had made. Hopefully we can have something in Edictor to ask you to review and see if it can be finalized. That would serve as the next step in further trying out these methods.
Also, I'll proceed to upgrade the cross-validation to work across our methods.
I'll also take a more careful look at @LinguList's classifier-based method.
Can I ask you, @fractaldragonflies, to assign tasks to me in issues? Something like "check cognate borrowing detection" or "make lexstat work only once", so that I can cross them off later when I find time?
@fractaldragonflies, just figured I should give you examples for adding more functions. In fact, the local alignment function already accounts for only extracting the valid part of an alignment, so the distance calculated there should be reliable.
One can just vary the mode keyword among "local", "global", "overlap", and "dialign", so we have four functions instead of one. As a variant of the edit distance, you can do:
```python
from lingpy.align.pairwise import nw_align, pid

def percentage_identity(almA, almB, mode=1):
    # align first, then score the percentage identity of the alignment
    almA, almB, _ = nw_align(almA, almB)
    return pid(almA, almB, mode=mode)
```
There are four modes, and one can vary them as well. These are not distances, but similarities.
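For illustration, a minimal sketch of cycling through the four pid modes on a toy alignment (mode numbering as in the lingpy reference):

```python
from lingpy.align.pairwise import nw_align, pid

# align two toy sequences, then score the alignment under each mode
almA, almB, _ = nw_align(list("vamos"), list("bamos"))
for mode in (1, 2, 3, 4):
    print(mode, pid(almA, almB, mode=mode))
```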
If you check what I feed to the SVM, you'll see I feed "props": things like the length of the sequence and the language (represented by an integer).
We know SVMs handle categorical variables well with one-hot encoding, so we may do a one-hot encoding of the languages. I can try to add that later.
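A minimal sketch with scikit-learn's OneHotEncoder (the integer language indices here are made up):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy column of integer language indices, one row per word
languages = np.array([[0], [2], [1], [2]])

encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(languages).toarray()
# each language index becomes its own 0/1 column, so the linear SVM
# can learn a separate weight per language
```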
Working on cross-validation today. OK, I will change command names as part of the cross-validation work, since it will invoke functions from each file. Proposed names below. I hope to get to adding functions to the classifier-based method soon, but if you have time, you're certainly welcome to continue your work on that, including the one-hot language property. [OK, I'll follow your 'open issue' suggestion pronto.] On the multi-threshold variant of the cognate-based method, the LexStat internal_cognate function is already executed only ONCE for the various thresholds that may be used for the external_cognate function. I had realized after a few runs that the external_cognate trials were based on the same internal_cognate arguments!
Yes, that may be useful! Maybe we can change the command names? Shell scripts rarely have underscores, and something simple and telling would be very nice, something like simply `evaluate`.

Adding functions is simple; we can try local, overlap, and all kinds of variants. With `partial` they can be created quite easily, or one just makes explicit functions, all fine. Having fitted the SVM, one can check for feature importance, which may be useful.
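For instance, a sketch of building the four variants with functools.partial, assuming lingpy's pw_align returns (almA, almB, score):

```python
from functools import partial
from lingpy.align.pairwise import pw_align

def alignment_score(seqA, seqB, mode="global"):
    """Align two token sequences and return the alignment score."""
    almA, almB, score = pw_align(seqA, seqB, mode=mode)
    return score

# partial turns one parameterized function into four ready-made features
score_functions = {
    mode: partial(alignment_score, mode=mode)
    for mode in ("global", "local", "overlap", "dialign")
}
score_functions["overlap"](list("vamos"), list("bamos"))
```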
How about these names?
Added a branch for the cross-validation work. Pushed one commit to the branch to show my work, with just closest-match functioning but with the possibility to extend to the others. It's a bit complicated, given all the flexibility we have for the various methods. The current solution is a first effort; I considered making it config-file driven, but it's coded in the script for now.
Here are results for closest-match:
(lingsabor) sabor % cldfbench sabor.crossvalidate 10
INFO 10-fold cross-validation on splits directory.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
36.0 11.0 816.0 117.0 0.914 0.765 0.833 0.833 0.952 0.40 0
59.0 15.0 890.0 98.0 0.867 0.624 0.726 0.726 0.930 0.40 1
49.0 15.0 827.0 137.0 0.901 0.737 0.811 0.811 0.938 0.40 2
64.0 8.0 853.0 113.0 0.934 0.638 0.758 0.758 0.931 0.40 3
32.0 13.0 913.0 89.0 0.873 0.736 0.798 0.798 0.957 0.40 4
53.0 20.0 843.0 105.0 0.840 0.665 0.742 0.742 0.929 0.40 5
40.0 17.0 830.0 109.0 0.865 0.732 0.793 0.793 0.943 0.40 6
54.0 8.0 899.0 104.0 0.929 0.658 0.770 0.770 0.942 0.40 7
58.0 20.0 872.0 88.0 0.815 0.603 0.693 0.693 0.925 0.40 8
41.0 18.0 880.0 116.0 0.866 0.739 0.797 0.797 0.944 0.40 9
48.6 14.5 862.3 107.6 0.880 0.690 0.772 0.772 0.939 0.40 mean
10.8 4.5 33.3 14.5 0.039 0.058 0.043 0.043 0.010 0.00 stdev
Nice! This is the closest match with edit distance or with SCA?
The previous run was with SCA (mean = 0.77); here is NED (mean = 0.75). I added the name of the function to the output. Is there a way to suppress the progress bar and merely show the folds? I print the fold number with a line feed, but only so it is not erased by the progress bar. It would otherwise look like
0 1 2 3 4 5 6 7 8 9
$ cldfbench sabor.crossvalidate 10
INFO 10-fold cross-validation on splits directory using edit_distance.
folds:
0
1
2
3
4
5
6
7
8
9
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
48.0 17.0 810.0 105.0 0.861 0.686 0.764 0.764 0.934 0.60 0
58.0 9.0 896.0 99.0 0.917 0.631 0.747 0.747 0.937 0.60 1
58.0 12.0 830.0 128.0 0.914 0.688 0.785 0.785 0.932 0.60 2
64.0 12.0 849.0 113.0 0.904 0.638 0.748 0.748 0.927 0.60 3
39.0 18.0 908.0 82.0 0.820 0.678 0.742 0.742 0.946 0.60 4
60.0 15.0 848.0 98.0 0.867 0.620 0.723 0.723 0.927 0.60 5
51.0 13.0 834.0 98.0 0.883 0.658 0.754 0.754 0.936 0.60 6
47.0 15.0 892.0 111.0 0.881 0.703 0.782 0.782 0.942 0.60 7
63.0 27.0 865.0 83.0 0.755 0.568 0.648 0.648 0.913 0.60 8
46.0 14.0 884.0 111.0 0.888 0.707 0.787 0.787 0.943 0.60 9
53.4 15.2 861.6 102.8 0.869 0.658 0.748 0.748 0.934 0.60 mean
8.4 4.9 32.6 14.0 0.049 0.044 0.041 0.041 0.010 0.00 stdev
With the progress bar, it depends: in lingpy, you need to access the logger, which is a bit tedious; in all of our own code, we can just remove it. But this means we have a mean F-score of 0.75 for edit distance, while we have 0.77 for SCA, which is already better, even if not by much so far.
Here is the result for cognate-based SCA. Mean F-score = 0.75, on par with closest-match NED.
10-fold cross-validation on splits directory using cognate_based_cognate_sca.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
39.0 20.0 807.0 114.0 0.851 0.745 0.794 0.794 0.940 0.46 0
55.0 25.0 880.0 102.0 0.803 0.650 0.718 0.718 0.925 0.46 1
43.0 25.0 817.0 143.0 0.851 0.769 0.808 0.808 0.934 0.46 2
57.0 15.0 846.0 120.0 0.889 0.678 0.769 0.769 0.931 0.46 3
28.0 30.0 896.0 93.0 0.756 0.769 0.762 0.762 0.945 0.46 4
47.0 25.0 838.0 111.0 0.816 0.703 0.755 0.755 0.929 0.46 5
37.0 39.0 808.0 112.0 0.742 0.752 0.747 0.747 0.924 0.46 6
48.0 34.0 873.0 110.0 0.764 0.696 0.728 0.728 0.923 0.46 7
54.0 34.0 858.0 92.0 0.730 0.630 0.676 0.676 0.915 0.46 8
39.0 30.0 868.0 118.0 0.797 0.752 0.774 0.774 0.935 0.46 9
44.7 27.7 849.1 111.5 0.800 0.714 0.753 0.753 0.930 0.46 mean
9.2 7.1 31.2 14.6 0.053 0.050 0.038 0.038 0.009 0.00 stdev
Here is the result for cognate-based multi-threshold LexStat. Mean F-score = 0.75, on par with closest-match NED and cognate-based SCA.
10-fold cross-validation on splits directory using cognate_based_multi_threshold_lexstat.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
50.0 7.0 820.0 103.0 0.936 0.673 0.783 0.783 0.942 0.36 0
68.0 7.0 898.0 89.0 0.927 0.567 0.704 0.704 0.929 0.36 1
55.0 11.0 831.0 131.0 0.923 0.704 0.799 0.799 0.936 0.36 2
72.0 5.0 856.0 105.0 0.955 0.593 0.732 0.732 0.926 0.36 3
37.0 10.0 916.0 84.0 0.894 0.694 0.781 0.781 0.955 0.36 4
61.0 9.0 854.0 97.0 0.915 0.614 0.735 0.735 0.931 0.36 5
47.0 12.0 835.0 102.0 0.895 0.685 0.776 0.776 0.941 0.36 6
62.0 3.0 904.0 96.0 0.970 0.608 0.747 0.747 0.939 0.36 7
72.0 12.0 880.0 74.0 0.860 0.507 0.638 0.638 0.919 0.36 8
50.0 9.0 889.0 107.0 0.922 0.682 0.784 0.784 0.944 0.36 9
57.4 8.5 868.3 98.8 0.920 0.633 0.748 0.748 0.936 0.36 mean
11.6 3.0 33.6 15.3 0.032 0.065 0.049 0.049 0.010 0.00 stdev
And finally with the classifier, using just sca_distance, the doculect index, and token length, with a linear SVM. I'll push the cross-validation changes to the repository and begin experimenting with the classifier now. Results are consistent with closest match using SCA distance. I'll try the one-hot encoding of the language index to see if it improves results.
10-fold cross-validation on splits directory using classifier_based_SVM_linear_sca.
fn fp tn tp precision recall f1 fb accuracy fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ------
43.0 7.0 820.0 110.0 0.940 0.719 0.815 0.815 0.949 0
67.0 9.0 896.0 90.0 0.909 0.573 0.703 0.703 0.928 1
53.0 10.0 832.0 133.0 0.930 0.715 0.809 0.809 0.939 2
67.0 6.0 855.0 110.0 0.948 0.621 0.751 0.751 0.930 3
38.0 10.0 916.0 83.0 0.892 0.686 0.776 0.776 0.954 4
55.0 12.0 851.0 103.0 0.896 0.652 0.755 0.755 0.934 5
43.0 12.0 835.0 106.0 0.898 0.711 0.794 0.794 0.945 6
54.0 2.0 905.0 104.0 0.981 0.658 0.788 0.788 0.947 7
62.0 13.0 879.0 84.0 0.866 0.575 0.691 0.691 0.928 8
45.0 10.0 888.0 112.0 0.918 0.713 0.803 0.803 0.948 9
52.7 9.1 867.7 103.5 0.918 0.662 0.768 0.768 0.940 mean
10.4 3.3 33.5 15.0 0.033 0.057 0.043 0.043 0.010 stdev
I discovered the reason why Portuguese was showing up in some of my recently stored wordlists: I had created the k-fold splits of the training data before I had updated my local database to drop Portuguese. So while I had since updated my database, the separate k-fold splits were still based on the previous dataset.
In the revision I am doing now on the classifier, I will include an update to the k-fold data split, which will be without Portuguese.
Since the splits will be new, the results of the runs reported above will change, but they should change little in mean and standard deviation.
Here are new results without any contamination of train or test by Portuguese.
Classifier, now using both SCA and NED distances:
10-fold cross-validation on splits directory using classifier_based_SVM_linear_sca_ned.
fn fp tn tp precision recall f1 fb accuracy fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ------
51.0 6.0 837.0 105.0 0.946 0.673 0.787 0.787 0.943 0
43.0 8.0 844.0 110.0 0.932 0.719 0.812 0.812 0.949 1
51.0 12.0 914.0 85.0 0.876 0.625 0.730 0.730 0.941 2
60.0 6.0 877.0 112.0 0.949 0.651 0.772 0.772 0.937 3
43.0 10.0 931.0 90.0 0.900 0.677 0.773 0.773 0.951 4
49.0 6.0 806.0 121.0 0.953 0.712 0.815 0.815 0.944 5
59.0 7.0 781.0 112.0 0.941 0.655 0.772 0.772 0.931 6
46.0 5.0 853.0 118.0 0.959 0.720 0.822 0.822 0.950 7
38.0 7.0 943.0 98.0 0.933 0.721 0.813 0.813 0.959 8
51.0 6.0 909.0 120.0 0.952 0.702 0.808 0.808 0.948 9
49.1 7.3 869.5 107.1 0.934 0.685 0.790 0.790 0.945 mean
7.0 2.2 54.4 12.5 0.026 0.034 0.029 0.029 0.008 stdev
Closest match, SCA (global):
10-fold cross-validation on splits directory using sca_distance.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
52.0 14.0 829.0 104.0 0.881 0.667 0.759 0.759 0.934 0.40 0
43.0 15.0 837.0 110.0 0.880 0.719 0.791 0.791 0.942 0.40 1
52.0 20.0 906.0 84.0 0.808 0.618 0.700 0.700 0.932 0.40 2
59.0 12.0 871.0 113.0 0.904 0.657 0.761 0.761 0.933 0.40 3
44.0 14.0 927.0 89.0 0.864 0.669 0.754 0.754 0.946 0.40 4
52.0 13.0 799.0 118.0 0.901 0.694 0.784 0.784 0.934 0.40 5
54.0 10.0 778.0 117.0 0.921 0.684 0.785 0.785 0.933 0.40 6
43.0 12.0 846.0 121.0 0.910 0.738 0.815 0.815 0.946 0.40 7
32.0 18.0 932.0 104.0 0.852 0.765 0.806 0.806 0.954 0.40 8
55.0 17.0 898.0 116.0 0.872 0.678 0.763 0.763 0.934 0.40 9
48.6 14.5 862.3 107.6 0.879 0.689 0.772 0.772 0.939 0.40 mean
8.0 3.1 53.2 12.5 0.033 0.042 0.033 0.033 0.008 0.00 stdev
Cognate-based SCA and closest-match NED are slightly lower at 0.75.
So we have SVM > closest match > cognate-based, and SCA > NED. Looks nice to me as a result ;)
A couple of issues with the classifier remain to be resolved. I'll push the current solution to the repository for review and attack these issues:
BTW, with the Cognate method, we also had multiple possibilities to consider:
Here is a run with cognate-based SCA using the corrected splits database (similar to before):
10-fold cross-validation on splits directory using cognate_based_cognate_sca.
fn fp tn tp precision recall f1 fb accuracy threshold fold
---- ---- ----- ----- ----------- -------- ----- ----- ---------- ----------- ------
44.0 25.0 818.0 112.0 0.818 0.718 0.765 0.765 0.931 0.46 0
37.0 21.0 831.0 116.0 0.847 0.758 0.800 0.800 0.942 0.46 1
52.0 31.0 895.0 84.0 0.730 0.618 0.669 0.669 0.922 0.46 2
55.0 32.0 851.0 117.0 0.785 0.680 0.729 0.729 0.918 0.46 3
41.0 24.0 917.0 92.0 0.793 0.692 0.739 0.739 0.939 0.46 4
44.0 22.0 790.0 126.0 0.851 0.741 0.792 0.792 0.933 0.46 5
53.0 29.0 759.0 118.0 0.803 0.690 0.742 0.742 0.914 0.46 6
39.0 23.0 835.0 125.0 0.845 0.762 0.801 0.801 0.939 0.46 7
30.0 37.0 913.0 106.0 0.741 0.779 0.760 0.760 0.938 0.46 8
49.0 36.0 879.0 122.0 0.772 0.713 0.742 0.742 0.922 0.46 9
44.4 28.0 848.8 111.8 0.798 0.715 0.754 0.754 0.930 0.46 mean
7.9 5.8 52.5 14.0 0.043 0.048 0.040 0.040 0.010 0.00 stdev
Yes, we have 0/1 output. In the case of the classifier, one can do the sorting based on the SCA distance as well. It does not really hurt, I think. So the decision factor is SCA on 0/1 matches. This sorting by similarity could then also be applied to the Cognate method (where we have similarities anyway in the LexStat class, or can easily compute them).
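Something like this sketch, where the candidates iterable and its fields are hypothetical names:

```python
# among donor candidates the classifier labels 1, keep the donor word
# whose SCA distance to the target word is smallest
matches = [(dist, donor_idx)
           for donor_idx, dist, label in candidates if label == 1]
best_donor = min(matches)[1] if matches else None
```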
Reviewing arguments for the closest module, with the thought of reviewing others as well. With the changes we had made to the propio module, a lot of stuff in run() was just junk. Also, I wanted the module to report the F1 score from the test when a test file is included.
I also experimented with whether individual languages would do better with different thresholds. There are some differences for SCA in the 0.35 to 0.45 range, but test results overall show only modest change (on fold 00), so it's maybe not worth the effort.
Our inclusion of one-hot language features in the classifier is akin to having separate thresholds, in that they provide an intercept for each language.
Not entirely sure where we go from here, especially if we are to have a paper ready for the June 24 EMNLP date.
It occurs to me that besides pairwise scores as features for the classifier, we could also produce n-gram (Markov chain) cross-entropies. [Neural nets seem too burdensome to manage for this.]
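A minimal sketch of what such a feature could look like, using a bigram model with additive smoothing (all names hypothetical):

```python
from collections import Counter
from math import log2

def train_bigrams(words):
    """Count bigrams over token sequences with boundary padding."""
    counts, contexts = Counter(), Counter()
    for tokens in words:
        padded = ["#"] + list(tokens) + ["$"]
        for a, b in zip(padded, padded[1:]):
            counts[a, b] += 1
            contexts[a] += 1
    return counts, contexts

def cross_entropy(tokens, counts, contexts, alpha=0.5, vocab=100):
    """Per-symbol cross-entropy of a word under the bigram model."""
    padded = ["#"] + list(tokens) + ["$"]
    h = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (counts[a, b] + alpha) / (contexts[a] + alpha * vocab)
        h -= log2(p)
    return h / (len(padded) - 1)
```

Trained on donor-language words, a low cross-entropy on a target word would then serve as one more feature hinting at borrowing.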
Feeling a bit 'under the weather' today, but will continue working intermittently throughout the day. Thanks @LinguList.
@LinguList Proposed simplification.
In reviewing closest match and the other modules to see what 'run' and 'args' could look like, it seems to me that the introduction of donor_families (by me, I'm pretty sure) was unnecessary, while complicating stuff in each module. Yes, donor families does work, but just using the donors list would seem to work as well, as long as no target languages in the study are also in a donor family. This is met for the SaBor study, and I think for the KeyPano, where we would have [Spanish, Portuguese].
Here is a snippet from Closest code showing how family is used.
```python
for idx in wordlist:
    if wordlist[idx, "doculect"] in donors:
        concepts[wordlist[idx, concept]][0] += [idx]
    # languages from donor families are not target languages.
    elif wordlist[idx, family] not in donor_families:
        concepts[wordlist[idx, concept]][1] += [idx]
```
Would the 'elif' condition be needed if our donors list is complete?
Similarly for the other modules, and for evaluation, both in lexibank_sabor.py and in evaluate.py.
I propose to use just the donors list for this, which means we could also drop my hack for getting a list of donor_families.
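If the donors list is complete, the loop above would reduce to something like:

```python
for idx in wordlist:
    if wordlist[idx, "doculect"] in donors:
        concepts[wordlist[idx, concept]][0] += [idx]
    else:
        # every non-donor doculect counts as a target language
        concepts[wordlist[idx, concept]][1] += [idx]
```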
Thoughts...
I understand this is a bit artificial. And yes, it was only used because we had Portuguese as a non-donor in our list, where it could also be a donor, but the data would not indicate it. The question is: do we need it anywhere else? Maybe not; we could just restrict ourselves to settings with one donor language vs. other languages. And we don't need the family information now that we decided to drop the multi-threshold method, which is the one explicit situation in which we NEED to distinguish in this regard.
So if you think it will help so much in simplifying, it is okay to drop it for me.
Changes made to add parser arguments and simplify a bit. It still permits multiple donors, and only considers the donor list versus the target/recipient languages, without recourse to family. The only downside is that we don't want extraneous languages from the same family as a donor in the list of target languages. That's not generally a problem, and it's possible to select languages if we apply this in another context. When we put multi-threshold back in, we can retake family as well...
Will push to repository now.
> Yes, we have 0/1 output. In the case of the classifier, one can do the sorting based on the SCA distance as well. It does not really hurt, I think. So the decision factor is SCA on 0/1 matches. This sorting by similarity could then also be applied to the Cognate method (where we have similarities anyway in the LexStat class, or can easily compute them).
How do I access similarities from the LexStat class? I hadn't incorporated Cognate into the Classifier [and maybe won't now, given time constraints with a visit to Cusco part of next week], but this would be a good addition to the Closest match method. Maybe with a less demanding threshold [allowing for higher recall], giving more power to the classifier to weight based on similarity too.
The easiest way is to use the align_pairs method (see here: http://lingpy.org/reference/lingpy.compare.html#lingpy.compare.lexstat.LexStat.align_pairs). This recomputes alignments, but that is not a problem, I'd say.
Another possibility is, of course, to store the scores in a dictionary during training, right? Again, keyed by the IDs of the words, so one can later retrieve them directly.
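A sketch of both suggestions together; the keywords follow the linked align_pairs reference (check it for the exact defaults), while the file name and the candidate_pairs iterable are hypothetical:

```python
from lingpy.compare.lexstat import LexStat

lex = LexStat("training-wordlist.tsv")  # hypothetical training wordlist
lex.get_scorer(runs=1000)               # needed before method="lexstat"

# cache scores during training, keyed by word IDs, so they can be
# retrieved later without realigning
pair_scores = {}
for idA, idB in candidate_pairs:
    pair_scores[idA, idB] = lex.align_pairs(
        idA, idB, method="lexstat", mode="global",
        pprint=False, return_distance=True)
```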
Otherwise (but this seems too complicated for now) there is a class that computes the actual cognates, which creates a distance matrix. I thought of this one, but now think it is less useful.
I am also returning the structure to just saborcommands: dropping the sabor library entirely, and the saborncommands folder, after having dropped the older commands and moved the new commands into the saborcommands folder. Also changed setup.py to reference saborcommands.
Yes, good idea.
If we manage to follow up on what we had so far, we have four methods we discuss:
I'd suggest giving these new names: