lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

Naming of Methods #15

Closed LinguList closed 1 year ago

LinguList commented 2 years ago

If we manage to follow up on what we had so far, we have four methods we discuss:

  1. using a direct comparison of potential donor words with potential target words (this method has so far not been discussed, but is very simple and obvious)
  2. computing cognates in general across families and then defining those crossing families as borrowed (this method was used in Hantgan et al. 2022)
  3. computing cognates with two thresholds, one language-internally, one outside (method described in List and Forkel 2022)
  4. a new, classifier-based supervised borrowing detection procedure (our method), in which we compute various statistics for pairwise string similarities (and potentially more than that) and feed them to a classifier (an SVM or similar). The classifier then makes a pairwise yes/no decision for each potential donor and potential target

I'd suggest giving these methods new names:

| Name | Abbreviation | Note |
| --- | --- | --- |
| Closest Match | CM | what we called "pairwise" so far; it is rather a "closest match" |
| Cognate-Based | CB | what we called "family"; it is based on identifying cognates |
| Multi-Threshold | MT | what we called "lingrex"; it is based on two thresholds |
| Classifier-Based | ClfB | what we called "SVM"; it is based on a classifier, which can be an NN, SVM, etc. |
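To make the classifier-based (ClfB) idea concrete, here is a minimal sketch with scikit-learn; the feature layout and data are purely illustrative, not the actual sabor implementation:

```python
# Sketch of ClfB: pairwise string-similarity features for
# (donor word, target word) pairs feed a binary classifier.
from sklearn.svm import SVC

# Each row: [SCA distance, normalized edit distance, target-language index]
# (values below are fabricated for illustration)
X_train = [
    [0.10, 0.15, 0],  # very similar strings: likely borrowing
    [0.85, 0.90, 0],  # dissimilar pair: not a borrowing
    [0.20, 0.25, 1],
    [0.75, 0.80, 1],
]
y_train = [1, 0, 1, 0]  # 1 = borrowed from donor, 0 = not

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
print(clf.predict([[0.12, 0.18, 0]]))  # → [1], a pairwise yes/no decision
```

Any classifier with the same fit/predict interface (e.g. a small neural network) could be swapped in.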
LinguList commented 2 years ago

What is nice for a potential paper here is that we have a very clear baseline, two competing methods, and one new method.

fractaldragonflies commented 2 years ago

@LinguList, where do we go from here?

We have 4 methods that are seemingly feature complete for donor-focused borrowing detection. We have the previously constructed 10-fold train-test division, and the detail_evaluate module, which computes the donor-focused F1 score and related measures, overall and by language.

  1. I could upgrade the previous cross_validate_pairwise function to work with all methods, since the signature is the same and the key column names in the results are the same.
  2. With respect to the classifier-based method, I had hoped for improved detection, since it uses pairwise methods that themselves aren't bad and, moreover, it can treat each language independently.
    2.1. I haven't carefully examined the SVM model implemented, so I don't know whether independent treatment of languages is indeed part of the model (or if it is even useful).
    2.2. Should we add more functions to the classifier-based method?
LinguList commented 2 years ago
  1. Yes, that may be useful! Maybe we can change the command names? Shell scripts rarely have underscores, and something simple and telling would be very nice, something like simply evaluate?
  2. Adding functions is simple; we can try local, overlap, and all kinds of variants. With partial they can be created quite easily, or one can just write explicit functions, all fine. Having fitted the SVM, one can check for feature importance, which may be useful.
  3. EMNLP may be a good venue.

Can I ask you, @fractaldragonflies, to assign tasks to me in issues? Like "check cognate borrowing detection" or "make lexstat work only once", so that I can cross them off later, when I find time?

LinguList commented 2 years ago

@fractaldragonflies, just figured I should give you examples for adding more functions. In fact, the local alignment function already accounts for only extracting the valid part of an alignment, so the distance calculated there should be reliable.

One can just vary the mode in the kw among "local", "global", "overlap", and "dialign", so we have four functions instead of one. As a variant of the edit distance, you can do:

from lingpy.align.pairwise import nw_align, pid

def percentage_identity(almA, almB, mode=1):
    # align first, then compute the percentage identity of the alignment
    almA, almB, _ = nw_align(almA, almB)
    return pid(almA, almB, mode=mode)

pid has 4 modes, which one can vary again. Note that these are not distances but similarities.
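The mode variants mentioned above don't each need a hand-written function; here is a minimal sketch using functools.partial, with a toy stand-in for the actual alignment-based scorer:

```python
# Sketch: one named scoring function per alignment mode via functools.partial.
# `alignment_distance` is a toy stand-in for the real lingpy-based function.
from functools import partial

def alignment_distance(wordA, wordB, mode="global"):
    # The real function would run a pairwise alignment in the requested
    # mode and return a distance; here we just tag the mode for illustration.
    return f"{mode}:{wordA}-{wordB}"

# Four functions instead of one, as suggested:
global_dist  = partial(alignment_distance, mode="global")
local_dist   = partial(alignment_distance, mode="local")
overlap_dist = partial(alignment_distance, mode="overlap")
dialign_dist = partial(alignment_distance, mode="dialign")

print(local_dist("casa", "kasa"))  # → local:casa-kasa
```

Each partial has the same two-argument signature, so they can be dropped into the same feature-extraction loop.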

LinguList commented 2 years ago

If you check what I feed to the SVM, you will see I feed "props": things like the length of the sequence and the language (represented by an integer).

We know SVMs work well with one-hot encodings of categorical variables. So we might one-hot encode the languages. I can try to add that later.
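A minimal sketch of what one-hot encoding the language index would look like; the feature layout is illustrative:

```python
# Sketch: replace an integer language index with a 0/1 indicator vector,
# so the SVM does not treat language IDs as ordered quantities.
def one_hot(index, n_languages):
    vec = [0] * n_languages
    vec[index] = 1
    return vec

# A feature row [sca_distance, language_index] becomes
# [sca_distance] + one_hot(language_index, n_languages):
row = [0.25, 2]
encoded = row[:1] + one_hot(row[1], 4)
print(encoded)  # → [0.25, 0, 0, 1, 0]
```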

fractaldragonflies commented 2 years ago

Working on cross-validation today. OK, I will change the command names as part of the cross-validation work, since it will invoke functions from each file. Proposed names below. I hope to get to adding functions to the classifier-based method soon, but if you have time, you're certainly welcome to continue your work on that, including the one-hot language property. [OK, I'll follow your 'open issue' suggestion pronto.]

On the multi-threshold variant of cognate-based: the LexStat internal_cognate function is already executed only ONCE for the various thresholds that may be used for the external_cognate function. I had realized after a few runs that the external_cognate trials were based on the same internal_cognate arguments!

>   1. Yes, that may be useful! Maybe we can change the command names? Shell scripts rarely have underscores, and something simple and telling would be very nice, something like simply evaluate?
>   2. Adding functions is simple; we can try local, overlap, and all kinds of variants. With partial they can be created quite easily, or one can just write explicit functions, all fine. Having fitted the SVM, one can check for feature importance, which may be useful.

How about these names?

  1. 'closestmatch' or just 'closest' or even 'match' [funny word closest!]
  2. 'cognatebased' or just 'cognate' [Includes both original cognate based and multi-threshold versions. Would you want these split up? Class is the same, but function is different.]
  3. 'classifierbased' or just 'classifier'
  4. 'evaluate' [The detail_evaluate function replacing evaluate.]
  5. 'crossvalidate' or just 'validate' or even 'crossval' OR 'kfold' or 'kfoldcv'? [My preference is for at least whole names for commands, but I know that's just my bias.]
fractaldragonflies commented 2 years ago

Added a branch for the cross-validation work and pushed one commit to show my progress, with just closest-match functioning so far but with the possibility to extend to the others. It is a bit complicated, given all the flexibility we have across the various methods. The current solution is a first effort; I'm thinking it could be config-file driven, but it is coded in the script for now.
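The fold loop can be sketched roughly as follows; `run_fold` is a hypothetical stand-in for training and evaluating one split, and the scores fed in here are the per-fold f1 values from the closest-match table below:

```python
# Sketch of the k-fold driver: run each fold, collect a score, report
# mean and stdev. `run_fold` stands in for the real per-split evaluation.
from statistics import mean, stdev

def cross_validate(k, run_fold):
    scores = [run_fold(fold) for fold in range(k)]
    return mean(scores), stdev(scores)

# Per-fold f1 scores as reported for closest match with SCA:
f1_scores = [0.833, 0.726, 0.811, 0.758, 0.798,
             0.742, 0.793, 0.770, 0.693, 0.797]
m, s = cross_validate(10, lambda fold: f1_scores[fold])
print(round(m, 3), round(s, 3))  # → 0.772 0.043
```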

Here are results for closest-match:

(lingsabor) sabor % cldfbench sabor.crossvalidate 10
INFO    10-fold cross-validation on splits directory.                                                                              
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
36.0  11.0  816.0  117.0        0.914     0.765  0.833  0.833       0.952         0.40  0
59.0  15.0  890.0   98.0        0.867     0.624  0.726  0.726       0.930         0.40  1
49.0  15.0  827.0  137.0        0.901     0.737  0.811  0.811       0.938         0.40  2
64.0   8.0  853.0  113.0        0.934     0.638  0.758  0.758       0.931         0.40  3
32.0  13.0  913.0   89.0        0.873     0.736  0.798  0.798       0.957         0.40  4
53.0  20.0  843.0  105.0        0.840     0.665  0.742  0.742       0.929         0.40  5
40.0  17.0  830.0  109.0        0.865     0.732  0.793  0.793       0.943         0.40  6
54.0   8.0  899.0  104.0        0.929     0.658  0.770  0.770       0.942         0.40  7
58.0  20.0  872.0   88.0        0.815     0.603  0.693  0.693       0.925         0.40  8
41.0  18.0  880.0  116.0        0.866     0.739  0.797  0.797       0.944         0.40  9
48.6  14.5  862.3  107.6        0.880     0.690  0.772  0.772       0.939         0.40  mean
10.8   4.5   33.3   14.5        0.039     0.058  0.043  0.043       0.010         0.00  stdev
LinguList commented 2 years ago

Nice! This is the closest match with edit distance or with SCA?

fractaldragonflies commented 2 years ago

Previous was with SCA (mean = 0.77). Here is the NED (mean = 0.75). I added the name of the function to the output. Is there a way to suppress the progress bar and merely show the folds? I print each fold with a LF, but only so it is not erased by the progress bar. Otherwise it would look like: 0 1 2 3 4 5 6 7 8 9

 $ cldfbench sabor.crossvalidate 10
INFO    10-fold cross-validation on splits directory using edit_distance.
folds:
0
1
2
3
4
5
6
7
8
9

  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
48.0  17.0  810.0  105.0        0.861     0.686  0.764  0.764       0.934         0.60  0
58.0   9.0  896.0   99.0        0.917     0.631  0.747  0.747       0.937         0.60  1
58.0  12.0  830.0  128.0        0.914     0.688  0.785  0.785       0.932         0.60  2
64.0  12.0  849.0  113.0        0.904     0.638  0.748  0.748       0.927         0.60  3
39.0  18.0  908.0   82.0        0.820     0.678  0.742  0.742       0.946         0.60  4
60.0  15.0  848.0   98.0        0.867     0.620  0.723  0.723       0.927         0.60  5
51.0  13.0  834.0   98.0        0.883     0.658  0.754  0.754       0.936         0.60  6
47.0  15.0  892.0  111.0        0.881     0.703  0.782  0.782       0.942         0.60  7
63.0  27.0  865.0   83.0        0.755     0.568  0.648  0.648       0.913         0.60  8
46.0  14.0  884.0  111.0        0.888     0.707  0.787  0.787       0.943         0.60  9
53.4  15.2  861.6  102.8        0.869     0.658  0.748  0.748       0.934         0.60  mean
 8.4   4.9   32.6   14.0        0.049     0.044  0.041  0.041       0.010         0.00  stdev
LinguList commented 2 years ago

With the progress bar, it depends: in lingpy, you need to access the logger, which is a bit tedious; in all of our own code, we can just remove it. But this means we have a mean F-score of 0.75 for edit distance versus 0.77 for SCA, which is already better, even if not by much so far.

fractaldragonflies commented 2 years ago

Here is the result from cognate-based SCA. Mean F-score = 0.75, on par with closest match NED.

10-fold cross-validation on splits directory using cognate_based_cognate_sca.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
39.0  20.0  807.0  114.0        0.851     0.745  0.794  0.794       0.940         0.46  0
55.0  25.0  880.0  102.0        0.803     0.650  0.718  0.718       0.925         0.46  1
43.0  25.0  817.0  143.0        0.851     0.769  0.808  0.808       0.934         0.46  2
57.0  15.0  846.0  120.0        0.889     0.678  0.769  0.769       0.931         0.46  3
28.0  30.0  896.0   93.0        0.756     0.769  0.762  0.762       0.945         0.46  4
47.0  25.0  838.0  111.0        0.816     0.703  0.755  0.755       0.929         0.46  5
37.0  39.0  808.0  112.0        0.742     0.752  0.747  0.747       0.924         0.46  6
48.0  34.0  873.0  110.0        0.764     0.696  0.728  0.728       0.923         0.46  7
54.0  34.0  858.0   92.0        0.730     0.630  0.676  0.676       0.915         0.46  8
39.0  30.0  868.0  118.0        0.797     0.752  0.774  0.774       0.935         0.46  9
44.7  27.7  849.1  111.5        0.800     0.714  0.753  0.753       0.930         0.46  mean
 9.2   7.1   31.2   14.6        0.053     0.050  0.038  0.038       0.009         0.00  stdev
fractaldragonflies commented 2 years ago

Here is the result for cognate-based multi-threshold LexStat. Mean F-score = 0.75, on par with closest match NED and cognate-based SCA.

10-fold cross-validation on splits directory using cognate_based_multi_threshold_lexstat.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
50.0   7.0  820.0  103.0        0.936     0.673  0.783  0.783       0.942         0.36  0
68.0   7.0  898.0   89.0        0.927     0.567  0.704  0.704       0.929         0.36  1
55.0  11.0  831.0  131.0        0.923     0.704  0.799  0.799       0.936         0.36  2
72.0   5.0  856.0  105.0        0.955     0.593  0.732  0.732       0.926         0.36  3
37.0  10.0  916.0   84.0        0.894     0.694  0.781  0.781       0.955         0.36  4
61.0   9.0  854.0   97.0        0.915     0.614  0.735  0.735       0.931         0.36  5
47.0  12.0  835.0  102.0        0.895     0.685  0.776  0.776       0.941         0.36  6
62.0   3.0  904.0   96.0        0.970     0.608  0.747  0.747       0.939         0.36  7
72.0  12.0  880.0   74.0        0.860     0.507  0.638  0.638       0.919         0.36  8
50.0   9.0  889.0  107.0        0.922     0.682  0.784  0.784       0.944         0.36  9
57.4   8.5  868.3   98.8        0.920     0.633  0.748  0.748       0.936         0.36  mean
11.6   3.0   33.6   15.3        0.032     0.065  0.049  0.049       0.010         0.00  stdev
fractaldragonflies commented 2 years ago

And finally with the classifier, using just the sca_distance, the doculect index, and the tokens length with a linear SVM. I'll push the cross-validation changes to the repository and begin experimenting with the classifier now. Results are consistent with closest match using SCA distance. I'll try the one-hot encoding for the language index to see if it improves results.

10-fold cross-validation on splits directory using classifier_based_SVM_linear_sca.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  ------
43.0   7.0  820.0  110.0        0.940     0.719  0.815  0.815       0.949  0
67.0   9.0  896.0   90.0        0.909     0.573  0.703  0.703       0.928  1
53.0  10.0  832.0  133.0        0.930     0.715  0.809  0.809       0.939  2
67.0   6.0  855.0  110.0        0.948     0.621  0.751  0.751       0.930  3
38.0  10.0  916.0   83.0        0.892     0.686  0.776  0.776       0.954  4
55.0  12.0  851.0  103.0        0.896     0.652  0.755  0.755       0.934  5
43.0  12.0  835.0  106.0        0.898     0.711  0.794  0.794       0.945  6
54.0   2.0  905.0  104.0        0.981     0.658  0.788  0.788       0.947  7
62.0  13.0  879.0   84.0        0.866     0.575  0.691  0.691       0.928  8
45.0  10.0  888.0  112.0        0.918     0.713  0.803  0.803       0.948  9
52.7   9.1  867.7  103.5        0.918     0.662  0.768  0.768       0.940  mean
10.4   3.3   33.5   15.0        0.033     0.057  0.043  0.043       0.010  stdev
fractaldragonflies commented 2 years ago

I discovered the reason why Portuguese was showing up in some of my recently stored wordlists: I had created the k-fold splits of the training data before I had updated my local database to drop Portuguese. So while I had since updated the database, the separate k-fold splits were still based on the previous dataset.

In the revision I am doing now on the classifier, I will include an update to the k-fold data split, which will be without Portuguese.

Since the splits will be new, the results of the runs reported above will change, but they should change little in mean and stdev.

fractaldragonflies commented 2 years ago

Here are new results without any contamination of train/test by Portuguese.

Classifier using:

10-fold cross-validation on splits directory using classifier_based_SVM_linear_sca_ned.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  ------
51.0   6.0  837.0  105.0        0.946     0.673  0.787  0.787       0.943  0
43.0   8.0  844.0  110.0        0.932     0.719  0.812  0.812       0.949  1
51.0  12.0  914.0   85.0        0.876     0.625  0.730  0.730       0.941  2
60.0   6.0  877.0  112.0        0.949     0.651  0.772  0.772       0.937  3
43.0  10.0  931.0   90.0        0.900     0.677  0.773  0.773       0.951  4
49.0   6.0  806.0  121.0        0.953     0.712  0.815  0.815       0.944  5
59.0   7.0  781.0  112.0        0.941     0.655  0.772  0.772       0.931  6
46.0   5.0  853.0  118.0        0.959     0.720  0.822  0.822       0.950  7
38.0   7.0  943.0   98.0        0.933     0.721  0.813  0.813       0.959  8
51.0   6.0  909.0  120.0        0.952     0.702  0.808  0.808       0.948  9
49.1   7.3  869.5  107.1        0.934     0.685  0.790  0.790       0.945  mean
 7.0   2.2   54.4   12.5        0.026     0.034  0.029  0.029       0.008  stdev

Closest match SCA global

Closest Match - SCA

10-fold cross-validation on splits directory using sca_distance.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
52.0  14.0  829.0  104.0        0.881     0.667  0.759  0.759       0.934         0.40  0
43.0  15.0  837.0  110.0        0.880     0.719  0.791  0.791       0.942         0.40  1
52.0  20.0  906.0   84.0        0.808     0.618  0.700  0.700       0.932         0.40  2
59.0  12.0  871.0  113.0        0.904     0.657  0.761  0.761       0.933         0.40  3
44.0  14.0  927.0   89.0        0.864     0.669  0.754  0.754       0.946         0.40  4
52.0  13.0  799.0  118.0        0.901     0.694  0.784  0.784       0.934         0.40  5
54.0  10.0  778.0  117.0        0.921     0.684  0.785  0.785       0.933         0.40  6
43.0  12.0  846.0  121.0        0.910     0.738  0.815  0.815       0.946         0.40  7
32.0  18.0  932.0  104.0        0.852     0.765  0.806  0.806       0.954         0.40  8
55.0  17.0  898.0  116.0        0.872     0.678  0.763  0.763       0.934         0.40  9
48.6  14.5  862.3  107.6        0.879     0.689  0.772  0.772       0.939         0.40  mean
 8.0   3.1   53.2   12.5        0.033     0.042  0.033  0.033       0.008         0.00  stdev

Cognate-based SCA and closest match NED are slightly lower at 0.75.

LinguList commented 2 years ago

So we have SVM > closest match > cognate-based, and SCA > NED. Looks nice to me as a result ;)

fractaldragonflies commented 2 years ago

A couple of issues with the classifier to resolve. I'll push the current solution to the repository for review and attack these issues:

  1. Predictions are given only as 0/1.
    1.1. I need to generate '' or 'Spanish' in the current case of checking for Spanish donor words.
    1.2. I also need to generate a reference to the corresponding source_ref.
  2. I thought there was a discrepancy in the number of entries between the original data and the prediction, but that was an error on my part.
  3. Using a reverse sort and choosing the initial prediction gets a matching entry.
    3.1. With multiple donors, we would need to order by some measure of distance between donor and target, given that they are already shown to be related.

BTW, with the Cognate method, we also had multiple possibilities to consider:

  1. I applied the same solution of taking the entry ref from the first pair among those that qualify.
  2. This is not an adequate solution either when we have multiple donors.
  3. I am thinking that I could use pairwise alignment in this multiple-candidate case to pick the closest fit.
fractaldragonflies commented 2 years ago

Here is a run with cognate-based SCA using the corrected splits database (similar to before):

10-fold cross-validation on splits directory using cognate_based_cognate_sca.
  fn    fp     tn     tp    precision    recall     f1     fb    accuracy    threshold  fold
----  ----  -----  -----  -----------  --------  -----  -----  ----------  -----------  ------
44.0  25.0  818.0  112.0        0.818     0.718  0.765  0.765       0.931         0.46  0
37.0  21.0  831.0  116.0        0.847     0.758  0.800  0.800       0.942         0.46  1
52.0  31.0  895.0   84.0        0.730     0.618  0.669  0.669       0.922         0.46  2
55.0  32.0  851.0  117.0        0.785     0.680  0.729  0.729       0.918         0.46  3
41.0  24.0  917.0   92.0        0.793     0.692  0.739  0.739       0.939         0.46  4
44.0  22.0  790.0  126.0        0.851     0.741  0.792  0.792       0.933         0.46  5
53.0  29.0  759.0  118.0        0.803     0.690  0.742  0.742       0.914         0.46  6
39.0  23.0  835.0  125.0        0.845     0.762  0.801  0.801       0.939         0.46  7
30.0  37.0  913.0  106.0        0.741     0.779  0.760  0.760       0.938         0.46  8
49.0  36.0  879.0  122.0        0.772     0.713  0.742  0.742       0.922         0.46  9
44.4  28.0  848.8  111.8        0.798     0.715  0.754  0.754       0.930         0.46  mean
 7.9   5.8   52.5   14.0        0.043     0.048  0.040  0.040       0.010         0.00  stdev
LinguList commented 2 years ago

Yes, we have 0/1 output. In the case of the classifier, one can do the sorting based on the SCA distance as well. It does not really hurt, I think. So the decision factor is SCA on 0/1 matches. This sorting by similarity could then also be applied to the Cognate method (where we have similarities anyway in the LexStat class, or can easily compute them).
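The selection rule could be sketched like this; the candidate data and word IDs are purely illustrative:

```python
# Sketch: among candidate donor words the classifier flagged with 1,
# pick the one with the smallest SCA distance to the target word.
candidates = [
    # (donor_word_id, predicted, sca_distance) -- fabricated values
    ("spanish-123", 1, 0.35),
    ("spanish-456", 1, 0.18),
    ("spanish-789", 0, 0.60),
]

matches = [c for c in candidates if c[1] == 1]
best = min(matches, key=lambda c: c[2]) if matches else None
print(best)  # → ('spanish-456', 1, 0.18)
```

The same tie-breaking by similarity would carry over to the Cognate method, using LexStat similarities instead of SCA distances.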

fractaldragonflies commented 2 years ago

Feeling a bit 'under the weather' today, but will continue working intermittently throughout the day. Thanks @LinguList.

fractaldragonflies commented 2 years ago

@LinguList Proposed simplification.
In reviewing closest match and the others to see what 'run' and 'args' could look like, it seems to me that the introduction of donor_families (my doing, I'm pretty sure) was unnecessary, while complicating stuff for each module. Yes, donor families does work, but just using the donors list would seem to work as well, as long as no target languages in the study are also in a donor family. This is met for the SaBor study, and I think for KeyPano, where we would have [Spanish, Portuguese].

Here is a snippet from Closest code showing how family is used.

    for idx in wordlist:
        if wordlist[idx, "doculect"] in donors:
            concepts[wordlist[idx, concept]][0] += [idx]
        # languages from donor families are not target languages.
        elif wordlist[idx, family] not in donor_families:
            concepts[wordlist[idx, concept]][1] += [idx]

Would the 'elif' condition be needed if our donors list is complete?

Similarly for the other modules, and likewise for evaluation, both in lexibank_sabor.py and in evaluate.py.

I propose to use just the donors list for this, which means we could also drop my hack for getting a list of donor_families.
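Assuming the donors list is complete (no target language shares a donor's family), the split could be simplified to something like this toy sketch, where the dict-based wordlist stands in for the lingpy Wordlist:

```python
# Sketch of the simplified donor/target split using only the donors list,
# with no family check. Data are illustrative.
from collections import defaultdict

donors = {"Spanish"}
wordlist = {  # idx -> (doculect, concept)
    1: ("Spanish", "house"),
    2: ("LangA", "house"),
    3: ("LangB", "house"),
}

concepts = defaultdict(lambda: [[], []])  # concept -> [donor idxs, target idxs]
for idx, (doculect, concept) in wordlist.items():
    if doculect in donors:
        concepts[concept][0].append(idx)
    else:
        # no elif on family needed once the donors list is complete
        concepts[concept][1].append(idx)

print(dict(concepts))  # → {'house': [[1], [2, 3]]}
```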

Thoughts...

LinguList commented 2 years ago

I understand this is a bit artificial. And yes, it was only used because we had Portuguese as a non-donor in our list, where it could also be a donor, but the data would not indicate it. The question is: do we need it anywhere else? Maybe not; we could just restrict ourselves to settings with one donor language versus other languages. And we don't need the family information now that we have decided to drop the multi-threshold approach, as that is an explicit situation in which we NEED to distinguish in this regard.

LinguList commented 2 years ago

So if you think it will help so much in simplifying, it is okay to drop it for me.

fractaldragonflies commented 2 years ago

Changes to add parser arguments and simplify a bit. Still permits multiple donors. It only considers the donor list versus the target/recipient languages, without recourse to family. The only downside is that we don't want extraneous languages from the same family as a donor in the list of target languages. That is not generally a problem, and it is possible to select languages if we apply this in another context. When we put multi-threshold back in, we can bring family back as well...

Will push to repository now.

fractaldragonflies commented 2 years ago

> Yes, we have 0/1 output. In the case of the classifier, one can do the sorting based on the SCA distance as well. It does not really hurt, I think. So the decision factor is SCA on 0/1 matches. This sorting by similarity could then also be applied to the Cognate method (where we have similarities anyway in the LexStat class, or can easily compute them).

How do I access similarities from the LexStat class? I hadn't incorporated Cognate into the Classifier [and maybe won't now, given time constraints with a visit to Cusco part of next week], but this would be a good addition to the Closest match method. Maybe with a less demanding threshold [allowing for higher recall], giving more power to the classifier to weight based on similarity too.

LinguList commented 2 years ago

The easiest way is to use the align_pairs method (see here: http://lingpy.org/reference/lingpy.compare.html#lingpy.compare.lexstat.LexStat.align_pairs). This recomputes alignments, but that is not a problem, I'd say.

Another possibility, of course, is to store the scores in a dictionary during training, right? Again, keyed by the IDs of the words, so one can later retrieve them directly.
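That caching idea might look like this sketch; the scorer here is a toy stand-in for, e.g., LexStat's alignment scoring:

```python
# Sketch: cache pairwise similarity scores during training, keyed by
# word IDs, so they can be retrieved later without recomputing alignments.
score_cache = {}

def pairwise_score(id_a, id_b, compute):
    key = (min(id_a, id_b), max(id_a, id_b))  # order-independent key
    if key not in score_cache:
        score_cache[key] = compute(id_a, id_b)
    return score_cache[key]

# Toy scorer that records how often it is actually called:
calls = []
def toy_scorer(a, b):
    calls.append((a, b))
    return abs(a - b) / 10

pairwise_score(3, 7, toy_scorer)
pairwise_score(7, 3, toy_scorer)  # cache hit: scorer not called again
print(len(calls))  # → 1
```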

LinguList commented 2 years ago

Otherwise (but this seems too complicated for now) there is a class that computes the actual cognates and creates a distance matrix. I thought of this one, but now think it is less useful.

fractaldragonflies commented 2 years ago

I am also returning the structure to just saborcommands: dropping the sabor library entirely, and the saborncommands folder, after having dropped the older commands and moved the new commands to the saborcommands folder. Also a change to setup.py to reference saborcommands.

LinguList commented 2 years ago

Yes, good idea.