lexibank / sabor

CLDF datasets accompanying investigations on automated borrowing detection in SA languages
Creative Commons Attribution 4.0 International

update #14

Closed LinguList closed 2 years ago

LinguList commented 2 years ago

@fractaldragonflies, I did not mean to merge right away but wanted to show the major idea. Now I have found time to expand it and to illustrate what I meant by the predict function.

I proceed now as follows:

  1. Functions that are generic (e.g., the evaluation code) should be placed into the main package lexibank_sabor, so that we can afterwards put them into an external library (e.g., lingrex) and test them.
  2. Only the more specific parts are retained in our commands (I consider the threshold search rather specific, and it was merely there to see how the signature works).
  3. The predict function should not predict ALL at once, but only one batch of the data, as we know from sklearn. So my solution is now to predict only for one concept, which makes the most sense. Using the {idx: tokens} dictionary structure has the advantage of allowing us to even use language names as IDs in illustrative examples (see the sketch after this list).
  4. A predict function for the entire test data is also added; it takes a loaded wordlist as input and just applies the function, which I now call simple_donor_search (as "pairwise" is not descriptive of what the method does).
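
A minimal sketch of what such a per-concept predict function could look like (all names here are hypothetical, not the package's actual API): given {idx: tokens} dictionaries for the donor and target forms of one concept, each target is assigned the closest donor whose distance falls below the threshold. Since the IDs are just dictionary keys, language names work as IDs in illustrative examples.

    # Hypothetical sketch, not the actual lexibank_sabor code.
    def predict_concept(donor_forms, target_forms, distance, threshold):
        """Map each target idx to its closest donor idx below threshold."""
        hits = {}
        for t_idx, t_tokens in target_forms.items():
            scored = sorted(
                (distance(d_tokens, t_tokens), d_idx)
                for d_idx, d_tokens in donor_forms.items())
            if scored and scored[0][0] < threshold:
                hits[t_idx] = scored[0][1]
        return hits
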
LinguList commented 2 years ago

Your idea to compute distances only once is very good. I found a workaround that "fakes" a function but pulls out the previously stored values, so I consider this accounted for now.
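
To make the workaround concrete for readers, here is a hedged sketch of the "fake function" idea (the actual code may differ): wrap the real distance function so that each pair is computed only once, and later calls just pull the stored value.

    # Hedged sketch, assuming tokens are lists of segments.
    def make_cached_distance(distance):
        cache = {}
        def lookup(tokens_a, tokens_b):
            key = (tuple(tokens_a), tuple(tokens_b))
            if key not in cache:
                cache[key] = distance(tokens_a, tokens_b)
            return cache[key]
        return lookup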

LinguList commented 2 years ago

If there are issues with the predict_wordlist, let me know, or adjust the output accordingly. The idea is: the wordlist is modified, you use one more command to save it with wordlist.output(), and that's it. You can also check the predictions there with the evaluate_vs command you wrote, so I think there should be a lot of options.

What I meant yesterday was that we should discuss what I propose, not that I impose it, and then merge. And I meant: it was not really ready, as parts were missing, and I had not accounted for some things.

But I have to admit that I am happy with this now myself, so I'd also merge it.

LinguList commented 2 years ago

What I find especially important: I think I could write the SVM class now. I was not sure before, so this exercise helped me to break the functions apart a bit, etc., so that I now know better how to proceed. And I was afraid of the classifier-method, as I had never thought it through to the end.

I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.
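
To make the classifier-method concrete, here is a hedged sketch of one way it could work (an assumed design, not the planned SVM class): the pairwise distances (e.g., SCA and NED) serve as features, and an sklearn SVM learns to separate borrowed from non-borrowed pairs.

    from sklearn.svm import SVC

    # Assumed design sketch: one feature vector of pairwise distances per
    # donor/target pair; labels mark attested borrowings.
    def train_classifier(pairs, labels, distance_functions):
        features = [[dist(donor, target) for dist in distance_functions]
                    for donor, target in pairs]
        svm = SVC(kernel="linear")
        svm.fit(features, labels)
        return svm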

fractaldragonflies commented 2 years ago

When you say the three methods, you are referring to: 1) this pairwise related method with SCA, NED, other? functions, 2) the classifier method - which combines results from pairwise and maybe other methods, 3) Cluster or LingRex methods? (From a whole 3 weeks ago! We've explored lots in this time.)

Sure, I'm game to bring the Cluster/LingRex method(s) back in within the constraints and protocol of the pairwise methods (if that is what the 3rd method is). And you're welcome to start with the classifier method then.

> I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.

fractaldragonflies commented 2 years ago

Checks out using train and test files from the 'splits' directory. 'predict' requires an instantiated wordlist, since it invokes the 'simple_donor_search' function from lexibank_sabor rather than the class constructor, which converts the file_path into an internal wordlist.

> If there are issues with the predict_wordlist, let me know, or adjust the output accordingly. The idea is: the wordlist is modified, you use one more command to save it with wordlist.output(), and that's it. You can also check the predictions there with the evaluate_vs command you wrote, so I think there should be a lot of options.

    from lingpy import Wordlist
    # Assuming these names are exported by the package as described above.
    from lexibank_sabor import SimpleDonorSearch, sca_distance

    # Train on one CV fold, searching the threshold grid.
    bor = SimpleDonorSearch(
        'splits/CV10-fold-00-train.tsv', donors="Spanish",
        func=sca_distance, family="language_family")
    bor.train(verbose=True, thresholds=[i*0.02 for i in range(1, 50)])
    print("best threshold is {0:.2f}".format(bor.best_t))
    # Predict on the training wordlist itself (bor is a wordlist) and save.
    bor.predict_on_wordlist(bor)
    file_path = 'store/test-new-predict-CV10-fold-00-train'
    bor.output('tsv', filename=file_path, prettify=False, ignore="all")

    # Create the wordlist explicitly: predict_on_wordlist does not convert
    # an infile path into a wordlist.
    wl = Wordlist('splits/CV10-fold-00-test.tsv')
    bor.predict_on_wordlist(wl)
    file_path = 'store/test-new-predict-CV10-fold-00-test'
    wl.output('tsv', filename=file_path, prettify=False, ignore="all")
LinguList commented 2 years ago

Yes, this creation was intended. I was not sure, but thought in the end that it is more explicit.

LinguList commented 2 years ago

For the methods, I made an issue on the names (#15), but I need to memorize them.

fractaldragonflies commented 2 years ago

OK, then using the naming you suggest, which I like, btw, I would move forward with cognate-based and/or multi-threshold. They are closely related (cluster methods), so a solution for cognate-based could well encompass multi-threshold as well, but I would not get too general until I can show you a first attempt with cognate-based.

A question on the cluster-based methods: the product of these methods is cognate or borrowing indications. But are intermediate results available as well -- such as the distance from the target entry to the nearest cluster? With centroid-based clustering we'd have such measures, but here I don't know.

There must be, since we use a threshold to decide cluster membership. The question is how to expose this distance. It may not be as useful as for pairwise methods, since the distance is from target to cluster and not target to donor, but maybe still useful... and maybe the Euclidean distance between donor and target, or the sum of the target-to-cluster and donor-to-cluster distances, would be useful... or both, if used by the SVM!
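
For illustration, one way such a distance could be exposed (a sketch only; whether LingPy's clustering gives access to this is exactly the open question here): average the pairwise distances from the target entry to all members of a cluster.

    # Illustrative sketch: average-linkage distance from a target form to
    # a cluster of forms; not something LingPy exposes directly.
    def target_to_cluster_distance(target_tokens, cluster_forms, distance):
        dists = [distance(target_tokens, tokens) for tokens in cluster_forms]
        return sum(dists) / len(dists)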

fractaldragonflies commented 2 years ago

Here were my previous thoughts on the classifier-method for your consideration:

> What I find especially important: I think I could write the SVM class now. I was not sure before, so this exercise helped me to break the functions apart a bit, etc., so that I now know better how to proceed. And I was afraid of the classifier-method, as I had never thought it through to the end.
>
> I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.

LinguList commented 1 year ago

Intermediate results would have to be calculated or retrieved from within LingPy's class, which is tedious and not worth the pain, I'd argue.

For the "cognate-based" clustering, we can actually test two versions: LexStat with sound correspondences and SCA (method="lexstat" vs. method="sca"). My guess is that LexStat underperforms, since we usually don't have good sound correspondences for borrowings.
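
For reference, the two versions could be compared along these lines with LingPy's standard API (the thresholds here are illustrative defaults, not tuned values):

    from lingpy import LexStat

    lex = LexStat('splits/CV10-fold-00-train.tsv')
    # The sound-correspondence scorer is only needed for method="lexstat".
    lex.get_scorer(runs=1000)
    lex.cluster(method="lexstat", threshold=0.55, ref="lexstatid")
    lex.cluster(method="sca", threshold=0.45, ref="scaid")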