LinguList closed this issue 2 years ago.
Your idea to compute distances only once is very good, I found a workaround that "fakes" a function but pulls out the previously stored values. So I consider this as accounted for now.
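The "fake function" workaround for computing distances only once can be sketched as a memoized wrapper. This is just an illustration with a toy distance function, not the actual implementation in lexibank_sabor:

```python
from functools import lru_cache

# Toy stand-in for an expensive pairwise distance; the real method
# would call something like an SCA alignment here.
def expensive_distance(word_a, word_b):
    return abs(len(word_a) - len(word_b)) / max(len(word_a), len(word_b), 1)

@lru_cache(maxsize=None)
def cached_distance(word_a, word_b):
    # Behaves like a normal function, but after the first call with a
    # given pair it returns the previously stored value from the cache.
    return expensive_distance(word_a, word_b)
```

During threshold training the same pairs get scored many times, so every call after the first becomes a cache lookup.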
If there are issues with `predict_wordlist`, let me know, or adjust the output accordingly. The idea is: the wordlist is modified, you use one more command to save it with `wordlist.output()`, and that's it. You can also check predictions there with the `evaluate_vs` command you wrote, so I think there should be a lot of options.
What I meant yesterday: we should discuss what I propose, not that I impose it, and then merge. And I meant it was not really ready, as parts were missing and I had not accounted for some things.
But I have to admit that I am happy with this now myself, so I'd also merge it.
What I find specifically important: I think I could write the SVM class now. I was not sure before, so this exercise helped me to break functions a bit apart, etc., so that I now know better how to proceed. And I was afraid of the classifier-method, as I never thought it through to the end.
I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.
When you say the three methods, you are referring to: 1) this pairwise related method with SCA, NED, other? functions, 2) the classifier method - which combines results from pairwise and maybe other methods, 3) Cluster or LingRex methods? (From a whole 3 weeks ago! We've explored lots in this time.)
Sure, I'm game to bring the Cluster/LingRex method(s) back in within the constraints and protocol of the pairwise related methods (if that is what the 3rd method is). And you're welcome to start with the classifier method then.
> I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.
Checks out using train and test files from the 'splits' directory. 'predict' requires an instantiated wordlist, since it invokes the 'simple_donor_search' function from lexibank_sabor rather than the class constructor, which converts the file path into an internal wordlist.
> If there are issues with the predict_wordlist, let me know, or adjust the output accordingly. The idea is: the wordlist is modified, you use one more command to save it with wordlist.output(), and that's it. You can also check predictions there with the evaluate_vs command you wrote, so I think there should be a lot of options.
```python
from lingpy import Wordlist
from lexibank_sabor import SimpleDonorSearch, sca_distance  # assumed import path

# Train on the training fold: sweep thresholds and keep the best one.
bor = SimpleDonorSearch(
    'splits/CV10-fold-00-train.tsv', donors="Spanish",
    func=sca_distance, family="language_family")
bor.train(verbose=True, thresholds=[i * 0.02 for i in range(1, 50)])
print("best threshold is {0:.2f}".format(bor.best_t))

# Predict on the training wordlist itself and store the result.
bor.predict_on_wordlist(bor)
file_path = 'store/test-new-predict-CV10-fold-00-train'
bor.output('tsv', filename=file_path, prettify=False, ignore="all")

# Need to create a wordlist, because predict does not convert the
# infile to a wordlist.
wl = Wordlist('splits/CV10-fold-00-test.tsv')
bor.predict_on_wordlist(wl)
file_path = 'store/test-new-predict-CV10-fold-00-test'
wl.output('tsv', filename=file_path, prettify=False, ignore="all")
```
Yes, this creation was intended. I was not sure, but thought in the end, it is more explicit.
For the methods, I made an issue on the names (#15), but I still need to memorize them.
OK, then using the naming you suggest (which I like, btw), I would move forward with cognate-based and/or multi-threshold. They are closely related (cluster methods), so a solution for cognate-based could well encompass multi-threshold as well, but I would not get too general until I can show you a first attempt with cognate-based.
Question on the cluster-based methods: the product of these methods is cognate or borrowing indications. But are there intermediate results available as well, such as the distance from the target entry to the nearest cluster? With centroid-based clustering we'd have such measures, but here I don't know.
There must be, since we use a threshold to decide cluster membership. The question is how to expose this distance. It may not be as useful as for the pairwise methods, since the distance is from target to cluster and not target to donor, but maybe still useful... and maybe the Euclidean distance between donor and target, or the sum of target-to-cluster and donor-to-cluster distances, would be useful... or both, if used by the SVM!
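To make that concrete, here is a self-contained sketch of how a threshold decides cluster membership and how a target-to-cluster distance could be exposed. The normalized edit distance and the single-linkage flat clustering below are hypothetical stand-ins for LingPy's actual routines:

```python
from itertools import combinations

def edit_distance(a, b):
    # Plain Levenshtein distance (dynamic programming).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ned(a, b):
    # Normalized edit distance in [0, 1].
    return edit_distance(a, b) / max(len(a), len(b), 1)

def flat_cluster(words, threshold):
    # Single-linkage flat clustering: merge groups that contain at least
    # one pair of words whose distance is at or below the threshold.
    clusters = [[w] for w in words]
    merged = True
    while merged:
        merged = False
        for (i, c1), (j, c2) in combinations(enumerate(clusters), 2):
            if any(ned(a, b) <= threshold for a in c1 for b in c2):
                clusters[i] = c1 + c2
                del clusters[j]
                merged = True
                break
    return clusters

def target_to_cluster(target, cluster):
    # Average distance from the target to the cluster members:
    # the kind of intermediate score discussed above.
    return sum(ned(target, w) for w in cluster) / len(cluster)
```

Target-to-cluster and donor-to-cluster averages computed this way could then go into a feature vector for the SVM.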
> When you say the three methods, you are referring to: 1) this pairwise related method with SCA, NED, other? functions, 2) the classifier method - which combines results from pairwise and maybe other methods, 3) Cluster or LingRex methods? (From a whole 3 weeks ago! We've explored lots in this time.)

> Sure, I'm game to bring the Cluster/LingRex method(s) back in within the constraints and protocol of the pairwise related methods (if that is what the 3rd method is). And you're welcome to start with the classifier method then.

> I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.
Here were my previous thoughts on the classifier-method for your consideration:
> What I find specifically important: I think I could write the SVM class now. I was not sure before, so this exercise helped me to break functions a bit apart, etc., so that I now know better how to proceed. And I was afraid of the classifier-method, as I never thought it through to the end.

> I'd say: we proceed by doing the three methods now in this style, with training. We could split work on them. For the classifier-method I'd like to be especially involved, and I could start with that one, if you want.
Intermediate results would have to be calculated or retrieved from within LingPy's classes, which is tedious and not worth the pain, I'd argue.
For the "cognate-based" clustering, we can actually test two versions: LexStat with sound correspondences and SCA (method="lexstat" vs. method="sca"). My guess is that LexStat underperforms, since we usually don't have good sound correspondences in borrowings.
@fractaldragonflies, I did not mean to merge right away but wanted to show the major idea. Now I found time to expand it and illustrate what I meant by the predict function.
I proceed now as follows:

- the methods go into `lexibank_sabor`, so we can afterwards put them in an external library (e.g. lingrex) and test them
- the function is called `simple_donor_search` (as the "pairwise" is not descriptive for what the method does).