lingpy / pybor

A Python library for borrowing detection based on lexical language models
Apache License 2.0

implement more of the existing code in the new framework #2

Closed: LinguList closed this issue 4 years ago

LinguList commented 4 years ago

@fractaldragonflies, it would help if you could list here some of the things that you find important, so we can see how to transfer them.

fractaldragonflies commented 4 years ago

I had forgotten this issue... and had already responded in the add-native-basis PR with the names of various analysis methods. OK, a bit redundant, but it gives more detail as to why this matters.

fractaldragonflies commented 4 years ago

Just to document: I ported all of the Markov model code into a cleaner, more accessible format, based in part on work Tiago had started. The next step is to structure it so that it directly exposes the predict function and keeps evaluation separate from the training and prediction modules.
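Roughly, the split I have in mind looks like this (the class and function names below are only illustrative, not the actual pybor code):

```python
# Illustrative sketch only: a model class that exposes prediction directly,
# with evaluation kept outside the model.  Names are hypothetical.

class MarkovWordModel:
    """Trained on segmented words; exposes predict() directly."""

    def __init__(self, order=3):
        self.order = order

    def fit(self, words):
        # estimate n-gram statistics from the training words
        ...

    def predict(self, word):
        # return a label for a single segmented word (e.g. True = native)
        ...

    def predict_data(self, words):
        # vectorized convenience wrapper around predict()
        return [self.predict(word) for word in words]


def evaluate_model(model, words, gold):
    # evaluation lives apart from training/prediction: it only compares labels
    predicted = model.predict_data(words)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```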

fractaldragonflies commented 4 years ago

We've been using the 'purpose' issue to discuss how we will adapt and structure code. Actually adapting the code seems like it could go there or here.

I adapted the dual-model detect-borrowing module to the new structure, at least in part, and have pushed it to my branch nltk-john-adapt on the shared repository. But I have not opened a pull request since it is still a work in progress. Please take a look, especially at nltk.py, to see if it meets what we are discussing.

Accomplished:

Here is the output of the test script. The last few lines demonstrate individual word prediction as well as data-vector prediction. Note that I am using True for native words, so the correspondence of [0.0, 1.0, 0.0] with [True, True, True] isn't much different from actual English performance!

```
runfile('/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/tests/test_detect_dual.py')
Reloaded modules: pybor, mobor, mobor.data, pybor.evaluate, pybor.nltk, lexibank_wold
2020-05-05 20:07:31,098 [INFO] loaded wordlist 1814 concepts and 41 languages

Evaluate train dataset.

Quality metrics:
Binary_prediction(Acc=0.8143564356435643, Maj_acc=0.5717821782178217, Prec=0.8534743202416919, Recall=0.8152958152958153, F1=0.8339483394833948)

Evaluate test dataset.

Quality metrics:
Binary_prediction(Acc=0.694078947368421, Maj_acc=0.5921052631578947, Prec=0.757396449704142, Recall=0.7111111111111111, F1=0.7335243553008596)
word= ['w', ' ', 'a', ' ', 't', ' ', 'e', ' ', 'r', ' ', 'f', ' ', 'a', ' ', 'l', ' ', 'l'] True 0.0
word= [['w', ' ', 'a', ' ', 't', ' ', 'e', ' ', 'r', ' ', 'f', ' ', 'a', ' ', 'l', ' ', 'l'], ['f', ' ', 'o', ' ', 'r', ' ', 'e', ' ', 's', ' ', 't'], ['w', ' ', 'o', ' ', 'o', ' ', 'd']] [ True  True  True] [0.0, 1.0, 0.0]
```
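For reference, the Binary_prediction tuple reports accuracy, a majority-class baseline (Maj_acc), precision, recall, and F1. Here is a plain-Python illustration of how such numbers relate to boolean gold labels and numeric predictions like the ones above; this is not the actual pybor.evaluate code, and treating native as the positive class is my assumption:

```python
# Illustration only: compute Acc, Maj_acc, Prec, Recall, F1 from boolean
# gold labels (True = native) and numeric predictions (0.0 = native).

def binary_metrics(gold, predicted):
    pred = [p == 0.0 for p in predicted]  # convert numeric predictions to booleans
    tp = sum(g and p for g, p in zip(gold, pred))        # native predicted native
    fp = sum((not g) and p for g, p in zip(gold, pred))  # borrowed predicted native
    fn = sum(g and (not p) for g, p in zip(gold, pred))  # native predicted borrowed
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))

    acc = (tp + tn) / len(gold)
    maj_acc = max(sum(gold), len(gold) - sum(gold)) / len(gold)  # always-guess-majority baseline
    prec = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * recall / (prec + recall) if prec + recall else 0.0
    return acc, maj_acc, prec, recall, f1


print(binary_metrics([True, True, True], [0.0, 1.0, 0.0]))
# roughly (0.667, 1.0, 1.0, 0.667, 0.8) for the three-word example above
```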
LinguList commented 4 years ago

Okay, I had a look at the code, but can you tell me quickly how you make the prediction? Either which part of the code, or what it does conceptually? Do you go for a threshold on the entropies? Where do I find that code? I think we can still reduce it to only one class, and we do not need the splitter, as we work with the development data only for now. That would again drastically reduce the code and make it easier to write the test for the data...

fractaldragonflies commented 4 years ago

The nltk.py module, DualMarkov class, predict_data function.

This is the method of competing Markov models for predicting entropy: the model with the lower entropy prediction wins and becomes the owner/category of the prediction.

This is also why I created a new class for this approach. Entropy calculation is kept separate and used as input to the decision process. In the native-only approach, I use only one Markov model.
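A minimal sketch of that competing-models idea, for discussion; the names below are illustrative rather than the actual pybor/nltk.py code, and the nltk.lm calls, smoothing choice, and n-gram order are assumptions:

```python
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # n-gram order is an assumption


def train_model(words, order=ORDER):
    # words: list of segment lists, e.g. [['w', 'a', 't', 'e', 'r'], ...]
    train, vocab = padded_everygram_pipeline(order, words)
    model = Laplace(order)  # add-one smoothing; the real smoothing may differ
    model.fit(train, vocab)
    return model


def word_entropy(model, word, order=ORDER):
    padded = list(pad_both_ends(word, n=order))
    return model.entropy(ngrams(padded, order))


class DualMarkovSketch:
    """Two competing models: the one with lower entropy owns the word."""

    def __init__(self, native_words, loan_words):
        self.native = train_model(native_words)
        self.loan = train_model(loan_words)

    def predict(self, word):
        # True = native, matching the convention in the test output above
        return word_entropy(self.native, word) < word_entropy(self.loan, word)

    def predict_data(self, words):
        return [self.predict(word) for word in words]
```

For the native-only approach with a single model, the decision would presumably compare the word's entropy against some threshold instead of against a second model.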

Splitter was added by @tresoldi in his initial conversion work, with the intent of temporary compatibility. Yes, it needs to go away... also to resolve the issue of empty spaces between segments for formchars and sca.

tresoldi commented 4 years ago

There is code that could be better integrated, but for most purposes this was already done. Any new code integration can always get new issues, more in line with the current organization. I'm closing this.