I had forgotten this issue... and had responded in the add-native-basis PR with the names of various analysis methods. OK, a bit redundant, but here is more detail on why this matters:
- [x] Show differences in entropy distributions of native versus loan words (see the sketch after this list). This is the minimal requirement for making any kind of claim that entropies might work.
- [ ] Show validation results for discriminating between native and loan words.
- [ ] Do all of the above for the recurrent neural network (RNN) model as well.
- [ ] Maybe not essential, but available: study the training bias.
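For the first item, a minimal sketch of what such a comparison could look like, assuming a trained model that exposes a per-word `entropy()` method (the names here are illustrative, not the actual pybor API):

```python
import statistics

def compare_entropy_distributions(model, native_words, loan_words):
    """Summarize word entropies under one model, split by origin."""
    native_h = [model.entropy(word) for word in native_words]
    loan_h = [model.entropy(word) for word in loan_words]
    for label, values in (("native", native_h), ("loan", loan_h)):
        print(f"{label}: n={len(values)} "
              f"mean={statistics.mean(values):.3f} "
              f"stdev={statistics.stdev(values):.3f}")
```

If loan words show systematically higher entropy under a native-trained model, that supports the claim that entropies can discriminate.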
Just to document: I ported all of the Markov model code into a cleaner, more accessible format, based in part on work Tiago had started. The next step is structuring it so that it directly exposes the predict function, keeping evaluation separate from the training and prediction parts of the modules.
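To make the intended separation concrete, here is a rough interface sketch (all names hypothetical, not the final API): the model owns training and prediction, while evaluation lives in a separate function that only consumes predictions.

```python
class MarkovModel:
    """Owns training and prediction only; evaluation lives elsewhere."""

    def train(self, words):
        """Fit the character n-gram model on segmented words."""
        ...

    def predict(self, word):
        """Return a label for a single word (e.g., True for native)."""
        ...

def evaluate(model, words, gold_labels):
    """Evaluation kept apart from the model: accuracy of predictions."""
    predictions = [model.predict(word) for word in words]
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```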
We've been using the 'purpose' issue to discuss how we will adapt and structure the code. The actual adaptation of code seems like it could go there or here.
I adapted the dual-model borrowing-detection module to the new structure... well, in part at least. I have pushed it to my branch nltk-john-adapt on the shared repository, but have not opened a pull request since it is still a work in progress. Please take a look, especially at nltk.py, to see whether it meets what we are discussing.
Accomplished:
Here is the output of the test script. The last few lines demonstrate individual word prediction as well as data vector prediction. Note, I am using True for native words, so the correspondence of the predictions [0.0, 1.0, 0.0] with the truth [True, True, True] (two of three correct) isn't much different from actual English performance!
```
runfile('/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/tests/test_detect_dual.py')
Reloaded modules: pybor, mobor, mobor.data, pybor.evaluate, pybor.nltk, lexibank_wold
2020-05-05 20:07:31,098 [INFO] loaded wordlist 1814 concepts and 41 languages
Evaluate train dataset.
Quality metrics:
Binary_prediction(Acc=0.8143564356435643, Maj_acc=0.5717821782178217, Prec=0.8534743202416919, Recall=0.8152958152958153, F1=0.8339483394833948)
Evaluate test dataset.
Quality metrics:
Binary_prediction(Acc=0.694078947368421, Maj_acc=0.5921052631578947, Prec=0.757396449704142, Recall=0.7111111111111111, F1=0.7335243553008596)
word= ['w', ' ', 'a', ' ', 't', ' ', 'e', ' ', 'r', ' ', 'f', ' ', 'a', ' ', 'l', ' ', 'l'] True 0.0
word= [['w', ' ', 'a', ' ', 't', ' ', 'e', ' ', 'r', ' ', 'f', ' ', 'a', ' ', 'l', ' ', 'l'], ['f', ' ', 'o', ' ', 'r', ' ', 'e', ' ', 's', ' ', 't'], ['w', ' ', 'o', ' ', 'o', ' ', 'd']] [ True True True] [0.0, 1.0, 0.0]
```
Okay, I had a look at the code, but can you tell me quickly how you make the prediction? Either point me to the part of the code, or describe what it does conceptually. Do you go for a threshold on the entropies? Where do I find that code? I think we can still reduce it to only one class, and we do not need the splitter, as we work with the development data only for now. These changes could again drastically reduce the code, so that it is easier to write the test for the data...
The nltk.py module, DualMarkov class, predict_data function.
This is the method of competing Markov models: each model predicts an entropy for the word, and the model with the lower entropy wins, becoming the owner/category of the prediction.
This is also why I created a new class for this approach. Entropy calculation is kept separate and used as input to the decision process. In the native-only approach, I use only one Markov model.
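Conceptually, the decision is a direct comparison of the two models' entropies rather than a fixed threshold. A minimal sketch of that rule, assuming each fitted Markov model exposes an `entropy(word)` method (illustrative names, not the exact pybor signatures):

```python
def predict_dual(native_model, loan_model, word):
    """Competing models: the one that finds the word less surprising
    (lower entropy) claims it. Returns True for native, False for loan."""
    return native_model.entropy(word) < loan_model.entropy(word)

def predict_data(native_model, loan_model, words):
    """The same rule applied over a list of segmented words."""
    return [predict_dual(native_model, loan_model, w) for w in words]
```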
The splitter was added by @tresoldi in his initial conversion work, intended as temporary compatibility. Yes, it needs to go away... which would also resolve the issue of empty spaces between segments for formchars and sca.
There is code that could be better integrated, but for most purposes this was already done. Any remaining integration work can always go into new issues, more in line with the current organization. I'm closing this.
@fractaldragonflies, it would help if you list here some of the things that you find important, so we can see how to transfer them.