implement more of the existing code in the new framework

LinguList commented 4 years ago

@fractaldragonflies, it would help if you list here some of the things that you find important, so we can see how to transfer them.

fractaldragonflies commented 4 years ago

I had forgotten this issue... and had responded in the add-native-basis PR with the names of various analysis methods. OK, a bit redundant, but more detail as to why this matters:

[x] Show differences in entropy distributions of native versus loan words. This is minimal requirement to make any kind of claim that entropies might work.
Show validation results for discriminating between native and loan words.
- Based on native word entropy model (Markov)
- Learned from training data and used to predict native and loan on validation data.
- Do k-fold replications of this training versus validation to show 'average' discrimination performance.
- Based on native and loan word entropy models (Markov)
- Native and loan models learned from corresponding native and loan words of training set.
- Predict to words of validation set using competing entropy models.
- Do k-fold replications of training and validation to show 'average' discrimination performance.
All of the above do for the recurren neural network (RNN) model as well.
- Possible exception of k-fold replication for RNNs.
Maybe not essential, but available, is the study of training bias.
- We look to see how much bias there is between entropy results from training versus validation, even when there is no difference expected.

fractaldragonflies commented 4 years ago

Just to document, I ported all of the Markov model code into a cleaner, more accessible format. Based in part on work Tiago had started. Next step is then structuring it so that it exposes directly the predict function, and keep evaluation separate from the training and prediction of the modules.

fractaldragonflies commented 4 years ago

We've been using the 'purpose' issue to discuss how we will adapt and structure code. Actually adapting code seems that it could go there or here.

I adapted the dual model detect borrowing module to the new structure... well in part at least. And have pushed it to my branch nltk-john-adapt on the share repository. But have not done a pull since it is still a work in progress. Please take a look at especially nltk.py to see if it meets what we are discussing.

Accomplished:

Reduced markov class (MarkovWord) to about 1/3 original size,
Created DualMarkov class to manage the competing models and expose predict, predict_data,
Moved low level evaluate function to evaluate,
Retained top level function because I'm not sure how we want to manage this user case, and for short term compatibility with prior work,
Wrote simple test script, still using the data function from mobor, to demonstrate the module.

Here is the output of the test script. The last few lines demonstrate individual word prediction as well as data vector prediction. Note, I am using True for native words, and so the correspondence of [0, 1, 0] with [True, True, True] isn't much different than actual English performance!

runfile('/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/tests/test_detect_dual.py')
Reloaded modules: pybor, mobor, mobor.data, pybor.evaluate, pybor.nltk, lexibank_wold
2020-05-05 20:07:31,098 [INFO] loaded wordlist 1814 concepts and 41 languages

Evaluate train dataset.

Quality metrics:
Binary_prediction(Acc=0.8143564356435643, Maj_acc=0.5717821782178217, Prec=0.8534743202416919, Recall=0.8152958152958153, F1=0.8339483394833948)

Evaluate test dataset.

Quality metrics:
Binary_prediction(Acc=0.694078947368421, Maj_acc=0.5921052631578947, Prec=0.757396449704142, Recall=0.7111111111111111, F1=0.7335243553008596)
word= ['w', ' ', 'a', ' ', 't', ' ', 'e', ' ', 'r', ' ', 'f', ' ', 'a', ' ', 'l', ' ', 'l'] True 0.0
word= [['w', ' ', 'a', ' ', 't', ' ', 'e', ' ', 'r', ' ', 'f', ' ', 'a', ' ', 'l', ' ', 'l'], ['f', ' ', 'o', ' ', 'r', ' ', 'e', ' ', 's', ' ', 't'], ['w', ' ', 'o', ' ', 'o', ' ', 'd']] [ True  True  True] [0.0, 1.0, 0.0]

LinguList commented 4 years ago

Okay, I had a look at the code, but can you tell me quickly how you make the prediction? Either, which part of the code, or what it does conceptually? So do you go for a threshold in the entropies? Where do I find that code? I think we can still reduce it to only one class, and we do not need the splitter, as we work with the development data only for now, so these can again drastically reduce the code, so that it is easier to write the test for the data...

fractaldragonflies commented 4 years ago

The nltk.py module, DualMarkov class, predict_data function.

This is the method of competing Markov models for predicting entropy. Where the model with the lower entropy prediction wins and becomes the owner/category of the prediction.

This is also why I created a new class for this approach. Entropy calculation is apart and used as input to the decision process. In the native only approach, I use only one Markov model.

Splitter was added by @tresoldi in his initial conversion work, in intent for temporary compatibility. Yes it needs to go away.... also to resolve the issue of empty spaces between segments for formchars and sca.

tresoldi commented 4 years ago

There is code to be better integrated, but for most purposes this was already done. New code integration can always have new issues, more in line with the current organization. I'm closing this.

lingpy / pybor

implement more of the existing code in the new framework #2