lingpy / pybor

A Python library for borrowing detection based on lexical language models
Apache License 2.0

Lingpy ngram #10

Closed: tresoldi closed this 4 years ago

tresoldi commented 4 years ago

This PR adds a sketch for using the ngram module currently in lingpy. With the defaults I set (Witten-Bell smoothing, order 3, using length) it gives the following:

$ python examples/ngram_example.py
|           |   True |   False |   Total |
|:----------|-------:|--------:|--------:|
| Positives |  37.00 |   38.00 |      75 |
| Negatives | 455.00 |   71.00 |     526 |
| Total     |   0.82 |    0.18 |     601 |
Precision: 0.49
Recall:    0.34
F-Score:   0.40
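
To make the table's semantics explicit, here is a minimal plain-Python sketch (the variable names are mine) showing how the reported scores follow from the counts above; note that the 0.82 in the Total row is the overall accuracy:

    # counts read off the table: "True" = classified correctly
    tp, fp = 37, 38    # words predicted as borrowings
    tn, fn = 455, 71   # words predicted as non-borrowings

    precision = tp / (tp + fp)                              # 37/75  = 0.49
    recall = tp / (tp + fn)                                 # 37/108 = 0.34
    fscore = 2 * precision * recall / (precision + recall)  # 0.40
    accuracy = (tp + tn) / (tp + fp + tn + fn)              # 492/601 = 0.82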

While I am not using it in the code yet, I also set up a second estimator (b_estimator) that shows how we could combine multiple estimators, as sketched below.
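
As a purely hypothetical sketch of what combining could look like (linear interpolation of two per-word scores; nothing here is the actual pybor or lingpy interface):

    # hypothetical: interpolate the scores of two estimators;
    # score_a and score_b stand for the estimators' per-word outputs
    def combine_scores(score_a, score_b, weight=0.5):
        return weight * score_a + (1 - weight) * score_b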

fractaldragonflies commented 4 years ago

I would like to see accuracy as another measure, both because it is still in common use and because we have a baseline to reference: the proportion of loan words in the entire table.

Note also that in my existing code committed to evaluate.py on my branch, I already calculate quality measures. In my earlier code I report them as follows (not pretty, but the calculation is largely done by scikit-learn); this could easily be improved on:

precision, recall, F1 = (0.9637305699481865, 0.8985507246376812, 0.93)
n = 235  accuracy = 0.8808510638297873
confusion matrix: tn, fp, fn, tp [ 21   7  21 186]
Predict majority: accuracy= 0.8808510638297873
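
Spelled out in plain Python from the confusion matrix above, accuracy and the majority baseline happen to coincide on this data:

    # confusion matrix as reported: tn, fp, fn, tp
    tn, fp, fn, tp = 21, 7, 21, 186
    n = tn + fp + fn + tp                 # 235

    accuracy = (tp + tn) / n              # 207/235 = 0.8808...
    majority = max(tp + fn, tn + fp) / n  # always predict "loan": also 207/235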

We can discuss this with our work in progress. I'll approve the pull request as well, and maybe we can add accuracy in the next iteration.

LinguList commented 4 years ago

How is accuracy computed?

tresoldi commented 4 years ago

I also got the division by zero, because when prototyping I first returned zero all the time (a classifier that never predicts a positive makes tp + fp zero, so precision is undefined). :wink:

I'll fix the conflict with evaluate.py and merge, and later take care of the tests.

LinguList commented 4 years ago

Could you add it to the code in evaluate.py (without third-party dependencies) and add tests? Let us know if things are unclear there.

Essentially, you only need to check the script tests/test_evaluate.py.

To invoke the tests, type the following (make sure to pip-install pytest and pytest-cov):

$ pytest test_evaluate.py --cov=pybor.evaluate

This gives you a coverage summary. To see all the code blocks that were not called by the script, and thus give you bad coverage (or good coverage, if there are none), run:

$ pytest test_evaluate.py --cov=pybor.evaluate --cov-report=html

This creates an HTML report, which shows a nicely annotated view of your code.
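
For orientation, a self-contained sketch of the kind of test that could live in tests/test_evaluate.py; the accuracy function here is a stand-in, not pybor's actual API, and a real test would import from pybor.evaluate instead:

    import pytest

    # stand-in for the function under test; a real test would do
    # "from pybor.evaluate import ..." instead of defining it locally
    def accuracy(ground, forecast):
        return sum(g == f for g, f in zip(ground, forecast)) / len(ground)

    def test_accuracy():
        # 2 of 4 predictions match the ground truth
        assert accuracy([1, 0, 1, 1], [1, 1, 1, 0]) == pytest.approx(0.5)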

fractaldragonflies commented 4 years ago

Within the evaluate module it is computed as:

from sklearn import metrics  # scikit-learn's metrics module

acc = metrics.accuracy_score(ground, forecast)

It's just the number of correct predictions divided by the total number of cases.

And to get the majority reference to compare accuracy against, I do:

    # size of the majority class (ground holds binary 0/1 labels)
    maxpredict = max(sum(ground), len(ground) - sum(ground))
    # accuracy of always predicting the majority class
    maj_acc = maxpredict / len(ground)
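
For the version without third parties requested above, a sketch of the same two measures in plain Python (assuming binary 0/1 labels, as in the snippets above):

    # dependency-free equivalent of metrics.accuracy_score
    def accuracy_score(ground, forecast):
        return sum(g == f for g, f in zip(ground, forecast)) / len(ground)

    # baseline: accuracy of always predicting the majority class
    def majority_accuracy(ground):
        return max(sum(ground), len(ground) - sum(ground)) / len(ground)
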
tresoldi commented 4 years ago

@fractaldragonflies, would you like to add it yourself? You could then experiment with making a new branch and a PR for just a small fix.

You'd need to update your master (git checkout master, git pull origin master), create the new branch (git checkout -b branchname), work normally there, and then push and open the PR. Then you'd assign me and/or @LinguList, we approve, and you merge. Later you'd pull again from the origin into your master and, once done, merge master into your branch (it is better to solve conflicts, if any, as early as possible). The full sequence is sketched below.
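
Step by step, with add-accuracy as a stand-in branch name:

$ git checkout master
$ git pull origin master
$ git checkout -b add-accuracy
(work and commit as usual, then publish the branch)
$ git push origin add-accuracy
(open the PR on GitHub, assign reviewers; merge after approval)
$ git checkout master
$ git pull origin master
$ git checkout add-accuracy
$ git merge master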

fractaldragonflies commented 4 years ago

OK. I still have my nltk-john-adapt branch open. No problem opening another branch to add accuracy to the evaluate module.

fractaldragonflies commented 4 years ago

I created the add-accuracy branch for this change, and I noticed that cli.py has disappeared! Was this intentional? In my nltk-john-adapt branch (for adapting to our new design) I still have it, although moved to src/pybor.

tresoldi commented 4 years ago

Sorry, I missed the comment. @LinguList deleted it, and I removed the reference to it from setup.py. We might later add something equivalent, but for the time being the idea is to put scripts in examples/.

LinguList commented 4 years ago

Yes, I decided that we use examples to USE the library, for plotting and the like; we won't need the cli any more, since this library is core. It was overkill, but I only realized this when I looked at the code in more detail and came up with this much simpler schema. It is also a learning process for me: I had to spend quite some time figuring out how we can all code together and produce code whose core is testable, while not insisting that all our examples are really tested.

LinguList commented 4 years ago

BTW: cli won't work without the commands folder, which is also not there. Let's keep it very small for now, so we can see what the core functionality of the library should be. The next steps, like plotting, will be done in the form of examples, unless we realize that they are useful enough to be made generic.