lingpy / pybor

A Python library for borrowing detection based on lexical language models
Apache License 2.0

Experiment john port #7

Closed fractaldragonflies closed 4 years ago

fractaldragonflies commented 4 years ago

Tried my experiment for the day: I ported everything related to the Markov model (still using NLTK) to the library in this branch. There is room for improvement, but moving to the 'predict' focus no longer feels like such a big leap. I leveraged Tiago's earlier work to help with the port.

I got errors when working with the new setup.py file, so I reverted to the previous one, since I wasn't working on pybor for the time being.

I recommend NOT trying the analysis with native basis and n>100: the randomization test takes a long time. I'm not sure this is the right command for the randomization test of whether the distributions are different, since the graphics are fast while the randomization test with native basis is very slow.
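
For reference, a randomization test of this kind can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the code in this branch: the function names `ks_statistic` and `randomization_test` are hypothetical, and the add-one correction on the p-value is one common convention rather than something taken from the branch.

```python
import random
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def randomization_test(a, b, n_permutations=1000, seed=0):
    """Estimate P(KS statistic >= observed) when group labels are shuffled at random."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Split the shuffled pool back into two groups of the original sizes.
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero (an assumed convention).
    return observed, (hits + 1) / (n_permutations + 1)
```

Every permutation recomputes the statistic over the full pooled sample, so the run time grows with the number of permutations times the sample size, which is one plausible reason the test is so much slower than the graphics.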

Tested with English and once with Hup, with all of formchars, tokens, and sca. I got the expected results for tokens, but unexpectedly low average entropies for formchars and for sca. I will investigate whether I broke something or whether I am still getting the expected formchars from the wordlists. Of course we want to drop formchars in favor of something already prepared from the wordlists, and I'll be quite pleased to move on in that respect.

detect native-loan basis:

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor detect_borrowing --basis='native-loan'
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 17:11:48,573 [INFO] loaded wordlist 1814 concepts and 41 languages
native loan basis

* TRAIN RESULTS *
precision, recall, F1 = (0.849624060150376, 0.8071428571428572, 0.8278388278388279)
n = 1212  accuracy = 0.8061056105610561
confusion matrix: tn, fp, fn, tp [412 100 135 565]
Predict majority: accuracy= 0.5775577557755776

* TEST RESULTS *
precision, recall, F1 = (0.7777777777777778, 0.7283236994219653, 0.7522388059701492)
n = 304  accuracy = 0.7269736842105263
confusion matrix: tn, fp, fn, tp [ 95  36  47 126]
Predict majority: accuracy= 0.569078947368421
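
The summary figures in these result blocks follow directly from the confusion-matrix counts. As a cross-check, here is a small sketch that recomputes them; the function name `classification_metrics` is hypothetical and not part of mobor, and it assumes that "Predict majority" means always predicting the more frequent true class.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive precision, recall, F1, accuracy and a majority baseline from counts."""
    n = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tn + tp) / n
    # Assumed baseline: always predict whichever true class is more frequent.
    majority = max(tn + fp, fn + tp) / n
    return precision, recall, f1, accuracy, majority


# Train counts from the native-loan run: tn, fp, fn, tp = 412, 100, 135, 565.
print(classification_metrics(tn=412, fp=100, fn=135, tp=565))
```

With the train counts this reproduces the logged precision, recall, F1, accuracy and majority accuracy up to floating-point rounding; the test counts (95, 36, 47, 126) can be checked the same way.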

detect native basis:

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor detect_borrowing --basis='native'     
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 17:12:16,924 [INFO] loaded wordlist 1814 concepts and 41 languages
native basis
Native avg=1.998, stdev=0.320
fraction 0.995, idx 693.51, ref limit=3.415

* TRAIN RESULTS *
precision, recall, F1 = (0.6274864376130199, 0.994269340974212, 0.7694013303769401)
n = 1212  accuracy = 0.6567656765676567
confusion matrix: tn, fp, fn, tp [102 412   4 694]
Predict majority: accuracy= 0.5759075907590759

* TEST RESULTS *
precision, recall, F1 = (0.6078431372549019, 0.8857142857142857, 0.7209302325581395)
n = 304  accuracy = 0.6052631578947368
confusion matrix: tn, fp, fn, tp [ 29 100  20 155]
Predict majority: accuracy= 0.5756578947368421

2-fold validation detect native-loan basis - lacks decent formatting:

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor detect_borrowing --basis native-loan -k=2
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 19:25:07,524 [INFO] loaded wordlist 1814 concepts and 41 languages
native loan basis
Means {'Acc': 0.6912928759894459, 'Maj_acc': 0.5758575197889182, 'Prec': 0.7510042380689147, 'Recall': 0.6940665154950869, 'F1': 0.721378248872222}
StDevs {'Acc': 0.014511873350923465, 'Maj_acc': 0.005936675461741425, 'Prec': 0.02014004053805052, 'Recall': 0.008881330309901736, 'F1': 0.014090315777837314}

2-fold validation detect native basis - with sca - lacks decent formatting:

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor detect_borrowing --basis native -k=2 --sequence=sca
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 19:50:33,024 [INFO] loaded wordlist 1814 concepts and 41 languages
native basis
2-fold validation.
Means {'Acc': 0.5712401055408971, 'Maj_acc': 0.5758575197889182, 'Prec': 0.5748625396327336, 'Recall': 0.9752163216384722, 'F1': 0.7232571108260231}
StDevs {'Acc': 0.02506596306068598, 'Maj_acc': 0.02308707124010556, 'Prec': 0.023280696441536908, 'Recall': 0.01817574884610934, 'F1': 0.023431630895831168}

2-fold entropy reports of training and validation entropies, basis all and basis native (average entropy is too good to be true for formchars):

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor analyse_entropies --language Hup --basis all --sequence=tokens --kfold=2
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 21:52:11,747 [INFO] loaded wordlist 1814 concepts and 41 languages
Sample=1179, k-fold=10, val=118, model=kni, order=3, smoothing=0.5.
Statistic: Train mean Train stdev    Val mean  Val stdev
Mean            1.761       0.343       2.590      0.893
StdDev         0.0033      0.0026      0.0542     0.0516
StdErr        0.00111     0.00087     0.01806    0.01719
(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor analyse_entropies --language Hup --basis all --sequence=formchars --kfold=2
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 21:53:05,934 [INFO] loaded wordlist 1814 concepts and 41 languages
Sample=1179, k-fold=10, val=118, model=kni, order=3, smoothing=0.5.
Statistic: Train mean Train stdev    Val mean  Val stdev
Mean            1.461       0.248       1.561      0.347
StdDev         0.0022      0.0028      0.0328     0.0689
StdErr        0.00074     0.00095     0.01095    0.02296
(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % mobor analyse_entropies --language Hup --basis native --sequence=formchars --kfold=2
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-04 21:53:42,316 [INFO] loaded wordlist 1814 concepts and 41 languages
Sample=1035, k-fold=10, val=104, model=kni, order=3, smoothing=0.5.
Statistic: Train mean Train stdev    Val mean  Val stdev
Mean            1.417       0.239       1.515      0.337
StdDev         0.0016      0.0028      0.0160     0.0330
StdErr        0.00053     0.00094     0.00533    0.01102

fractaldragonflies commented 4 years ago

OK, so tomorrow, I'll return to what @LinguList has proposed and @tresoldi commented and worked on.

fractaldragonflies commented 4 years ago

I get an error using the setup.py from master when I do a local -e install; it is related to egg_info. I'm not sure what to do here, so I reverted to the previous setup.py. Note that the install is on top of a previous install, but that doesn't cause problems when using the previous setup.py.

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pip install -e .
Obtaining file:///Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection
    ERROR: Command errored out with exit status 1:
     command: /Users/johnmiller/anaconda3/envs/forlingpy/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/setup.py'"'"'; __file__='"'"'/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info
         cwd: /Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/
    Complete output (6 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/setup.py", line 5
        <<<<<<< HEAD
        ^
    SyntaxError: invalid syntax
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

LinguList commented 4 years ago

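This happens when branches are merged and git cannot tell which version to keep. It then writes both versions into the file, separated by conflict markers. If you look at the script, it contains lines starting with <<<<<<<, etc.; these markers (and whichever version you don't want) need to be deleted.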
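
A quick way to spot leftover markers before installing is to scan the file for them. The following is a minimal sketch under assumptions: `find_conflict_markers` is a hypothetical helper, and the sample text is made up for illustration and is not the actual conflicted setup.py.

```python
def find_conflict_markers(text):
    """Return (line_number, line) pairs that look like leftover git merge markers."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        is_boundary = line.startswith(("<<<<<<< ", ">>>>>>> "))
        is_separator = line.rstrip() == "======="
        if is_boundary or is_separator:
            found.append((number, line))
    return found


# Hypothetical sample, for illustration only.
sample = """import pathlib
<<<<<<< HEAD
import codecs
=======
from setuptools import setup
>>>>>>> master
"""
print(find_conflict_markers(sample))
# → [(2, '<<<<<<< HEAD'), (4, '======='), (6, '>>>>>>> master')]
```

An empty result means no markers were found; it does not by itself show that the merged file is correct.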

LinguList commented 4 years ago

import codecs  # needed for codecs.open() in long_description below
import pathlib
from setuptools import setup, find_packages

# setup package name etc as a default
pkgname = 'pybor'

# The directory containing this file
LOCAL_PATH = pathlib.Path(__file__).parent

# The text of the README file
README_CONTENTS = (LOCAL_PATH / "README.md").read_text()

# Load requirements, so they are listed in a single place
REQUIREMENTS_PATH = LOCAL_PATH / "requirements.txt"
with open(REQUIREMENTS_PATH.as_posix()) as fp:
    install_requires = [dep.strip() for dep in fp.readlines()]

setup(
        name=pkgname,
        description="A Python library for monolingual borrowing detection.",
        version='0.1.1',
        packages=find_packages(where='src'),
        package_dir={'': 'src'},
        zip_safe=False,
        license="GPL",
        include_package_data=True,
        install_requires=['cldfbench', 'pyclts', 'lingpy', 'matplotlib'],
        url='https://github.com/lingpy/monolingual-borrowing-detection/',
        long_description=codecs.open('README.md', 'r', 'utf-8').read(),
        long_description_content_type='text/markdown',
        entry_points={
            'console_scripts': ['pybor=pybor.cli:main'],
        },
        author='John Miller and Tiago Tresoldi and Johann-Mattis List',
        author_email='list@shh.mpg.de',
        keywords='borrowing, language contact, sequence comparison'
        )

This is the correct version, I'll push it now. My error.

fractaldragonflies commented 4 years ago

Found out why there are incredibly good entropies for formchars and sca. The data module is delivering formchars and soundclasses with intervening spaces between symbols in the list for each token. Of course, with every other symbol a space, the entropy is reduced.

I find the data module a bit strange to fix. Any ideas, or should I just remove the extra spaces as a work-around for now?

Here is the evidence:

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pybor analyse_entropies --sequence=sca
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-05 11:25:22,139 [INFO] loaded wordlist 1814 concepts and 41 languages
sequece example: ['W', ' ', 'E', ' ', 'L', ' ', 'T'] ['P', ' ', 'L', ' ', 'E', ' ', 'N']
prob (ks stat >= 0.14827) = 0.00050
(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pybor analyse_entropies --sequence=tokens
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-05 11:25:53,044 [INFO] loaded wordlist 1814 concepts and 41 languages
sequece example: ['w', 'ɜː', 'l', 'd'] ['p', 'l', 'eɪ', 'n']
prob (ks stat >= 0.04214) = 0.52697
(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pybor analyse_entropies --sequence=formchars
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-05 11:26:25,583 [INFO] loaded wordlist 1814 concepts and 41 languages
sequece example: ['w', ' ', 'o', ' ', 'r', ' ', 'l', ' ', 'd'] ['p', ' ', 'l', ' ', 'a', ' ', 'i', ' ', 'n']
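
The work-around mentioned above could be as small as filtering out the whitespace-only items before the sequences reach the model. This is a sketch only; `drop_spaces` is a hypothetical helper and not part of the data module.

```python
def drop_spaces(segments):
    """Return the segments with whitespace-only items removed."""
    return [segment for segment in segments if segment.strip()]


# The sca example from the output above.
print(drop_spaces(['W', ' ', 'E', ' ', 'L', ' ', 'T']))
# → ['W', 'E', 'L', 'T']
```

Sequences that are already clean, such as the tokens example, pass through unchanged.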

LinguList commented 4 years ago

Please don't bother to fix it. We want to move forward with new data handling that ALWAYS provides lists. We were not testing, which is why this is happening (I had been afraid of this for the last few weeks, but did not have time to check). I will make a new data module for the new package, but for porting the code we have the development data in pybor/dev/, which is sufficient for now.

tresoldi commented 4 years ago

Yes, and the code for dealing with that (as simple and obvious as it is) is already in the reorganization I did last week. Lists all the way.

fractaldragonflies commented 4 years ago

OK. Ready to merge then, but I'm not sure what GitHub is asking me to do under "resolve conflicts". I added an import for codecs to setup.py. I suppose that is the conflict, but I don't understand what action I should take when it presents me with the file to edit.