fractaldragonflies closed this issue 4 years ago.
OK, so tomorrow, I'll return to what @LinguList has proposed and @tresoldi commented and worked on.
I receive an error using the setup.py from master when I do a local `-e` install. Not sure what to do here, so I reverted to the previous setup.py. It seems related to egg_info. Note: the install is on top of a previous install, but that doesn't cause problems when using the previous setup.py.
```
(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pip install -e .
Obtaining file:///Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection
ERROR: Command errored out with exit status 1:
command: /Users/johnmiller/anaconda3/envs/forlingpy/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/setup.py'"'"'; __file__='"'"'/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info
cwd: /Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/
Complete output (6 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/setup.py", line 5
<<<<<<< HEAD
^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
```
This happens when branches are merged and git cannot tell which version to keep, so it inserts the diff into the file: if you look at the script, it contains conflict markers (`<<<<<<<` and friends), and they need to be deleted.
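For illustration, one way to clean such a file programmatically is to drop the marker lines and keep only the HEAD ("ours") side of each conflict block. This is a minimal sketch, not part of the repository; the helper name is my own:

```python
def strip_conflict_markers(text):
    """Remove git conflict markers from `text`, keeping the HEAD side.

    A conflict block looks like:
        <<<<<<< HEAD
        (our lines)
        =======
        (their lines)
        >>>>>>> branch-name
    """
    kept = []
    keep = True  # whether the current line belongs to the side we keep
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            keep = True          # "ours" section starts here
        elif line.startswith("======="):
            keep = False         # "theirs" section starts here
        elif line.startswith(">>>>>>>"):
            keep = True          # conflict block ends
        elif keep:
            kept.append(line)
    return "\n".join(kept)
```

In practice, of course, editing the file by hand (or using `git checkout --ours`) is the usual route; this just shows what needs to go.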
```python
import codecs
import pathlib

from setuptools import setup, find_packages

# Default package name
pkgname = 'pybor'

# The directory containing this file
LOCAL_PATH = pathlib.Path(__file__).parent

# The text of the README file
README_CONTENTS = (LOCAL_PATH / "README.md").read_text()

# Load requirements, so they are listed in a single place
REQUIREMENTS_PATH = LOCAL_PATH / "requirements.txt"
with open(REQUIREMENTS_PATH.as_posix()) as fp:
    install_requires = [dep.strip() for dep in fp.readlines()]

setup(
    name=pkgname,
    description="A Python library for monolingual borrowing detection.",
    version='0.1.1',
    packages=find_packages(where='src'),
    package_dir={'': 'src'},
    zip_safe=False,
    license="GPL",
    include_package_data=True,
    install_requires=['cldfbench', 'pyclts', 'lingpy', 'matplotlib'],
    url='https://github.com/lingpy/monolingual-borrowing-detection/',
    long_description=codecs.open('README.md', 'r', 'utf-8').read(),
    long_description_content_type='text/markdown',
    entry_points={
        'console_scripts': ['pybor=pybor.cli:main'],
    },
    author='John Miller and Tiago Tresoldi and Johann-Mattis List',
    author_email='list@shh.mpg.de',
    keywords='borrowing, language contact, sequence comparison'
)
```
This is the correct version; I'll push it now. My error.
Found out why there are incredibly good entropies for formchars and sca: the data module is delivering formchars and sound classes with intervening spaces between the symbols in each token's list. With every other symbol a space, the entropy is of course reduced.
The data module is unfamiliar territory for me to fix. Any ideas, or should I just remove the extra spaces as a work-around for now?
Here is the evidence:
```
(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pybor analyse_entropies --sequence=sca
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-05 11:25:22,139 [INFO] loaded wordlist 1814 concepts and 41 languages
sequece example: ['W', ' ', 'E', ' ', 'L', ' ', 'T'] ['P', ' ', 'L', ' ', 'E', ' ', 'N']
prob (ks stat >= 0.14827) = 0.00050

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pybor analyse_entropies --sequence=tokens
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-05 11:25:53,044 [INFO] loaded wordlist 1814 concepts and 41 languages
sequece example: ['w', 'ɜː', 'l', 'd'] ['p', 'l', 'eɪ', 'n']
prob (ks stat >= 0.04214) = 0.52697

(forlingpy) johnmiller@x86_64-apple-darwin13 monolingual-borrowing-detection % pybor analyse_entropies --sequence=formchars
/Users/johnmiller/PHD-with-Lingpy/github-archive/monolingual-borrowing-detection/output
2020-05-05 11:26:25,583 [INFO] loaded wordlist 1814 concepts and 41 languages
sequece example: ['w', ' ', 'o', ' ', 'r', ' ', 'l', ' ', 'd'] ['p', ' ', 'l', ' ', 'a', ' ', 'i', ' ', 'n']
```
Please don't bother to fix it. We want to advance by making a new data handling module that always provides lists. We were not testing, which is why this happened (I was already afraid during the last weeks that this was the case, but did not have time to check). I will make a new data module for the new package; for porting the code, the development data in pybor/dev/ is sufficient for now.
Yes, and the code for dealing with that (as simple and obvious as it is) is already in the reorganization I did last week. Lists all the way.
OK, ready to merge then, but I'm not sure what GitHub is asking me to do on "resolve conflicts". I added an import for codecs to setup.py; I suppose that is the conflict, but I don't understand what action I should take when it presents me with the file to edit.
Tried my experiment for the day: ported everything related to the Markov model (still using NLTK) to the library in this branch. There is room for improvement, but moving to the 'predict' focus no longer feels like as big a leap as it did before. I leveraged work Tiago had done previously to help with the port.
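For readers outside the thread, the underlying idea is small enough to sketch without NLTK: train a character bigram model on a set of segment lists and score new tokens by per-symbol cross-entropy. This is my own illustrative sketch, not the ported code; the function names and the add-one smoothing choice are assumptions:

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count bigram and left-context frequencies over segment lists,
    padding each token with start/end markers."""
    bigrams, contexts = Counter(), Counter()
    vocab = set()
    for token in tokens:
        padded = ['<s>'] + token + ['</s>']
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)

def cross_entropy(token, bigrams, contexts, vocab_size):
    """Per-symbol cross-entropy (bits) of one token under the model,
    with add-one (Laplace) smoothing for unseen transitions."""
    padded = ['<s>'] + token + ['</s>']
    logp = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        logp += math.log2(p)
    return -logp / (len(padded) - 1)
```

Used for borrowing detection, one would train on (presumed) native tokens and flag tokens whose cross-entropy is unusually high as candidate loans.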
I got errors trying to work with the new setup.py file, so I reverted to the previous setup.py since I wasn't working on pybor for the time being.
I recommend NOT trying the analysis with native basis and n > 100: the randomization test takes a lot of time. I'm also not sure this is the right command for the randomization test of whether the distributions are different, since the graphics are fast but the randomization test with native basis is very slow.
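For context, the kind of randomization test discussed here can be sketched as follows: compute the observed two-sample KS statistic, then repeatedly shuffle the pooled entropies and count how often a shuffled split produces a statistic at least as large. This is my own sketch under stated assumptions, not the library's implementation; note the statistic is recomputed on every shuffle, which is exactly why large n is slow:

```python
import random

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two empirical CDFs."""
    d = 0.0
    for v in xs + ys:
        fx = sum(1 for x in xs if x <= v) / len(xs)
        fy = sum(1 for y in ys if y <= v) / len(ys)
        d = max(d, abs(fx - fy))
    return d

def permutation_pvalue(xs, ys, n=1000, seed=42):
    """Randomization test: the fraction of pooled shuffles whose KS
    statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return observed, hits / n
```

With m entropies per sample this costs O(n * m^2) as written, so for real use one would sort once per shuffle (or use scipy's `ks_2samp`) rather than this naive loop.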
Tested with English and once with Hup, with all of formchars, tokens, and sca. Got expected results for tokens, but unexpectedly low average entropies for formchars and sca. I will investigate whether I broke something or whether I am still getting the expected formchars from the wordlists. Of course we want to drop formchars in favor of something already prepared from the wordlists; I'll be quite pleased to move on in that respect.
detect native-loan basis:
detect native basis:
2-fold validation detect native-loan basis - lacks decent formatting:
2-fold validation detect native basis - with sca - lacks decent formatting:
2-fold entropy reports of training and validation entropies, basis all and basis native (average entropy is too good to be true for formchars):