First run on an orthography profile

LinguList commented 3 years ago

I made a preliminary orthography profile for the 20-odd languages. @fractaldragonflies, to run this code, please do:

$ git clone https://github.com/intercontinental-dictionary-series/keypano.git
$ git clone https://github.com/concepticon/concepticon-data.git
$ git clone https://github.com/glottolog/glottolog.git
$ git clone https://github.com/cldf-clts/clts.git
$ cd keypano
$ git submodule init
$ git submodule update
$ pip install -e raw/ids/
$ pip install -e ./
$ cldfbench download lexibank_keypano.py
$ cldfbench lexibank.makecldf --concepticon=../concepticon --clts=../clts --glottolog=../glottolog --concepticon-version=v2.4.0 --glottolog-version=v4.3 --clts-version=v2.1.0 lexibank_keypano.py

In this way, you can check progress on the orthography profile in etc/orthography.tsv, and on the language-specific profiles, such as etc/orthography/Spanish.tsv.

The latter is currently being downloaded, using the script in raw/getphonetics.py. This can also be tweaked to account for Portuguese. Download is item-by-item and slow. But we only need the phonetics once.
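To follow progress on a profile without rerunning the whole conversion, one option is to read the TSV directly and list graphemes that have no transcription yet. The snippet below is only a sketch: the column names `Grapheme` and `IPA` are assumptions about the profile layout, and both helper functions are hypothetical names, not part of cldfbench or pylexibank.

```python
import csv
from pathlib import Path


def load_profile(path):
    """Read a tab-separated orthography profile into a list of row dicts."""
    with Path(path).open(encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def unmapped_graphemes(rows, grapheme_col="Grapheme", ipa_col="IPA"):
    """Return graphemes whose IPA cell is empty (column names are assumed)."""
    return [
        row[grapheme_col]
        for row in rows
        if not (row.get(ipa_col) or "").strip()
    ]


if __name__ == "__main__":
    # Path taken from the comment above; run this from the keypano directory.
    profile = Path("etc/orthography.tsv")
    if profile.exists():
        rows = load_profile(profile)
        missing = unmapped_graphemes(rows)
        print(len(rows), "graphemes;", len(missing), "without IPA")
```

The same two functions can be pointed at etc/orthography/Spanish.tsv once that file has been filled in.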

fractaldragonflies commented 3 years ago

Thank you Mattis.

Downloading glottolog as I write. Created a different env and directory just for this… and similar work.
Already had some of these downloaded and installed, but better to start from scratch here for now I think.

I noticed in the South American WOLD files that several languages had IDS-IDs that do not map to a concepticon_ID. Spanish ‘calma’ is one such case, but some languages have many more.

It is wonderful to have all these tools.

There is a steep learning curve to having all this on hand and understanding how to use it effectively. But I see from what you did on the Saphon data how effective one can be with access to and mastery of these tools.

OK here are install results

Received this warning during installation of keypano, both for raw/ids/ and for ./

WARNING: Value for scheme.headers does not match. Please report this to https://github.com/pypa/pip/issues/9617
distutils: /Users/johnmiller/opt/miniforge3/envs/ling/include/python3.9/UNKNOWN
sysconfig: /Users/johnmiller/opt/miniforge3/envs/ling/include/python3.9
WARNING: Additional context:
user = False
home = None
root = None
prefix = None

OK

Ooooops. Error in execution of the final command, cldfbench lexibank.makecldf. I changed the reference from ../concepticon to ../concepticon-data and the command began processing, but then it errored out with this trace:

(ling) johnmiller@Johns-M1-Fractal-Dragon keypano % cldfbench lexibank.makecldf --concepticon=../concepticon-data --clts=../clts --glottolog=../glottolog --concepticon-version=v2.4.0 --glottolog-version=v4.3 --clts-version=v2.1.0 lexibank_keypano.py
INFO running _cmd_makecldf on keypano ...
INFO added sources
INFO added languages
Traceback (most recent call last):
  File "/Users/johnmiller/opt/miniforge3/envs/ling/bin/cldfbench", line 8, in <module>
    sys.exit(main())
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/cldfbench/__main__.py", line 78, in main
    return args.main(args) or 0
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/commands/makecldf.py", line 23, in run
    with_dataset(args, 'makecldf', dataset=dataset)
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/cldfbench/cli_util.py", line 100, in with_dataset
    res = func(*arg, args)
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/dataset.py", line 217, in _cmd_makecldf
    super()._cmd_makecldf(args)
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/cldfbench/dataset.py", line 214, in _cmd_makecldf
    self.cmd_makecldf(args)
  File "./lexibank_keypano.py", line 75, in cmd_makecldf
    args.writer.add_form(
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/cldf.py", line 189, in add_form
    self.tokenize(kw, form, **(dict(profile=profile) if profile else {})) or [])
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/cldf.py", line 119, in tokenize
    if self.dataset.tokenizer:
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/clldutils/misc.py", line 195, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/dataset.py", line 174, in tokenizer
    for k, p in self.orthography_profile_dict.items()}
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/clldutils/misc.py", line 195, in __get__
    result = instance.__dict__[self.__name__] = self.fget(instance)
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/dataset.py", line 147, in orthography_profile_dict
    return {k: Profile.from_file(str(p), form='NFC') for k, p in res.items()}
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/dataset.py", line 147, in <dictcomp>
    return {k: Profile.from_file(str(p), form='NFC') for k, p in res.items()}
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/segments/profile.py", line 115, in from_file
    res = cls(
  File "/Users/johnmiller/opt/miniforge3/envs/ling/lib/python3.9/site-packages/pylexibank/profile.py", line 38, in __init__
    default_spec = list(next(iter(self.graphemes.values())).keys())
StopIteration

OK, maybe it doesn’t like that the env defaulted to Python 3.9 when I created it [which runs natively on my M1].

Maybe tomorrow (Sunday) I’ll create an env with an earlier Python 3 to see what transpires.

I’ll keep you posted. OK, a Zoom meeting with Roberto on our FST morphology paper! And then a movie on Netflix.

Have a good Sunday, Mattis!

John Miller

LinguList commented 3 years ago

The problem is not Python 3.9; I also use 3.9.2. The problem was that I added the Spanish-specific profile later, and it is not finished yet. Now that I have fixed this temporarily, you can start ;)
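For anyone hitting the same trace: the last frame calls `next(iter(self.graphemes.values()))`, and `next` raises StopIteration when the mapping is empty. That is consistent with a profile file that has no grapheme rows yet, although the thread only says the Spanish profile was unfinished, so treating it as empty is an assumption. The sketch below reproduces the exception on an empty mapping and scans a directory for header-only profiles; `first_grapheme_spec` and `empty_profiles` are hypothetical helpers, not pylexibank functions.

```python
import csv
from pathlib import Path


def first_grapheme_spec(graphemes):
    """Mimic the failing expression: column names of the first grapheme entry."""
    return list(next(iter(graphemes.values())).keys())


def empty_profiles(directory):
    """Return names of *.tsv files in `directory` that have no data rows."""
    empty = []
    for path in sorted(Path(directory).glob("*.tsv")):
        with path.open(encoding="utf-8", newline="") as handle:
            rows = [
                row
                for row in csv.reader(handle, delimiter="\t")
                if any(cell.strip() for cell in row)
            ]
        if len(rows) <= 1:  # header only, or nothing at all
            empty.append(path.name)
    return empty


# An empty mapping reproduces the exception type seen in the traceback.
try:
    first_grapheme_spec({})
except StopIteration:
    print("StopIteration: no graphemes in profile")

if __name__ == "__main__":
    # Directory name taken from the thread; run from the keypano directory.
    if Path("etc/orthography").is_dir():
        print(empty_profiles("etc/orthography"))
```

Checking etc/orthography/ this way before running makecldf would flag an unfinished profile without needing to read the full trace.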

fractaldragonflies commented 3 years ago

Success…

I suppose the reported segment errors (Segments: 167 (72 BIPA errors, 72 CTLS sound class errors, 95 CLTS modified)) are expected.

(ling) johnmiller@Johns-M1-Fractal-Dragon keypano % cldfbench lexibank.makecldf --concepticon=../concepticon-data --clts=../clts --glottolog=../glottolog --concepticon-version=v2.4.0 --glottolog-version=v4.3 --clts-version=v2.1.0 lexibank_keypano.py
INFO running _cmd_makecldf on keypano ...
INFO added sources
INFO added languages
INFO file written: cldf/.transcription-report.json
INFO Summary for dataset cldf/cldf-metadata.json

Varieties: 22
Concepts: 1,310
Lexemes: 23,232
Sources: 0
Synonymy: 1.24
Invalid lexemes: 0
Tokens: 148,541
Segments: 167 (72 BIPA errors, 72 CTLS sound class errors, 95 CLTS modified)
Inventory size (avg): 38.86

INFO file written: TRANSCRIPTION.md
INFO ... done keypano [14.7 secs]
WARNING The dataset has no sources at cldf/sources.bib
(ling) johnmiller@Johns-M1-Fractal-Dragon keypano %
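To track how the segment counts change as the profile improves, the summary line can be parsed into numbers. This is a small sketch that works only on the text of the "Segments:" line as printed above; `parse_segments` is a hypothetical helper, and it makes no claim about what error share is acceptable.

```python
import re

# Summary line copied verbatim from the makecldf output above.
SUMMARY = "Segments: 167 (72 BIPA errors, 72 CTLS sound class errors, 95 CLTS modified)"


def parse_segments(line):
    """Extract the four counts from a 'Segments:' summary line."""
    # Numbers may carry thousands separators (e.g. "1,310"), so strip commas.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"[\d,]*\d", line)]
    total, bipa_errors, sound_class_errors, modified = numbers
    return {
        "total": total,
        "bipa_errors": bipa_errors,
        "sound_class_errors": sound_class_errors,
        "modified": modified,
    }


stats = parse_segments(SUMMARY)
share = stats["bipa_errors"] / stats["total"]
print(f"{stats['bipa_errors']} of {stats['total']} segments have BIPA errors ({share:.1%})")
```

Rerunning this after each profile update gives a quick view of whether the error count is going down.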

LinguList commented 3 years ago

Yes, for sure. We probably need to ignore Spanish and Portuguese at first and introduce them later. My plan is then to put the dataset on EDICTOR, where we can annotate borrowings manually, and cognates as well, to have a better test set.
