Closed LinguList closed 2 years ago
You find it in etc/orthography.tsv.
Thanks for this. The parser returned several clusters of graphemes, such as <ayú>. For the initial stage of the orthography profile, should we simply convert every grapheme to its IPA representation, such as ayú > ajú? Ideally we don't want them to be treated as clusters of vowels/diphthongs when tokenized, so ultimately we will need something like
My question is precisely whether we should first do step 2 in order to arrive at step 3 as given above.
No. The problem is that we cannot guarantee that 'y' is always 'j'. If this WERE the case — and you have to be really sure — we can do this via our lexibank code.
If you want me to do this, you can even provide an additional list of rough replacements (but they should be all unique).
Where should we provide the conversion table from Koch-Grünberg's graphemes to IPA? In the raw folder or in etc? (We have it almost done.)
etc/orthography.tsv. This is where you find the version that I produced for you before.
Thanks, but we agreed we would provide you with grapheme-to-phoneme maps that would make the orthography.tsv simpler, do you recall? We have that file now and wanted to share it with you so you can run the script again...
Ah, yes, such a long time ago. You can paste this into raw/preprocess-sounds.tsv or similar!
Great. We'll let you know when @MottaAM has added it.
I just finished converting the symbols from Koch-Grünberg's notation to IPA. I also made a sheet with the description he gives for each symbol. There are two new files in the raw folder. The one ready to run the script on is https://github.com/lexibank/kochgruenbergtukanoan/blob/main/raw/preprocess-sounds.tsv. The other has additional metadata. You may run the script now. But if I wanted to run the conversion script myself, how would I do it?
@MottaAM, the replacements contain recursion, which is of course not going to work: you have a -> aː and also a -> a. These are cases that cannot be handled in this form, since the source occurs in the target, or a string is replaced by itself:
for source, target in replacements:
string = string.replace(source, target)
So you need to thoroughly clean the entries you provided and make sure that the source form really does not occur in the target form, otherwise it will be replaced again (!). I suggest converting to an intermediate format for now and doing the real work with the orthography profile, as we can clearly see that this form does not work as easily as was thought.
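To make the failure mode concrete, here is a minimal sketch (with made-up rules, not the actual preprocess-sounds.tsv content) contrasting the sequential replacement loop above with a single-pass, longest-first replacement that avoids the recursion:

```python
import re

# Hypothetical rules illustrating the problem: the source "e" occurs in
# the target "eː", and "é" (e + combining acute U+0301) contains "e".
rules = [("e", "e\u02d0"), ("e\u0301", "e\u0301\u02d0")]

def naive_replace(string, rules):
    # The problematic sequential approach: each replacement re-scans
    # the output of the previous one.
    for source, target in rules:
        string = string.replace(source, target)
    return string

def single_pass_replace(string, rules):
    # Safer alternative: match all sources in a single pass, longest
    # source first, so no target is ever re-scanned by a later rule.
    mapping = dict(rules)
    pattern = re.compile("|".join(
        re.escape(s) for s in sorted(mapping, key=len, reverse=True)))
    return pattern.sub(lambda m: mapping[m.group(0)], string)
```

With these rules, naive_replace strands the accent after the length mark (é becomes eː́, so the é rule never fires), while single_pass_replace yields the intended éː.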
To replicate, pull the repository:
$ git pull
Then run (as I mentioned before)
$ cldfbench lexibank.makecldf lexibank_kochgrunbergtukanoan.py
Then, to check the errors, please run:
$ cldfbench lexibank.check_profile lexibank_kochgrunbergtukanoan.py
I paste the errors below (they are all created by the recursive replacement):
Grapheme | Grapheme-UC | BIPA | BIPA-UC | Modified | Segments | Graphemes | Count |
---|---|---|---|---|---|---|---|
ẽ́ː | U+0065 U+0303 U+0301 U+02d0 | ẽ́ː | U+0065 U+0303 U+0301 U+02d0 | k <<ɔ>> m ẽ́ː p e <<ɾ>> <<ɔ>> | kɔmeẽ́peɾɔ | 1 | |
ae | U+0061 U+0065 | ae | U+0061 U+0065 | i <<ɔ>> t a <<ħ>> ae | iɔtaħae | 1 | |
ei̯ | U+0065 U+0069 U+032f | ei̯ | U+0065 U+0069 U+032f | k <<ɔ>> <<̄>> <<́>> <<ɾ>> <<ɔ>> k <<ɔ>> <<ɺ>> ei̯ k ia | kɔ̄́ɾɔkɔɺei̯kia | 1 | |
uiː | U+0075 U+0069 U+02d0 | uiː | U+0075 U+0069 U+02d0 | <<j>> … | juxkɨtsiɾuii | 1 | |
ĩẽ́ | U+0069 U+0303 U+0065 U+0303 U+0301 | ĩẽ́ | U+0069 U+0303 U+0065 U+0303 U+0301 | m a x ĩẽ́ | maxĩẽ́ | 1 | |
ẽã | U+0065 U+0303 U+0061 U+0303 | ẽã | U+0065 U+0303 U+0061 U+0303 | <<j>> … | jãĩmẽã | 1 | |
ẽ́ã | U+0065 U+0303 U+0301 U+0061 U+0303 | ẽ́ã | U+0065 U+0303 U+0301 U+0061 U+0303 | m ẽ́ã | mẽ́ã | 1 | |
ẽ́ẽ | U+0065 U+0303 U+0301 U+0065 U+0303 | ẽ́ẽ | U+0065 U+0303 U+0301 U+0065 U+0303 | <<j>> … | jiːmasaŋẽ́ẽ | 1 | |
ĩ́ːã | U+0069 U+0303 U+0301 U+02d0 U+0061 U+0303 | ĩ́ːã | U+0069 U+0303 U+0301 U+02d0 U+0061 U+0303 | ã m ĩ́ːã | ãmĩĩ́ã | 1 | |
ãẽ́ | U+0061 U+0303 U+0065 U+0303 U+0301 | ãẽ́ | U+0061 U+0303 U+0065 U+0303 U+0301 | <<ɨ>> m ãẽ́ g e <<ɾ>> <<ɔ>> | ɨmãẽ́geɾɔ | 1 | |
á | U+0061 U+0301 | á | U+0061 U+0301 | g <<ɔ>> xh á <<ɺ>> i | gɔxháɺi | 1 | |
iaː | U+0069 U+0061 U+02d0 | iaː | U+0069 U+0061 U+02d0 | h <<ɔ>> <<̄>> <<́>> k iaː | hɔ̄́kiaa | 1 | |
ẽ́ | U+0065 U+0303 U+0301 | ẽ́ | U+0065 U+0303 U+0301 | k <<ɔ>> m ẽ́ p e <<ɾ>> <<ɔ>> | kɔmẽ́peɾɔ | 2 | |
ie̯ | U+0069 U+0065 U+032f | ie̯ | U+0069 U+0065 U+032f | s iua <<ħ>> i k ie̯ | siuaħikie̯ | 2 | |
ẽá | U+0065 U+0303 U+0061 U+0301 | ẽá | U+0065 U+0303 U+0061 U+0301 | m ẽá p a t <<ɔ>> <<ɺ>> e | mẽápatɔɺe | 2 | |
ĩ́ | U+0069 U+0303 U+0301 | ĩ́ | U+0069 U+0303 U+0301 | ua x p ĩ́ <<ɔ>> <<̃>> <<̄>> p <<ɛ>> | uaxpĩ́ɔ̃̄pɛ | 2 | |
aːi̯ | U+0061 U+02d0 U+0069 U+032f | aːi̯ | U+0061 U+02d0 U+0069 U+032f | m a <<ɺ>> aːi̯ d <<j>> … | maɺaai̯dju | 2 | |
ũẽ́ | U+0075 U+0303 U+0065 U+0303 U+0301 | ũẽ́ | U+0075 U+0303 U+0065 U+0303 U+0301 | p ũẽ́ | pũẽ́ | 2 | |
ei | U+0065 U+0069 | ei | U+0065 U+0069 | ts ei | tsei | 2 | |
iu̯ | U+0069 U+0075 U+032f | iu̯ | U+0069 U+0075 U+032f | h i <<ː>> <<́>> n iu̯ | hiː́niu̯ | 2 | |
ue̯ | U+0075 U+0065 U+032f | ue̯ | U+0075 U+0065 U+032f | w iː <<ː>> <<́>> ue̯ <<ç>> k a | wiiː́ue̯çka | 2 | |
ĩe | U+0069 U+0303 U+0065 | ĩe | U+0069 U+0303 U+0065 | a <<ː>> b <<ɛ>> _ ts i <<ː>> n ĩe | aːbɛ_tsiːnĩe | 3 | |
ĩ́ã | U+0069 U+0303 U+0301 U+0061 U+0303 | ĩ́ã | U+0069 U+0303 U+0301 U+0061 U+0303 | <<j>> … | jamĩ́ã | 4 | |
ãẽ | U+0061 U+0303 U+0065 U+0303 | ãẽ | U+0061 U+0303 U+0065 U+0303 | k <<ɔ>> a m ãẽ | kɔamãẽ | 4 | |
iːa | U+0069 U+02d0 U+0061 | iːa | U+0069 U+02d0 U+0061 | <<j>> … | jatau̯iia | 5 | |
ui̯ | U+0075 U+0069 U+032f | ui̯ | U+0075 U+0069 U+032f | ui̯ p <<ɔ>> a | ui̯pɔa | 5 | |
i̯ | U+0069 U+032f | i̯ | U+0069 U+032f | <<j>> … | jeɨtsiɛi̯ | 7 | |
ie | U+0069 U+0065 | ie | U+0069 U+0065 | d ie <<ː>> <<́>> <<j>> … | dieː́jpɔ̄ɺeɾu | 11 | |
au̯ | U+0061 U+0075 U+032f | au̯ | U+0061 U+0075 U+032f | d i p au̯ i <<j>> … | dipau̯ija | 12 | |
ai̯ | U+0061 U+0069 U+032f | ai̯ | U+0061 U+0069 U+032f | ua h <<ɔ>> a g a m ai̯ | uahɔagamai̯ | 15 | |
ea | U+0065 U+0061 | ea | U+0065 U+0061 | i h ia <<j>> … | ihiajeaː́ika | 17 | |
ue | U+0075 U+0065 | ue | U+0075 U+0065 | ue <<ː>> <<ɾ>> e <<ɾ>> i <<ɾ>> u | ueːɾeɾiɾu | 18 | |
iu | U+0069 U+0075 | iu | U+0069 U+0075 | ts iu p u <<ː>> <<́>> <<ɺ>> i <<ɾ>> u | tsiupuː́ɺiɾu | 22 | |
au | U+0061 U+0075 | au | U+0061 U+0075 | <<j>> … | jauɨ | 44 | |
ai | U+0061 U+0069 | ai | U+0061 U+0069 | h <<ɔ>> a t a n i k e <<ɾ>> <<ɔ>> k ai k a k a | hɔatanikeɾɔkaikaka | 49 | |
ui | U+0075 U+0069 | ui | U+0075 U+0069 | k ui <<ː>> <<́>> <<ɾ>> i | kuiː́ɾi | 54 | |
ua | U+0075 U+0061 | ua | U+0075 U+0061 | ua <<j>> … | uajupɔna | 164 | |
ia | U+0069 U+0061 | ia | U+0069 U+0061 | s ia m <<ɛ>> <<ɺ>> a k <<ɔ>> | siamɛɺakɔ | 237 |
Grapheme | Diacritics | Unicode | Segments | Graphemes | Count |
---|---|---|---|---|---|
ai̯á | ◌ai̯á | U+0061 U+0069 U+032f U+0061 U+0301 | d i s <<ç>> s i p u <<ɺ>> i ts ai̯á n i d e <<ç>> k a | disçsipuɺitsai̯ánideçka | 1 |
uiua | ◌uiua | U+0075 U+0069 U+0075 U+0061 | uiua h <<ɔ>> a | uiuahɔa | 1 |
š | ◌š | U+0073 U+030c | h u <<ʔ>> t š ia | huʔtšia | 1 |
uau | ◌uau | U+0075 U+0061 U+0075 | uau | uau | 1 |
iau̯i | ◌iau̯i | U+0069 U+0061 U+0075 U+032f U+0069 | d iau̯i k <<ɨ>> | diau̯ikɨ | 1 |
uaua | ◌uaua | U+0075 U+0061 U+0075 U+0061 | <<ɔ>> k <<ɔ>> p u k <<ɺ>> uaua | ɔkɔpukɺuaua | 1 |
uaie | ◌uaie | U+0075 U+0061 U+0069 U+0065 | uaie | uaie | 1 |
uaiua | ◌uaiua | U+0075 U+0061 U+0069 U+0075 U+0061 | uaiua k a | uaiuaka | 1 |
au̯i | ◌au̯i | U+0061 U+0075 U+032f U+0069 | au̯i t i <<ɾ>> <<ɨ>> | au̯itiɾɨ | 1 |
p̌ | ◌p̌ | U+0070 U+030c | b i t ai̯ g <<ɔ>> p̌ <<ɛ>> k a | bitai̯gɔp̌ɛka | 2 |
nh | ◌nh | U+006e U+0068 | p <<ɛ>> a nh ua | pɛanhua | 2 |
eau | ◌eau | U+0065 U+0061 U+0075 | k eau d <<j>> … | keaudjɨ | 2
mh | ◌mh | U+006d U+0068 | k a <<ː>> mh a <<ɺ>> u | kaːmhaɺu | 4 |
ḳ | ◌ḳ | U+006b U+0323 | <<j>> … | jɔḳɔɾɔ | 4
aue̯ | ◌aue̯ | U+0061 U+0075 U+0065 U+032f | p <<ɛ>> k aue̯ | pɛkaue̯ | 4 |
iuiia | ◌iuiia | U+0069 U+0075 U+0069 U+0069 U+0061 | a p i k a <<ɺ>> i k a _ t <<ɛ>> m u <<j>> … | apikaɺika_tɛmujiː́ɺiuiia | 4
iai | ◌iai | U+0069 U+0061 U+0069 | s i k iai <<ː>> <<́>> <<ɾ>> <<ɨ>> | sikiaiː́ɾɨ | 5 |
iai̯ | ◌iai̯ | U+0069 U+0061 U+0069 U+032f | <<j>> … | jamigakiai̯dja | 5
uia | ◌uia | U+0075 U+0069 U+0061 | m uia | muia | 5 |
aia | ◌aia | U+0061 U+0069 U+0061 | aia n a m a t i | aianamati | 7 |
uai̯ | ◌uai̯ | U+0075 U+0061 U+0069 U+032f | uai̯ <<ɾ>> u | uai̯ɾu | 7 |
au̯a | ◌au̯a | U+0061 U+0075 U+032f U+0061 | <<j>> … | jau̯aːɺaka | 8
xh | ◌xh | U+0078 U+0068 | ts i <<ː>> u n d u xh a <<ː>> <<́>> k <<ɔ>> | tsiːunduxhaː́kɔ | 11 |
iua | ◌iua | U+0069 U+0075 U+0061 | s iua <<ħ>> i k ie̯ | siuaħikie̯ | 14 |
aua | ◌aua | U+0061 U+0075 U+0061 | p i t aua <<ħ>> <<ɔ>> a | pitauaħɔa | 15 |
xs | ◌xs | U+0078 U+0073 | g <<ɔ>> xs <<ɔ>> | gɔxsɔ | 21 |
uai | ◌uai | U+0075 U+0061 U+0069 | uai p i k <<ɔ>> a | uaipikɔa | 24 |
Grapheme | Diacritics | Unicode | Segments | Graphemes | Count |
---|---|---|---|---|---|
̥ | ◌̥ | U+0325 | b e <<ː>> <<̥>> g <<ɨ>> | beː̥gɨ | 2 |
ɑ | ◌ɑ | U+0251 | k <<ɑ>> u | kɑu | 5 |
̃ | ◌̃ | U+0303 | <<ɔ>> <<̃>> ã d iː <<ɺ>> <<ɨ>> | ɔ̃ãdiiɺɨ | 12 |
ŋ | ◌ŋ | U+014b | i <<ː>> <<́>> <<ŋ>> i n u | iː́ŋinu | 18 |
̯ | ◌̯ | U+032f | s i g <<ɔ>> i <<̯>> t a g ia <<ħ>> <<ɔ>> i <<ɾ>> i | sigɔi̯tagiaħɔiɾi | 24 |
ʊ | ◌ʊ | U+028a | s i <<ɾ>> i s <<ɛ>> p <<ʊ>> | siɾisɛpʊ | 34 |
ʔ | ◌ʔ | U+0294 | <<j>> … | jaʔkɔa | 51
̠ | ◌̠ | U+0320 | <<ɔ>> <<̄>> <<́>> m <<ɛ>> t e <<ː>> <<̠>> <<́>> n i | ɔ̄́mɛteː̠́ni | 67 |
ħ | ◌ħ | U+0127 | s i k <<ɔ>> <<ħ>> i <<ɾ>> i | sikɔħiɾi | 78 |
ç | ◌ç | U+0063 U+0327 | d i <<ç>> s i <<ː>> <<́>> <<ɾ>> <<ɔ>> | diçsiː́ɾɔ | 110 |
̄ | ◌̄ | U+0304 | d <<ɔ>> <<̄>> <<́>> <<ɺ>> <<ɔ>> | dɔ̄́ɺɔ | 199 |
ɾ | ◌ɾ | U+027e | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 368 |
ɨ | ◌ɨ | U+0268 | e <<ː>> <<́>> g <<ɨ>> <<ɺ>> e | eː́gɨɺe | 446 |
ɛ | ◌ɛ | U+025b | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 535 |
j | ◌j | U+006a | <<j>> … | jeː́ɺɨ | 537
́ | ◌́ | U+0301 | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 581 |
ɺ | ◌ɺ | U+027a | d <<ɔ>> <<̄>> <<́>> <<ɺ>> <<ɔ>> | dɔ̄́ɺɔ | 600 |
ː | ◌ː | U+02d0 | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 834 |
ɔ | ◌ɔ | U+0254 | d <<ɔ>> <<̄>> <<́>> <<ɺ>> <<ɔ>> | dɔ̄́ɺɔ | 863 |
To explain what happens here: you convert e to eː, but you also want to convert é to eː; since the replacement of e applies first, the accent is shifted, and é can no longer be found, etc.
That's why we carefully distinguish replacements (where source != target, and the source does not occur in the target string) from orthography profiles.
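To illustrate the distinction: an orthography profile is applied by tokenizing the input once with longest-match lookup, so no output is ever re-scanned. A toy sketch of that idea (not the actual segments/pylexibank implementation; the profile entries are made up):

```python
def tokenize(form, profile):
    # Greedy longest-match tokenization against an orthography profile
    # (a dict mapping graphemes to IPA). Characters not covered by the
    # profile are wrapped in <<...>>, mimicking the check_profile output.
    tokens, i = [], 0
    longest = max(len(g) for g in profile)
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk])
                i += size
                break
        else:  # no grapheme matched at position i
            tokens.append(f"<<{form[i]}>>")
            i += 1
    return tokens
```

Because é (e + U+0301) is matched as a whole grapheme before the shorter e, the accent-shift problem of sequential replacement cannot occur here.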
@thiagochacon @MottaAM, did you see my note?
Yes. @MottaAM is working on creating a new set of tables to iterate the process in a non-conflicting way. He is also trying to code it as you instructed. @MottaAM, could you share the results you got?
@LinguList: Yes, I did see the note. Sorry for the delay. I started working on some transcriptions from Siriano, so I didn't respond to the note immediately. I'm still trying to run the code by myself, but I keep getting the error message below. I'm not sure if I'm creating the virtual environment correctly either.
You need to run "cldfbench catconfig" first, to configure the locations of Concepticon, CLTS, and Glottolog. I discuss this in parts, with additional links, here: https://calc.hypotheses.org/2954
Thank you for helping! I managed to run the code. I'll start testing different sets of tables right away. I've also noticed that there is a typo in the name of the repository. It's written 'kochgruEnbergtukanoan' instead of 'kochgrunbergtukanoan'
The output is big. How can I send it to you just to confirm that I did everything right? I didn't change the conversion table, so it will have the same problem as before
What do you mean by "output": what the terminal says? As for the typo: we write the umlaut ü as ue in German, and in English as well, so I do not consider it a typo.
That's interesting. I didn't know about the 'ue'. When copying the instructions, the words didn't match, so I thought it was a typo. This won't be a problem now. Yes, I mean what the terminal says.
Is there an error? I'd ask you to just paste the last 50+ lines of terminal output here.
here it is
Hm, do you have difficulties copying text from your console?
But thanks: the error is pretty clear:
invalid glottocode jupu1235
So you should first check the glottocode.
Missing sources is something we can ignore for now.
I just tried copying directly from the console and it worked. I'll copy directly from now on.
I looked for the glottocode of the Yupua language and it's jupu1235. That's odd.
If you look closely at the glottocode, it is "jupu1235 " (note the final space), so you need to delete the space in the file etc/languages.csv.
I fixed it and ran the code again. I'll paste the new output below.
WARNING forms.csv:2291:Source missing source key: KochGrünberg2014
WARNING forms.csv:2292:Source missing source key: KochGrünberg2014
WARNING forms.csv:2293:Source missing source key: KochGrünberg2014
WARNING forms.csv:2294:Source missing source key: KochGrünberg2014
WARNING forms.csv:2295:Source missing source key: KochGrünberg2014
WARNING forms.csv:2296:Source missing source key: KochGrünberg2014
WARNING forms.csv:2297:Source missing source key: KochGrünberg2014
WARNING forms.csv:2298:Source missing source key: KochGrünberg2014
WARNING forms.csv:2299:Source missing source key: KochGrünberg2014
WARNING forms.csv:2300:Source missing source key: KochGrünberg2014
WARNING forms.csv:2301:Source missing source key: KochGrünberg2014
WARNING forms.csv:2302:Source missing source key: KochGrünberg2014
WARNING forms.csv:2303:Source missing source key: KochGrünberg2014
WARNING forms.csv:2304:Source missing source key: KochGrünberg2014
Traceback (most recent call last):
File "/home/myrho/python-virtual-environments/env/bin/cldfbench", line 8, in <module>
sys.exit(main())
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/cldfbench/__main__.py", line 81, in main
return args.main(args) or 0
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/pylexibank/commands/makecldf.py", line 24, in run
with_dataset(args, 'makecldf', dataset=dataset)
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/cldfbench/cli_util.py", line 153, in with_dataset
res = func(*arg, args)
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/pylexibank/dataset.py", line 231, in _cmd_makecldf
assert self.cldf_reader().validate(args.log)
AssertionError
I'll fix now and then let you know, give me 5 minutes.
Please git-pull what I just modified, the code should run now without problems.
Apparently it generated a file, but there is one error. I think it worked.
['ua', 'x', 'p', 'i', '<<ː>>', '<<́>>', 'k', 'i', '<<ɔ>>', '<<ɺ>>', 'i']
['ts', 'iu', '<<ː>>', '<<́>>', 'p', 'u', '<<ɺ>>', 'i']
['t', 'a', '<<ː>>', '<<ɺ>>', 'au', '<<ɔ>>', '<<̯>>', '<<ɺ>>', 'i']
['t', '<<ɨ>>', 'x', 't', 'aː', 'i', 't', '<<ç>>', '<<ɨ>>']
['u', 'n', 'ui']
['<<j>>', 'e', '<<ç>>', 's', 'a', '<<ɺ>>', 'i', 'p', '<<ɨ>>', 'a', '<<ɺ>>', 'i']
['i', 'n', 'a']
['ts', 'a', 'h', 'a']
['<<ɺ>>', 'a', 'h', '<<ɔ>>']
['t', 'ea']
['d', 'a', 'h', '<<ɔ>>']
['b', 'a', '<<ɺ>>', 'a', '<<ɺ>>', 'i', 'k', 'e', '<<ː>>', '<<̠>>', '<<́>>']
INFO file written: /home/myrho/projeto/kochgruenbergtukanoan/cldf/.transcription-report.json
INFO Summary for dataset /home/myrho/projeto/kochgruenbergtukanoan/cldf/cldf-metadata.json
- **Varieties:** 4
- **Concepts:** 805
- **Lexemes:** 2,303
- **Sources:** 21
- **Synonymy:** 1.17
INFO file written: /home/myrho/projeto/kochgruenbergtukanoan/TRANSCRIPTION.md
INFO file written: /home/myrho/projeto/kochgruenbergtukanoan/cldf/lingpy-rcParams.json
INFO ... done kochtukanoan [54.2 secs]
WARNING Error importing kochgruenbergtukanoan: No module named 'lexibank_kochgruenbergtukanoan'
Yes, you now have to do:
pip uninstall kochgruenbergtukanoan
pip install -e .
I changed the names, so we have "kochtukanoan" now, as the old name was too long.
I now decided to fix the orthography profile again, so please git-pull again. I just used YOUR replacements and put them in etc/orthography.tsv instead, adding some new ones that were missing. I'd ask you to look at the cases with a ? in the IPA column and add the correct IPA accordingly. This way, we can see how well the data is converted.
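For orientation, the entries in question look roughly like this (illustrative rows, not the actual file contents): a Grapheme column with Koch-Grünberg's symbols and an IPA column, with ? marking the unresolved cases to be filled in:

```tsv
Grapheme	IPA
y	j
ai	ai̯
ã̄	?
```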
Thank you. I'm going to look into it after the New Year. Happy New Year to you!
Great to see things are progressing. On Monday I will be back in my office and can help with whatever is needed. Happy New Year to the both of you!
I fixed the cases with the '?'. I've been wondering whether it was a good idea to remove from the conversion table the cases in which the symbols in Koch-Grünberg's notation and in the IPA are the same. I think that could solve what caused the problem the first time we tried to run the code.
I ran the code again on a new computer and got the same "No module named 'lexibank_kochgruenbergtukanoan'" error message.
The name is now "kochtukanoan". I'd ask you to create a fresh virtual environment, as before, and pip install -e . to install the new kochtukanoan. Then run the command I gave you with lexibank_kochtukanoan.py.
I did a fresh install of everything on this new computer and ran the new command. It seems to be working despite the error. Here are the commands I ran and the first lines of the output. I'll try to fix the graphemes.
WARNING Error importing kochgruenbergtukanoan: No module named 'lexibank_kochgruenbergtukanoan'
INFO running check_profile on kochtukanoan ...
WARNING:segments.profile:line 30:duplicate grapheme in profile: ú̠
WARNING:segments.profile:line 45:duplicate grapheme in profile: ã̄
WARNING:segments.profile:line 48:duplicate grapheme in profile: ã̄́
WARNING:segments.profile:line 51:duplicate grapheme in profile: ẽ̄
WARNING:segments.profile:line 54:duplicate grapheme in profile: ẽ̄́
WARNING:segments.profile:line 57:duplicate grapheme in profile: ĩ̄
WARNING:segments.profile:line 60:duplicate grapheme in profile: ĩ̄́
WARNING:segments.profile:line 63:duplicate grapheme in profile: ȭ
WARNING:segments.profile:line 66:duplicate grapheme in profile: ȭ́
WARNING:segments.profile:line 69:duplicate grapheme in profile: ũ̄
WARNING:segments.profile:line 72:duplicate grapheme in profile: ũ̄́
WARNING:segments.profile:line 92:duplicate grapheme in profile: y
Can you please also pip uninstall lexibank_kochgruenbergtukanoan? It seems that this is the source of the "error", which is not an error but a warning: you appear to have installed an old version in this same virtual environment.
It worked.
Nice, you could now look into the duplicates in the profile and delete the respective rows (see warnings). And then also do
cldfbench lexibank.check_profile lexibank_kochtukanoan.py
This will give you more information (as I pasted above).
I deleted all the duplicates and ran the command again. This was the output:
Traceback (most recent call last):
File "/home/myrho/.local/bin/cldfbench", line 8, in <module>
sys.exit(main())
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/__main__.py", line 81, in main
return args.main(args) or 0
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 36, in run
with_dataset(args, check_profile)
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/cli_util.py", line 153, in with_dataset
res = func(*arg, args)
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 51, in check_profile
sound = args.clts.api.bipa[tk]
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 23, in bipa
return self.transcriptionsystem('bipa')
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 80, in transcriptionsystem
if key in self.transcriptionsystem_dict:
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in transcriptionsystem_dict
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in <dictcomp>
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 69, in iter_transcriptionsystem
yield TranscriptionSystem(
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/transcriptionsystem.py", line 77, in __init__
raise ValueError(
ValueError: Unrecognized features (duration: ultra-long, line 129))
What does it mean?
Then I ran cldfbench lexibank.makecldf lexibank_kochtukanoan.py
and got the following output:
INFO Summary for dataset /home/myrho/Documents/Projeto/koch_code/kochtukanoan/cldf/cldf-metadata.json
- **Varieties:** 4
- **Concepts:** 805
- **Lexemes:** 2,303
- **Sources:** 21
- **Synonymy:** 1.17
INFO file written: /home/myrho/Documents/Projeto/koch_code/kochtukanoan/TRANSCRIPTION.md
INFO file written: /home/myrho/Documents/Projeto/koch_code/kochtukanoan/cldf/lingpy-rcParams.json
INFO ... done kochtukanoan [49.0 secs]
What should I do now?
I thought it was strange to have the old name 'kochgruenbergtukanoan' installed on a computer where I had done a clean install of everything. So I looked into the pip documentation and saw that when I run pip install -e . it uses the file setup.py. That file still has the old name in it. I think that is what caused the warning I had before.
After running pip uninstall kochgruenbergtukanoan
, I ran pip install -e .
again to test that hypothesis. The warning started appearing again.
Then I edited the setup.py file in my computer to have the new name 'kochtukanoan' and ran both commands again. The warning message stopped appearing.
I would like to confirm if that makes sense.
Please change the name in setup.py, I forgot to do that. Nice catch!
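For reference, the relevant bit is the name argument in setup.py; a sketch with all other fields omitted (not the dataset's actual file):

```python
from setuptools import setup

setup(
    name="kochtukanoan",  # was "kochgruenbergtukanoan"; pip (un)install uses this name
    py_modules=["lexibank_kochtukanoan"],
)
```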
To debug, please be more specific: which command caused the error message with "ultra-long"? And please push the orthography profile which you modified, and also paste the output of the check_profile command I asked you to run. In general: when pasting errors, please always paste the command that you ran; this avoids me having to ask for it ;)
It was cldfbench lexibank.check_profile lexibank_kochtukanoan.py that caused the "ultra-long" error message. I am going to paste it again here.
2022-01-11 23:05:14,812 [INFO] ... successfully created the scorer.
2022-01-11 23:05:14,812 [INFO] Model <jaeger> was compiled successfully.
INFO running check_profile on kochtukanoan ...
Traceback (most recent call last):
File "/home/myrho/.local/bin/cldfbench", line 8, in <module>
sys.exit(main())
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/__main__.py", line 81, in main
return args.main(args) or 0
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 36, in run
with_dataset(args, check_profile)
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/cli_util.py", line 153, in with_dataset
res = func(*arg, args)
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 51, in check_profile
sound = args.clts.api.bipa[tk]
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 23, in bipa
return self.transcriptionsystem('bipa')
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 80, in transcriptionsystem
if key in self.transcriptionsystem_dict:
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in transcriptionsystem_dict
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in <dictcomp>
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 69, in iter_transcriptionsystem
yield TranscriptionSystem(
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/transcriptionsystem.py", line 77, in __init__
raise ValueError(
ValueError: Unrecognized features (duration: ultra-long, line 129))
And if you specify the clts version:
cldfbench lexibank.check_profile lexibank_kochtukanoan.py --clts-version=v1.4
--clts-version=v2.1.0
No errors on my side, just pushed code, all looks fine now. Close this if the command works on your side.
I added a first orthography profile to the dataset. This needs to be refined. You can also check my blog post (section on orthography profiles).
I'd ask you to refine it and let me know once it's done, or if there are questions.