Closed LinguList closed 2 years ago
You find it in etc/orthography.tsv.
Thanks for this. The parser returned several clusters of graphemes, such as <ayú>. For the initial stage of the orthography profile, should we simply convert every grapheme to its IPA representation, such as ayú > ajú? Ideally we don't want them to be treated as clusters of vowels/diphthongs when tokenized, so ultimately we will need something like
My question is precisely whether we should first do step 2 in order to arrive at step 3 as given above.
No. The problem is that we cannot guarantee that 'y' is always 'j'. If this WERE the case — and you have to be really sure — we can do this via our lexibank code.
If you want me to do this, you can even provide an additional list of rough replacements (but they should be all unique).
Where should we provide the conversion table from Koch-Grünberg's graphemes to IPA? In the raw folder or in etc? (We have it almost done.)
etc/orthography.tsv. This is where you find the version that I produced for you before.
Thanks, but we agreed we would provide you with grapheme-to-phoneme maps that would make the orthography.tsv simpler, do you recall? We have that file now and wanted to share it with you so you can run the script again...
Ah, yes, such a long time ago. You can paste this into raw/preprocess-sounds.tsv or similar!
Great. We'll let you know when @MottaAM has added it.
I just finished converting the symbols from Koch-Grünberg's notation to IPA. I also made a sheet with the description he gives for each symbol. There are two new files in the raw folder. The one ready to run the script on is https://github.com/lexibank/kochgruenbergtukanoan/blob/main/raw/preprocess-sounds.tsv. The other has additional metadata. You may run the script now. But if I wanted to run the conversion script myself, how would I do it?
@MottaAM, the replacements contain recursion, which is of course not going to work: you have a -> aː and also a -> a. These are cases that cannot be handled in this form, since the source occurs in the target, or a string is replaced by itself:
for source, target in replacements:
string = string.replace(source, target)
So you need to thoroughly clean the entries you provided and make sure that the source form really does not occur in the target form, otherwise it will be replaced again (!). I suggest converting to an intermediate format for now and doing the real work with the orthography profile, as we can clearly see that this form does not work as easily as was thought.
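To make the failure mode concrete, here is a minimal sketch (with made-up rules, not the actual preprocess-sounds.tsv content) contrasting the sequential replacement loop above with a single-pass, longest-first replacement that avoids the recursion:

```python
import re

# Hypothetical rules illustrating the problem: the source "e" occurs in
# the target "eː", and "é" (e + combining acute U+0301) contains "e".
rules = [("e", "e\u02d0"), ("e\u0301", "e\u0301\u02d0")]

def naive_replace(string, rules):
    # The problematic sequential approach: each replacement re-scans
    # the output of the previous one.
    for source, target in rules:
        string = string.replace(source, target)
    return string

def single_pass_replace(string, rules):
    # Safer alternative: match all sources in a single pass, longest
    # source first, so no target is ever re-scanned by a later rule.
    mapping = dict(rules)
    pattern = re.compile("|".join(
        re.escape(s) for s in sorted(mapping, key=len, reverse=True)))
    return pattern.sub(lambda m: mapping[m.group(0)], string)
```

With these rules, naive_replace strands the accent after the length mark (é becomes eː́, so the é rule never fires), while single_pass_replace yields the intended éː.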
To replicate, pull the repository:
$ git pull
Then run (as I mentioned before)
$ cldfbench lexibank.makecldf lexibank_kochgrunbergtukanoan.py
Then, to check the errors, please run:
$ cldfbench lexibank.check_profile lexibank_kochgrunbergtukanoan.py
I paste the errors below (they are all created by the recursive replacement):
Grapheme | Grapheme-UC | BIPA | BIPA-UC | Modified | Segments | Graphemes | Count |
---|---|---|---|---|---|---|---|
ẽ́ː | U+0065 U+0303 U+0301 U+02d0 | ẽ́ː | U+0065 U+0303 U+0301 U+02d0 | k <<ɔ>> m ẽ́ː p e <<ɾ>> <<ɔ>> | kɔmeẽ́peɾɔ | 1 | |
ae | U+0061 U+0065 | ae | U+0061 U+0065 | i <<ɔ>> t a <<ħ>> ae | iɔtaħae | 1 | |
ei̯ | U+0065 U+0069 U+032f | ei̯ | U+0065 U+0069 U+032f | k <<ɔ>> <<̄>> <<́>> <<ɾ>> <<ɔ>> k <<ɔ>> <<ɺ>> ei̯ k ia | kɔ̄́ɾɔkɔɺei̯kia | 1 | |
uiː | U+0075 U+0069 U+02d0 | uiː | U+0075 U+0069 U+02d0 | <<j>> … | juxkɨtsiɾuii | 1 | |
ĩẽ́ | U+0069 U+0303 U+0065 U+0303 U+0301 | ĩẽ́ | U+0069 U+0303 U+0065 U+0303 U+0301 | m a x ĩẽ́ | maxĩẽ́ | 1 | |
ẽã | U+0065 U+0303 U+0061 U+0303 | ẽã | U+0065 U+0303 U+0061 U+0303 | <<j>> … | jãĩmẽã | 1 | |
ẽ́ã | U+0065 U+0303 U+0301 U+0061 U+0303 | ẽ́ã | U+0065 U+0303 U+0301 U+0061 U+0303 | m ẽ́ã | mẽ́ã | 1 | |
ẽ́ẽ | U+0065 U+0303 U+0301 U+0065 U+0303 | ẽ́ẽ | U+0065 U+0303 U+0301 U+0065 U+0303 | <<j>> … | jiːmasaŋẽ́ẽ | 1 | |
ĩ́ːã | U+0069 U+0303 U+0301 U+02d0 U+0061 U+0303 | ĩ́ːã | U+0069 U+0303 U+0301 U+02d0 U+0061 U+0303 | ã m ĩ́ːã | ãmĩĩ́ã | 1 | |
ãẽ́ | U+0061 U+0303 U+0065 U+0303 U+0301 | ãẽ́ | U+0061 U+0303 U+0065 U+0303 U+0301 | <<ɨ>> m ãẽ́ g e <<ɾ>> <<ɔ>> | ɨmãẽ́geɾɔ | 1 | |
á | U+0061 U+0301 | á | U+0061 U+0301 | g <<ɔ>> xh á <<ɺ>> i | gɔxháɺi | 1 | |
iaː | U+0069 U+0061 U+02d0 | iaː | U+0069 U+0061 U+02d0 | h <<ɔ>> <<̄>> <<́>> k iaː | hɔ̄́kiaa | 1 | |
ẽ́ | U+0065 U+0303 U+0301 | ẽ́ | U+0065 U+0303 U+0301 | k <<ɔ>> m ẽ́ p e <<ɾ>> <<ɔ>> | kɔmẽ́peɾɔ | 2 | |
ie̯ | U+0069 U+0065 U+032f | ie̯ | U+0069 U+0065 U+032f | s iua <<ħ>> i k ie̯ | siuaħikie̯ | 2 | |
ẽá | U+0065 U+0303 U+0061 U+0301 | ẽá | U+0065 U+0303 U+0061 U+0301 | m ẽá p a t <<ɔ>> <<ɺ>> e | mẽápatɔɺe | 2 | |
ĩ́ | U+0069 U+0303 U+0301 | ĩ́ | U+0069 U+0303 U+0301 | ua x p ĩ́ <<ɔ>> <<̃>> <<̄>> p <<ɛ>> | uaxpĩ́ɔ̃̄pɛ | 2 | |
aːi̯ | U+0061 U+02d0 U+0069 U+032f | aːi̯ | U+0061 U+02d0 U+0069 U+032f | m a <<ɺ>> aːi̯ d <<j>> … | maɺaai̯dju | 2 | |
ũẽ́ | U+0075 U+0303 U+0065 U+0303 U+0301 | ũẽ́ | U+0075 U+0303 U+0065 U+0303 U+0301 | p ũẽ́ | pũẽ́ | 2 | |
ei | U+0065 U+0069 | ei | U+0065 U+0069 | ts ei | tsei | 2 | |
iu̯ | U+0069 U+0075 U+032f | iu̯ | U+0069 U+0075 U+032f | h i <<ː>> <<́>> n iu̯ | hiː́niu̯ | 2 | |
ue̯ | U+0075 U+0065 U+032f | ue̯ | U+0075 U+0065 U+032f | w iː <<ː>> <<́>> ue̯ <<ç>> k a | wiiː́ue̯çka | 2 | |
ĩe | U+0069 U+0303 U+0065 | ĩe | U+0069 U+0303 U+0065 | a <<ː>> b <<ɛ>> _ ts i <<ː>> n ĩe | aːbɛ_tsiːnĩe | 3 | |
ĩ́ã | U+0069 U+0303 U+0301 U+0061 U+0303 | ĩ́ã | U+0069 U+0303 U+0301 U+0061 U+0303 | <<j>> … | jamĩ́ã | 4 | |
ãẽ | U+0061 U+0303 U+0065 U+0303 | ãẽ | U+0061 U+0303 U+0065 U+0303 | k <<ɔ>> a m ãẽ | kɔamãẽ | 4 | |
iːa | U+0069 U+02d0 U+0061 | iːa | U+0069 U+02d0 U+0061 | <<j>> … | jatau̯iia | 5 | |
ui̯ | U+0075 U+0069 U+032f | ui̯ | U+0075 U+0069 U+032f | ui̯ p <<ɔ>> a | ui̯pɔa | 5 | |
i̯ | U+0069 U+032f | i̯ | U+0069 U+032f | <<j>> … | jeɨtsiɛi̯ | 7 | |
ie | U+0069 U+0065 | ie | U+0069 U+0065 | d ie <<ː>> <<́>> <<j>> … | dieː́jpɔ̄ɺeɾu | 11 | |
au̯ | U+0061 U+0075 U+032f | au̯ | U+0061 U+0075 U+032f | d i p au̯ i <<j>> … | dipau̯ija | 12 | |
ai̯ | U+0061 U+0069 U+032f | ai̯ | U+0061 U+0069 U+032f | ua h <<ɔ>> a g a m ai̯ | uahɔagamai̯ | 15 | |
ea | U+0065 U+0061 | ea | U+0065 U+0061 | i h ia <<j>> … | ihiajeaː́ika | 17 | |
ue | U+0075 U+0065 | ue | U+0075 U+0065 | ue <<ː>> <<ɾ>> e <<ɾ>> i <<ɾ>> u | ueːɾeɾiɾu | 18 | |
iu | U+0069 U+0075 | iu | U+0069 U+0075 | ts iu p u <<ː>> <<́>> <<ɺ>> i <<ɾ>> u | tsiupuː́ɺiɾu | 22 | |
au | U+0061 U+0075 | au | U+0061 U+0075 | <<j>> … | jauɨ | 44 | |
ai | U+0061 U+0069 | ai | U+0061 U+0069 | h <<ɔ>> a t a n i k e <<ɾ>> <<ɔ>> k ai k a k a | hɔatanikeɾɔkaikaka | 49 | |
ui | U+0075 U+0069 | ui | U+0075 U+0069 | k ui <<ː>> <<́>> <<ɾ>> i | kuiː́ɾi | 54 | |
ua | U+0075 U+0061 | ua | U+0075 U+0061 | ua <<j>> … | uajupɔna | 164 | |
ia | U+0069 U+0061 | ia | U+0069 U+0061 | s ia m <<ɛ>> <<ɺ>> a k <<ɔ>> | siamɛɺakɔ | 237 |
Grapheme | Diacritics | Unicode | Segments | Graphemes | Count |
---|---|---|---|---|---|
ai̯á | ◌ai̯á | U+0061 U+0069 U+032f U+0061 U+0301 | d i s <<ç>> s i p u <<ɺ>> i ts ai̯á n i d e <<ç>> k a | disçsipuɺitsai̯ánideçka | 1 |
uiua | ◌uiua | U+0075 U+0069 U+0075 U+0061 | uiua h <<ɔ>> a | uiuahɔa | 1 |
š | ◌š | U+0073 U+030c | h u <<ʔ>> t š ia | huʔtšia | 1 |
uau | ◌uau | U+0075 U+0061 U+0075 | uau | uau | 1 |
iau̯i | ◌iau̯i | U+0069 U+0061 U+0075 U+032f U+0069 | d iau̯i k <<ɨ>> | diau̯ikɨ | 1 |
uaua | ◌uaua | U+0075 U+0061 U+0075 U+0061 | <<ɔ>> k <<ɔ>> p u k <<ɺ>> uaua | ɔkɔpukɺuaua | 1 |
uaie | ◌uaie | U+0075 U+0061 U+0069 U+0065 | uaie | uaie | 1 |
uaiua | ◌uaiua | U+0075 U+0061 U+0069 U+0075 U+0061 | uaiua k a | uaiuaka | 1 |
au̯i | ◌au̯i | U+0061 U+0075 U+032f U+0069 | au̯i t i <<ɾ>> <<ɨ>> | au̯itiɾɨ | 1 |
p̌ | ◌p̌ | U+0070 U+030c | b i t ai̯ g <<ɔ>> p̌ <<ɛ>> k a | bitai̯gɔp̌ɛka | 2 |
nh | ◌nh | U+006e U+0068 | p <<ɛ>> a nh ua | pɛanhua | 2 |
eau | ◌eau | U+0065 U+0061 U+0075 | k eau d <<j>> … | keaudjɨ | 2
mh | ◌mh | U+006d U+0068 | k a <<ː>> mh a <<ɺ>> u | kaːmhaɺu | 4 |
ḳ | ◌ḳ | U+006b U+0323 | <<j>> … | jɔḳɔɾɔ | 4
aue̯ | ◌aue̯ | U+0061 U+0075 U+0065 U+032f | p <<ɛ>> k aue̯ | pɛkaue̯ | 4 |
iuiia | ◌iuiia | U+0069 U+0075 U+0069 U+0069 U+0061 | a p i k a <<ɺ>> i k a _ t <<ɛ>> m u <<j>> … | apikaɺika_tɛmujiː́ɺiuiia | 4
iai | ◌iai | U+0069 U+0061 U+0069 | s i k iai <<ː>> <<́>> <<ɾ>> <<ɨ>> | sikiaiː́ɾɨ | 5 |
iai̯ | ◌iai̯ | U+0069 U+0061 U+0069 U+032f | <<j>> … | jamigakiai̯dja | 5
uia | ◌uia | U+0075 U+0069 U+0061 | m uia | muia | 5 |
aia | ◌aia | U+0061 U+0069 U+0061 | aia n a m a t i | aianamati | 7 |
uai̯ | ◌uai̯ | U+0075 U+0061 U+0069 U+032f | uai̯ <<ɾ>> u | uai̯ɾu | 7 |
au̯a | ◌au̯a | U+0061 U+0075 U+032f U+0061 | <<j>> … | jau̯aːɺaka | 8
xh | ◌xh | U+0078 U+0068 | ts i <<ː>> u n d u xh a <<ː>> <<́>> k <<ɔ>> | tsiːunduxhaː́kɔ | 11 |
iua | ◌iua | U+0069 U+0075 U+0061 | s iua <<ħ>> i k ie̯ | siuaħikie̯ | 14 |
aua | ◌aua | U+0061 U+0075 U+0061 | p i t aua <<ħ>> <<ɔ>> a | pitauaħɔa | 15 |
xs | ◌xs | U+0078 U+0073 | g <<ɔ>> xs <<ɔ>> | gɔxsɔ | 21 |
uai | ◌uai | U+0075 U+0061 U+0069 | uai p i k <<ɔ>> a | uaipikɔa | 24 |
Grapheme | Diacritics | Unicode | Segments | Graphemes | Count |
---|---|---|---|---|---|
̥ | ◌̥ | U+0325 | b e <<ː>> <<̥>> g <<ɨ>> | beː̥gɨ | 2 |
ɑ | ◌ɑ | U+0251 | k <<ɑ>> u | kɑu | 5 |
̃ | ◌̃ | U+0303 | <<ɔ>> <<̃>> ã d iː <<ɺ>> <<ɨ>> | ɔ̃ãdiiɺɨ | 12 |
ŋ | ◌ŋ | U+014b | i <<ː>> <<́>> <<ŋ>> i n u | iː́ŋinu | 18 |
̯ | ◌̯ | U+032f | s i g <<ɔ>> i <<̯>> t a g ia <<ħ>> <<ɔ>> i <<ɾ>> i | sigɔi̯tagiaħɔiɾi | 24 |
ʊ | ◌ʊ | U+028a | s i <<ɾ>> i s <<ɛ>> p <<ʊ>> | siɾisɛpʊ | 34 |
ʔ | ◌ʔ | U+0294 | <<j>> … | jaʔkɔa | 51
̠ | ◌̠ | U+0320 | <<ɔ>> <<̄>> <<́>> m <<ɛ>> t e <<ː>> <<̠>> <<́>> n i | ɔ̄́mɛteː̠́ni | 67 |
ħ | ◌ħ | U+0127 | s i k <<ɔ>> <<ħ>> i <<ɾ>> i | sikɔħiɾi | 78 |
ç | ◌ç | U+0063 U+0327 | d i <<ç>> s i <<ː>> <<́>> <<ɾ>> <<ɔ>> | diçsiː́ɾɔ | 110 |
̄ | ◌̄ | U+0304 | d <<ɔ>> <<̄>> <<́>> <<ɺ>> <<ɔ>> | dɔ̄́ɺɔ | 199 |
ɾ | ◌ɾ | U+027e | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 368 |
ɨ | ◌ɨ | U+0268 | e <<ː>> <<́>> g <<ɨ>> <<ɺ>> e | eː́gɨɺe | 446 |
ɛ | ◌ɛ | U+025b | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 535 |
j | ◌j | U+006a | <<j>> … | jeː́ɺɨ | 537
́ | ◌́ | U+0301 | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 581 |
ɺ | ◌ɺ | U+027a | d <<ɔ>> <<̄>> <<́>> <<ɺ>> <<ɔ>> | dɔ̄́ɺɔ | 600 |
ː | ◌ː | U+02d0 | n <<ɛ>> <<ː>> <<́>> <<ɾ>> i <<ɾ>> u | nɛː́ɾiɾu | 834 |
ɔ | ◌ɔ | U+0254 | d <<ɔ>> <<̄>> <<́>> <<ɺ>> <<ɔ>> | dɔ̄́ɺɔ | 863 |
To explain what happens here: you convert e to eː, but you also want to convert é to eː; since the replacement of e applies first, the accent is shifted, and é can no longer be found, etc.
That's why we carefully distinguish replacements (where source != target, and the source does not occur in the target string) from orthography profiles.
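To illustrate the distinction: an orthography profile is applied by tokenizing the input once with longest-match lookup, so no output is ever re-scanned. A toy sketch of that idea (not the actual segments/pylexibank implementation; the profile entries are made up):

```python
def tokenize(form, profile):
    # Greedy longest-match tokenization against an orthography profile
    # (a dict mapping graphemes to IPA). Characters not covered by the
    # profile are wrapped in <<...>>, mimicking the check_profile output.
    tokens, i = [], 0
    longest = max(len(g) for g in profile)
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk])
                i += size
                break
        else:  # no grapheme matched at position i
            tokens.append(f"<<{form[i]}>>")
            i += 1
    return tokens
```

Because é (e + U+0301) is matched as a whole grapheme before the shorter e, the accent-shift problem of sequential replacement cannot occur here.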
@thiagochacon @MottaAM, did you see my note?
Yes. @MottaAM is working on creating a new set of tables to iterate the process in a non-conflicting way. He is also trying to code it as you instructed. @MottaAM, could you share the results you got?
@LinguList: Yes, I did see the note. Sorry for the delay. I started working on some transcriptions from Siriano, so I didn't respond to the note immediately. I'm still trying to run the code by myself, but I keep getting the error message below. I'm not sure if I'm creating the virtual environment correctly either.
You need to run "cldfbench catconfig" first, to configure the locations of Concepticon, CLTS, and Glottolog. I discuss this in parts, with additional links, here: https://calc.hypotheses.org/2954
Thank you for helping! I managed to run the code. I'll start testing different sets of tables right away. I've also noticed that there is a typo in the name of the repository. It's written 'kochgruEnbergtukanoan' instead of 'kochgrunbergtukanoan'
The output is big. How can I send it to you just to confirm that I did everything right? I didn't change the conversion table, so it will have the same problem as before
What do you mean by "output": what the terminal says? As for the typo: we write the umlaut ü as ue in German, and in English as well, so I do not consider it a typo.
That's interesting. I didn't know about the 'ue'. When copying the instructions, the words didn't match, so I thought it was a typo. This won't be a problem now. Yes, I mean what the terminal says.
Is there an error? I'd ask you to just paste the last 50+ lines of terminal output here.
here it is
Hm, do you have difficulties copying text from your console?
But thanks: the error is pretty clear:
invalid glottocode jupu1235
So you should first check the glottocode.
Missing sources is something we can ignore for now.
I just tried copying directly from the console and it worked. I'll copy directly from now on.
I looked for the glottocode of the Yupua language and it's jupu1235. That's odd.
If you look closely at the glottocode, it is "jupu1235 " (note the final space), so you need to delete the space in the file etc/languages.csv.
I fixed it and ran the code again. I'll paste the new output below.
WARNING forms.csv:2291:Source missing source key: KochGrünberg2014
WARNING forms.csv:2292:Source missing source key: KochGrünberg2014
WARNING forms.csv:2293:Source missing source key: KochGrünberg2014
WARNING forms.csv:2294:Source missing source key: KochGrünberg2014
WARNING forms.csv:2295:Source missing source key: KochGrünberg2014
WARNING forms.csv:2296:Source missing source key: KochGrünberg2014
WARNING forms.csv:2297:Source missing source key: KochGrünberg2014
WARNING forms.csv:2298:Source missing source key: KochGrünberg2014
WARNING forms.csv:2299:Source missing source key: KochGrünberg2014
WARNING forms.csv:2300:Source missing source key: KochGrünberg2014
WARNING forms.csv:2301:Source missing source key: KochGrünberg2014
WARNING forms.csv:2302:Source missing source key: KochGrünberg2014
WARNING forms.csv:2303:Source missing source key: KochGrünberg2014
WARNING forms.csv:2304:Source missing source key: KochGrünberg2014
Traceback (most recent call last):
File "/home/myrho/python-virtual-environments/env/bin/cldfbench", line 8, in <module>
sys.exit(main())
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/cldfbench/__main__.py", line 81, in main
return args.main(args) or 0
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/pylexibank/commands/makecldf.py", line 24, in run
with_dataset(args, 'makecldf', dataset=dataset)
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/cldfbench/cli_util.py", line 153, in with_dataset
res = func(*arg, args)
File "/home/myrho/python-virtual-environments/env/lib/python3.8/site-packages/pylexibank/dataset.py", line 231, in _cmd_makecldf
assert self.cldf_reader().validate(args.log)
AssertionError
I'll fix now and then let you know, give me 5 minutes.
Please git-pull what I just modified, the code should run now without problems.
Apparently it generated a file, but there is one error. I think it worked.
['ua', 'x', 'p', 'i', '<<ː>>', '<<́>>', 'k', 'i', '<<ɔ>>', '<<ɺ>>', 'i']
['ts', 'iu', '<<ː>>', '<<́>>', 'p', 'u', '<<ɺ>>', 'i']
['t', 'a', '<<ː>>', '<<ɺ>>', 'au', '<<ɔ>>', '<<̯>>', '<<ɺ>>', 'i']
['t', '<<ɨ>>', 'x', 't', 'aː', 'i', 't', '<<ç>>', '<<ɨ>>']
['u', 'n', 'ui']
['<<j>>', 'e', '<<ç>>', 's', 'a', '<<ɺ>>', 'i', 'p', '<<ɨ>>', 'a', '<<ɺ>>', 'i']
['i', 'n', 'a']
['ts', 'a', 'h', 'a']
['<<ɺ>>', 'a', 'h', '<<ɔ>>']
['t', 'ea']
['d', 'a', 'h', '<<ɔ>>']
['b', 'a', '<<ɺ>>', 'a', '<<ɺ>>', 'i', 'k', 'e', '<<ː>>', '<<̠>>', '<<́>>']
INFO file written: /home/myrho/projeto/kochgruenbergtukanoan/cldf/.transcription-report.json
INFO Summary for dataset /home/myrho/projeto/kochgruenbergtukanoan/cldf/cldf-metadata.json
- **Varieties:** 4
- **Concepts:** 805
- **Lexemes:** 2,303
- **Sources:** 21
- **Synonymy:** 1.17
INFO file written: /home/myrho/projeto/kochgruenbergtukanoan/TRANSCRIPTION.md
INFO file written: /home/myrho/projeto/kochgruenbergtukanoan/cldf/lingpy-rcParams.json
INFO ... done kochtukanoan [54.2 secs]
WARNING Error importing kochgruenbergtukanoan: No module named 'lexibank_kochgruenbergtukanoan'
Yes, you now have to do:
pip uninstall kochgruenbergtukanoan
pip install -e .
I changed the names, so we have "kochtukanoan" now, as the old name was too long.
I now decided to fix the orthography profile again, so please git-pull again. I just used YOUR replacements and put them in etc/orthography.tsv instead, adding some new ones that were missing. I'd ask you to look at the cases with a ? in the IPA column and add the correct IPA accordingly. This way, we can see how well the data is converted.
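For orientation, the entries in question look roughly like this (illustrative rows, not the actual file contents): a Grapheme column with Koch-Grünberg's symbols and an IPA column, with ? marking the unresolved cases to be filled in:

```tsv
Grapheme	IPA
y	j
ai	ai̯
ã̄	?
```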
Thank you. I'm going to look into it after the New Year. Happy New Year to you!
Great to see things are progressing. On Monday I will be back in my office and can help with whatever is needed. Happy New Year to the both of you!
I fixed the cases with the '?'. I've been wondering whether it was a good idea to remove from the conversion table the cases in which the symbols in Koch-Grünberg's notation and in the IPA are the same. I think that could solve what caused the problem the first time we tried to run the code.
I ran the code again on a new computer and got the same "No module named 'lexibank_kochgruenbergtukanoan'" error message.
The name is now "kochtukanoan". I'd ask you to create a fresh virtual environment, as before, and pip install -e . to install the new kochtukanoan. Then run the command I gave you with lexibank_kochtukanoan.py.
I did a fresh install of everything on this new computer and ran the new command. It seems to be working despite the error. Here are the commands I ran and the first lines of the output. I'll try to fix the graphemes.
WARNING Error importing kochgruenbergtukanoan: No module named 'lexibank_kochgruenbergtukanoan'
INFO running check_profile on kochtukanoan ...
WARNING:segments.profile:line 30:duplicate grapheme in profile: ú̠
WARNING:segments.profile:line 45:duplicate grapheme in profile: ã̄
WARNING:segments.profile:line 48:duplicate grapheme in profile: ã̄́
WARNING:segments.profile:line 51:duplicate grapheme in profile: ẽ̄
WARNING:segments.profile:line 54:duplicate grapheme in profile: ẽ̄́
WARNING:segments.profile:line 57:duplicate grapheme in profile: ĩ̄
WARNING:segments.profile:line 60:duplicate grapheme in profile: ĩ̄́
WARNING:segments.profile:line 63:duplicate grapheme in profile: ȭ
WARNING:segments.profile:line 66:duplicate grapheme in profile: ȭ́
WARNING:segments.profile:line 69:duplicate grapheme in profile: ũ̄
WARNING:segments.profile:line 72:duplicate grapheme in profile: ũ̄́
WARNING:segments.profile:line 92:duplicate grapheme in profile: y
Can you please also pip uninstall lexibank_kochgruenbergtukanoan? It seems that this is the source of the "error", which is not an error but a warning: you appear to have installed an old version in this same virtual environment.
It worked.
Nice, you could now look into the duplicates in the profile and delete the respective rows (see warnings). And then also do
cldfbench lexibank.check_profile lexibank_kochtukanoan.py
This will give you more information (as I pasted above).
I deleted all the duplicates and ran the command again. This was the output:
Traceback (most recent call last):
File "/home/myrho/.local/bin/cldfbench", line 8, in <module>
sys.exit(main())
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/__main__.py", line 81, in main
return args.main(args) or 0
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 36, in run
with_dataset(args, check_profile)
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/cli_util.py", line 153, in with_dataset
res = func(*arg, args)
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 51, in check_profile
sound = args.clts.api.bipa[tk]
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 23, in bipa
return self.transcriptionsystem('bipa')
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 80, in transcriptionsystem
if key in self.transcriptionsystem_dict:
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in transcriptionsystem_dict
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in <dictcomp>
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 69, in iter_transcriptionsystem
yield TranscriptionSystem(
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/transcriptionsystem.py", line 77, in __init__
raise ValueError(
ValueError: Unrecognized features (duration: ultra-long, line 129))
What does it mean?
Then I ran cldfbench lexibank.makecldf lexibank_kochtukanoan.py
and got the following output:
INFO Summary for dataset /home/myrho/Documents/Projeto/koch_code/kochtukanoan/cldf/cldf-metadata.json
- **Varieties:** 4
- **Concepts:** 805
- **Lexemes:** 2,303
- **Sources:** 21
- **Synonymy:** 1.17
INFO file written: /home/myrho/Documents/Projeto/koch_code/kochtukanoan/TRANSCRIPTION.md
INFO file written: /home/myrho/Documents/Projeto/koch_code/kochtukanoan/cldf/lingpy-rcParams.json
INFO ... done kochtukanoan [49.0 secs]
What should I do now?
I thought it was strange to have the old name 'kochgruenbergtukanoan' installed on a computer where I had done a clean install of everything. So I looked into the pip documentation and saw that when I run pip install -e . it uses the file setup.py. That file still has the old name in it. I think that is what caused the warning I had before.
After running pip uninstall kochgruenbergtukanoan
, I ran pip install -e .
again to test that hypothesis. The warning started appearing again.
Then I edited the setup.py file in my computer to have the new name 'kochtukanoan' and ran both commands again. The warning message stopped appearing.
I would like to confirm if that makes sense.
Please change the name in setup.py, I forgot to do that. Nice catch!
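For reference, the relevant bit is the name argument in setup.py; a sketch with all other fields omitted (not the dataset's actual file):

```python
from setuptools import setup

setup(
    name="kochtukanoan",  # was "kochgruenbergtukanoan"; pip (un)install uses this name
    py_modules=["lexibank_kochtukanoan"],
)
```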
To debug, please be more specific: which command caused the error message with "ultra-long"? And please push the orthography profile which you modified, and also paste the output of the check_profile command I asked you to run. In general: when pasting errors, please always paste the command that you ran; this avoids me having to ask for it ;)
It was cldfbench lexibank.check_profile lexibank_kochtukanoan.py that caused the "ultra-long" error message. I am going to paste it again here.
2022-01-11 23:05:14,812 [INFO] ... successfully created the scorer.
2022-01-11 23:05:14,812 [INFO] Model <jaeger> was compiled successfully.
INFO running check_profile on kochtukanoan ...
Traceback (most recent call last):
File "/home/myrho/.local/bin/cldfbench", line 8, in <module>
sys.exit(main())
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/__main__.py", line 81, in main
return args.main(args) or 0
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 36, in run
with_dataset(args, check_profile)
File "/home/myrho/.local/lib/python3.10/site-packages/cldfbench/cli_util.py", line 153, in with_dataset
res = func(*arg, args)
File "/home/myrho/.local/lib/python3.10/site-packages/pylexibank/commands/check_profile.py", line 51, in check_profile
sound = args.clts.api.bipa[tk]
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 23, in bipa
return self.transcriptionsystem('bipa')
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 80, in transcriptionsystem
if key in self.transcriptionsystem_dict:
File "/home/myrho/.local/lib/python3.10/site-packages/clldutils/misc.py", line 197, in __get__
result = instance.__dict__[self.__name__] = self.fget(instance)
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in transcriptionsystem_dict
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 77, in <dictcomp>
return {ts.id: ts for ts in self.iter_transcriptionsystem()}
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/api.py", line 69, in iter_transcriptionsystem
yield TranscriptionSystem(
File "/home/myrho/.local/lib/python3.10/site-packages/pyclts/transcriptionsystem.py", line 77, in __init__
raise ValueError(
ValueError: Unrecognized features (duration: ultra-long, line 129))
And if you specify the clts version:
cldfbench lexibank.check_profile lexibank_kochtukanoan.py --clts-version=v1.4
--clts-version=v2.1.0
No errors on my side, just pushed code, all looks fine now. Close this if the command works on your side.
I added a first orthography profile to the dataset. This needs to be refined. You can also check my blog post (section on orthography profiles).
I'd ask you to refine it and let me know once it's done, or if there are questions.