check the prosody - Githubissues

LinguList commented 3 years ago

Type	ID	Value	Form	Graphemes	Segments
1	LixianXuechengrGyalrong-1100_third-1	kə- suaŋ	kə- suaŋ	^ k ə - + s u a ŋ $	k ə + + s u a ŋ
1	LixianXuechengrGyalrong-82_voidfecesvt-1	tə- pʦʰi kə-lɛ	tə- pʦʰi kə-lɛ	^ t ə - + p ʦʰ i + k ə - l ɛ $	t ə + + p tsʰ i + k ə + l ɛ
1	LixianXuechengrGyalrong-83_voidfecesvi-1	tə- pʦʰi da kʂɿ	tə- pʦʰi da kʂɿ	^ t ə - + p ʦʰ i + d a + k ʂ ɿ $	t ə + + p tsʰ i + d a + k ʂ z̩
1	MaerkangBolarGyalrongA-1185_already-1	... ɕes44	... ɕes44	^ . . . + ɕ e s 44 $	+ ɕ e s ⁴⁴
1	MaerkangBolarGyalrongA-1188_to-1	-i	-i	^ - i $	+ i
1	MaerkangBolarGyalrongA-1192_in-1	-i	-i	^ - i $	+ i
1	MaerkangBolarGyalrongA-1193_aton-1	-i	-i	^ - i $	+ i
1	MaerkangBolarGyalrongA-1196_of-1	[A] wu-[B]	wu-	^ w u - $	w u +
1	MaerkangBolarGyalrongA-1200_only-1	... zɨ22 me44	... zɨ22 me44	^ . . . + z ɨ 22 + m e 44 $	+ z ɨ ²² + m e ⁴⁴
1	MaerkangBolarGyalrongA-1201_exceptfor-1	... zɨ22 me44	... zɨ22 me44	^ . . . + z ɨ 22 + m e 44 $	+ z ɨ ²² + m e ⁴⁴
1	MaerkangBolarGyalrongA-1202_only-1	... zɨ22 me44	... zɨ22 me44	^ . . . + z ɨ 22 + m e 44 $	+ z ɨ ²² + m e ⁴⁴
1	MaerkangBolarGyalrongA-1222_becomev-1	... ta44 pa22 o44	... ta44 pa22 o44	^ . . . + t a 44 + p a 22 + o 44 $	+ t a ⁴⁴ + p a ²² + o ⁴⁴
1	MaerkangBolarGyalrongA-1224_becomev-1	... ta44 pa22 o44	... ta44 pa22 o44	^ . . . + t a 44 + p a 22 + o 44 $	+ t a ⁴⁴ + p a ²² + o ⁴⁴
1	MaerkangBolarGyalrongB-1188_to-1	-j	-j	^ - j $	+ j
1	MaerkangBolarGyalrongB-1197_also-1	-j	-j	^ - j $	+ j
1	MaerkangCaodengrGyalrong-1201_exceptfor-1	...kǝ44 ma22 …	...kǝ44 ma22 …	^ . . . k ǝ 44 + m a 22 + … $	k ə ⁴⁴ + m a ²² +
1	MaerkangJaphugrGyalrong-1253_goodbye-1	sɤrma je !	sɤrma je !	^ s ɤ r m a + j e + ! $	s ɤ r m a + j e +
1	MaerkangSomanrGyalrong-104_skin-1	tǝ-	tǝ-	^ t ǝ - $	t ə +
1	MaerkangSomanrGyalrong-1081_seventy-1	kə- ʃnəs ʃʧᴇ	kə- ʃnəs ʃʧᴇ	^ k ə - + ʃ n ə s + ʃ ʧ ᴇ $	k ə + + ʃ n ə s + ʃ tʃ ɛ
1	MaerkangSomanrGyalrong-146_flax-1	ta- sa	ta- sa	^ t a - + s a $	t a + + s a
1	MaerkangSomanrGyalrong-148_fur-1	tə-	tə-	^ t ə - $	t ə +
1	MaerkangSomanrGyalrong-149_tannedleather-1	tə-	tə-	^ t ə - $	t ə +
1	MaerkangSomanrGyalrong-203_getboiledvi-1	kə- sʦo	kə- sʦo	^ k ə - + s ʦ o $	k ə + + s ts o
1	MaerkangSomanrGyalrong-212_suckvt-1	ka- mə sʨup	ka- mə sʨup	^ k a - + m ə + s ʨ u p $	k a + + m ə + s tɕ u p
1	MaerkangSomanrGyalrong-284_closevi-1	ka- ʧat	ka- ʧat	^ k a - + ʧ a t $	k a + + tʃ a t
1	MaerkangSomanrGyalrong-3_stubbornpeoplesayhedoesnotlisten-1	ta- ko kǝ- ŋʂɐŋ	ta- ko kǝ- ŋʂɐŋ	^ t a - + k o + k ǝ - + ŋ ʂ ɐ ŋ $	t a + + k o + k ə + + ŋ ʂ ɐ ŋ
1	MaerkangSomanrGyalrong-413_brothersiblings-1	ka- ʃə ktɐ snɐm	ka- ʃə ktɐ snɐm	^ k a - + ʃ ə + k t ɐ + s n ɐ m $	k a + + ʃ ə + k t ɐ + s n ɐ m
1	MaerkangSomanrGyalrong-559_postponev-1	ka- wa skrɐn	ka- wa skrɐn	^ k a - + w a + s k r ɐ n $	k a + + w a + s k r ɐ n
1	MaerkangSomanrGyalrong-614_pullitoutv-1	ka--ldʐi	ka--ldʐi	^ k a - - l dʐ i $	k a + + l ɖʐ i
1	MaerkangSomanrGyalrong-619_chopupvt-1	ka- ra nʦik	ka- ra nʦik	^ k a - + r a + n ʦ i k $	k a + + r a + n ts i k
1	MaerkangSomanrGyalrong-620_shearv-1	ka- ra nʦik	ka- ra nʦik	^ k a - + r a + n ʦ i k $	k a + + r a + n ts i k
1	MaerkangSomanrGyalrong-622_cuthairvt-1	ka- wʐɐr	ka- wʐɐr	^ k a - + w ʐ ɐ r $	k a + + w ʐ ɐ r
1	MaerkangSomanrGyalrong-872_feather-1	ta- rkʰam	ta- rkʰam	^ t a - + r kʰ a m $	t a + + r kʰ a m
1	MaerkangSomanrGyalrong-934_lazy-1	kə- nə paŋ kɐ	kə- nə paŋ kɐ	^ k ə - + n ə + p a ŋ + k ɐ $	k ə + + n ə + p a ŋ + k ɐ

LinguList commented 3 years ago

@tresoldi, may I ask you to address these issues?

$ cldfbench lexibank_check_prosody naganorgyalrongic

tresoldi commented 3 years ago

Sure.

tresoldi commented 3 years ago

I cannot find the lexibank command. Perhaps you mean lexibank.check_phonotactics, from https://github.com/lexibank/pylexibank/tree/prosody ? But I am already fixing the ones you listed.

LinguList commented 3 years ago

please check the PR in pylexibank!

LinguList commented 3 years ago

and please also adjust the author in CONTRIBUTORS: we should now list ourselves as "Other", not as "Author". So you put the original author as Author for zenodo...

tresoldi commented 3 years ago

Ok, I reproduced it with lexibank.check_phonotactics, the one I meant. For the CONTRIBUTORS, easy to do.

I cannot find an easy way to solve the problem of multiple markers, however. A few of them can be solved easily with a profile. Sometimes it involves adding similar extra rules (such as making sure that ta- is mapped to t a +, but ta-$ to t a), but it is totally doable.

Still, most cases arise because the tokenizer always adds a + as a separator. There is an argument for the separator, but I cannot find a way to override it (here). Notice that the splitting happens before the orthographic profile is applied.

An alternative could be to use spaces in the graphemes of the orthographic profile, such as mapping both "ta-" and "ta- " (with a trailing space) to t a + (as the second is longer, it would take precedence). However, we cannot use spaces in the graphemes, also due to the splitting above.

An additional alternative would be to call self.tokenizer({}, form) ourselves, removing multiple subsequent + (and, for that matter, leading and trailing ones). But we don't have an .add_value_with_segments() method (which, personally, I think would be a bad thing to have), meaning that we would need to reproduce the FormSpec output as well and pass all three (value, form, and segments) to .add_form_with_segments().

My suggestion, to keep as much backwards compatibility as possible and leave room for solving other problems, would be to patch pylexibank.LexibankWriter.tokenize() (here), modifying the list of segments that is returned to (a) strip leading and trailing markers and (b) replace multiple subsequent markers with a single one.
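To make (a) and (b) concrete, here is a minimal sketch of the clean-up step on a plain list of segments. The function name `clean_markers` is hypothetical and not part of pylexibank; it only illustrates the post-processing, not how it would be wired into the writer.

```python
def clean_markers(segments, marker="+"):
    """Strip leading/trailing markers and collapse runs of markers into one.

    Hypothetical helper for illustration only; not a pylexibank function.
    """
    cleaned = []
    for seg in segments:
        # Skip a marker at the very start, or one that directly follows
        # another marker (this collapses "+ +" into "+").
        if seg == marker and (not cleaned or cleaned[-1] == marker):
            continue
        cleaned.append(seg)
    # At most one marker can remain at the end; drop it.
    if cleaned and cleaned[-1] == marker:
        cleaned.pop()
    return cleaned


# Segment strings taken from the report above.
print(clean_markers("k ə + + s u a ŋ".split()))  # ['k', 'ə', '+', 's', 'u', 'a', 'ŋ']
print(clean_markers("+ i".split()))              # ['i']
print(clean_markers("w u +".split()))            # ['w', 'u']
```

Because it works on the returned list only, a step like this would leave the value and form columns untouched.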

If you all agree, I can quickly prepare a PR for that.

LinguList commented 3 years ago

The tokenizer adds a + for each whitespace. To prevent this, you first need to add replacements=[(" ", "_")] to your FormSpec. If you check the datasets which I corrected (TNG, PNY, etc.), you will see that I used this solution everywhere. Then you can identify the problematic cases and delete the _.
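The effect can be illustrated with a toy model, sketched here under loud assumptions: `toy_tokenize` is a hypothetical stand-in that splits on whitespace, applies a greedy longest-match profile per word, and joins words with +. It is not the pylexibank tokenizer, and the profile entries (including "-_" mapped to +) are invented for this example as one possible way to handle the _.

```python
def toy_tokenize(form, profile):
    """Toy stand-in for a profile-based tokenizer (NOT the pylexibank one).

    Words are split on whitespace, segmented by greedy longest match
    against the profile, and joined with a "+" separator.
    """
    words = []
    for word in form.split():
        out, i = [], 0
        while i < len(word):
            # Try the longest remaining chunk first.
            for length in range(len(word) - i, 0, -1):
                chunk = word[i:i + length]
                if chunk in profile:
                    out.extend(profile[chunk].split())
                    i += length
                    break
            else:
                out.append("?")  # grapheme missing from the profile
                i += 1
        words.append(" ".join(out))
    return " + ".join(words).split()


# Hypothetical mini profile: grapheme -> segments.
profile = {c: c for c in "kəsuaŋ"}
profile["-"] = "+"    # prefix hyphen becomes a morpheme boundary
profile["-_"] = "+"   # hyphen plus replaced space: one boundary only

form = "kə- suaŋ"
print(toy_tokenize(form, profile))                    # k ə + + s u a ŋ (doubled marker)
print(toy_tokenize(form.replace(" ", "_"), profile))  # k ə + s u a ŋ
```

With the space replaced, the whole form reaches the profile as one word, so the boundary is decided by the profile entry rather than by the whitespace split.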

lexibank / naganorgyalrongic

check the prosody #8