Open LinguList opened 3 years ago
Type | ID | Value | Form | Graphemes | Segments |
---|---|---|---|---|---|
1 | LixianXuechengrGyalrong-1100_third-1 | kə- suaŋ | kə- suaŋ | ^ k ə - + s u a ŋ $ | k ə + + s u a ŋ |
1 | LixianXuechengrGyalrong-82_voidfecesvt-1 | tə- pʦʰi kə-lɛ | tə- pʦʰi kə-lɛ | ^ t ə - + p ʦʰ i + k ə - l ɛ $ | t ə + + p tsʰ i + k ə + l ɛ |
1 | LixianXuechengrGyalrong-83_voidfecesvi-1 | tə- pʦʰi da kʂɿ | tə- pʦʰi da kʂɿ | ^ t ə - + p ʦʰ i + d a + k ʂ ɿ $ | t ə + + p tsʰ i + d a + k ʂ z̩ |
1 | MaerkangBolarGyalrongA-1185_already-1 | ... ɕes44 | ... ɕes44 | ^ . . . + ɕ e s 44 $ | + ɕ e s ⁴⁴ |
1 | MaerkangBolarGyalrongA-1188_to-1 | -i | -i | ^ - i $ | + i |
1 | MaerkangBolarGyalrongA-1192_in-1 | -i | -i | ^ - i $ | + i |
1 | MaerkangBolarGyalrongA-1193_aton-1 | -i | -i | ^ - i $ | + i |
1 | MaerkangBolarGyalrongA-1196_of-1 | [A] wu-[B] | wu- | ^ w u - $ | w u + |
1 | MaerkangBolarGyalrongA-1200_only-1 | ... zɨ22 me44 | ... zɨ22 me44 | ^ . . . + z ɨ 22 + m e 44 $ | + z ɨ ²² + m e ⁴⁴ |
1 | MaerkangBolarGyalrongA-1201_exceptfor-1 | ... zɨ22 me44 | ... zɨ22 me44 | ^ . . . + z ɨ 22 + m e 44 $ | + z ɨ ²² + m e ⁴⁴ |
1 | MaerkangBolarGyalrongA-1202_only-1 | ... zɨ22 me44 | ... zɨ22 me44 | ^ . . . + z ɨ 22 + m e 44 $ | + z ɨ ²² + m e ⁴⁴ |
1 | MaerkangBolarGyalrongA-1222_becomev-1 | ... ta44 pa22 o44 | ... ta44 pa22 o44 | ^ . . . + t a 44 + p a 22 + o 44 $ | + t a ⁴⁴ + p a ²² + o ⁴⁴ |
1 | MaerkangBolarGyalrongA-1224_becomev-1 | ... ta44 pa22 o44 | ... ta44 pa22 o44 | ^ . . . + t a 44 + p a 22 + o 44 $ | + t a ⁴⁴ + p a ²² + o ⁴⁴ |
1 | MaerkangBolarGyalrongB-1188_to-1 | -j | -j | ^ - j $ | + j |
1 | MaerkangBolarGyalrongB-1197_also-1 | -j | -j | ^ - j $ | + j |
1 | MaerkangCaodengrGyalrong-1201_exceptfor-1 | ...kǝ44 ma22 … | ...kǝ44 ma22 … | ^ . . . k ǝ 44 + m a 22 + … $ | k ə ⁴⁴ + m a ²² + |
1 | MaerkangJaphugrGyalrong-1253_goodbye-1 | sɤrma je ! | sɤrma je ! | ^ s ɤ r m a + j e + ! $ | s ɤ r m a + j e + |
1 | MaerkangSomanrGyalrong-104_skin-1 | tǝ- | tǝ- | ^ t ǝ - $ | t ə + |
1 | MaerkangSomanrGyalrong-1081_seventy-1 | kə- ʃnəs ʃʧᴇ | kə- ʃnəs ʃʧᴇ | ^ k ə - + ʃ n ə s + ʃ ʧ ᴇ $ | k ə + + ʃ n ə s + ʃ tʃ ɛ |
1 | MaerkangSomanrGyalrong-146_flax-1 | ta- sa | ta- sa | ^ t a - + s a $ | t a + + s a |
1 | MaerkangSomanrGyalrong-148_fur-1 | tə- | tə- | ^ t ə - $ | t ə + |
1 | MaerkangSomanrGyalrong-149_tannedleather-1 | tə- | tə- | ^ t ə - $ | t ə + |
1 | MaerkangSomanrGyalrong-203_getboiledvi-1 | kə- sʦo | kə- sʦo | ^ k ə - + s ʦ o $ | k ə + + s ts o |
1 | MaerkangSomanrGyalrong-212_suckvt-1 | ka- mə sʨup | ka- mə sʨup | ^ k a - + m ə + s ʨ u p $ | k a + + m ə + s tɕ u p |
1 | MaerkangSomanrGyalrong-284_closevi-1 | ka- ʧat | ka- ʧat | ^ k a - + ʧ a t $ | k a + + tʃ a t |
1 | MaerkangSomanrGyalrong-3_stubbornpeoplesayhedoesnotlisten-1 | ta- ko kǝ- ŋʂɐŋ | ta- ko kǝ- ŋʂɐŋ | ^ t a - + k o + k ǝ - + ŋ ʂ ɐ ŋ $ | t a + + k o + k ə + + ŋ ʂ ɐ ŋ |
1 | MaerkangSomanrGyalrong-413_brothersiblings-1 | ka- ʃə ktɐ snɐm | ka- ʃə ktɐ snɐm | ^ k a - + ʃ ə + k t ɐ + s n ɐ m $ | k a + + ʃ ə + k t ɐ + s n ɐ m |
1 | MaerkangSomanrGyalrong-559_postponev-1 | ka- wa skrɐn | ka- wa skrɐn | ^ k a - + w a + s k r ɐ n $ | k a + + w a + s k r ɐ n |
1 | MaerkangSomanrGyalrong-614_pullitoutv-1 | ka--ldʐi | ka--ldʐi | ^ k a - - l dʐ i $ | k a + + l ɖʐ i |
1 | MaerkangSomanrGyalrong-619_chopupvt-1 | ka- ra nʦik | ka- ra nʦik | ^ k a - + r a + n ʦ i k $ | k a + + r a + n ts i k |
1 | MaerkangSomanrGyalrong-620_shearv-1 | ka- ra nʦik | ka- ra nʦik | ^ k a - + r a + n ʦ i k $ | k a + + r a + n ts i k |
1 | MaerkangSomanrGyalrong-622_cuthairvt-1 | ka- wʐɐr | ka- wʐɐr | ^ k a - + w ʐ ɐ r $ | k a + + w ʐ ɐ r |
1 | MaerkangSomanrGyalrong-872_feather-1 | ta- rkʰam | ta- rkʰam | ^ t a - + r kʰ a m $ | t a + + r kʰ a m |
1 | MaerkangSomanrGyalrong-934_lazy-1 | kə- nə paŋ kɐ | kə- nə paŋ kɐ | ^ k ə - + n ə + p a ŋ + k ɐ $ | k ə + + n ə + p a ŋ + k ɐ |
@tresoldi, may I ask you to address these issues?
$ cldfbench lexibank_check_prosody naganorgyalrongic
Sure.
I cannot find the lexibank command. Perhaps you mean `lexibank.check_phonotactics`, from https://github.com/lexibank/pylexibank/tree/prosody? But I am already fixing the ones you listed.
please check the PR in pylexibank!
and please also adjust the author in CONTRIBUTORS: we should now list ourselves as "Other", not as "Author". So you put the original author as Author for Zenodo...
Ok, I reproduced it with `lexibank.check_phonotactics`, the one I meant. For the CONTRIBUTORS, easy to do.
I cannot find an easy way to solve the problem of multiple markers, however. A few of them can be solved easily with a profile. Sometimes it involves adding similar extra rules (such as making sure that `ta-` is mapped to `t a +`, but `ta-$` to `t a`), but it is totally doable.
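As a rough illustration of the extra-rule idea, here is a minimal sketch of greedy longest-match segmentation. It is not the pylexibank or `segments` implementation; `apply_profile` and `PROFILE` are hypothetical names, and the `^`/`$` boundary handling is a simplifying assumption.

```python
# Toy profile: grapheme -> space-separated segments ("" drops the grapheme).
# Hypothetical data for illustration only, not a real orthography profile.
PROFILE = {
    "^": "",
    "$": "",
    "ta-": "t a +",   # prefix followed by more material keeps the marker
    "ta-$": "t a",    # prefix at the end of the form drops the marker
    "s": "s",
    "a": "a",
}


def apply_profile(form, profile):
    """Segment `form` by greedy longest match against the profile keys."""
    text = "^" + form + "$"  # assumed boundary symbols
    longest = max(len(grapheme) for grapheme in profile)
    segments, i = [], 0
    while i < len(text):
        # Try the longest candidate first so "ta-$" wins over "ta-".
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in profile:
                segments.extend(profile[chunk].split())
                i += size
                break
        else:
            segments.append("<?>")  # placeholder for an unmatched character
            i += 1
    return segments


print(apply_profile("ta-sa", PROFILE))
print(apply_profile("ta-", PROFILE))
```

Because the longer grapheme is tried first, the form-final rule takes precedence whenever it applies.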
Still, most cases arise because the tokenizer always adds a `+` as a separator. There is an argument for it, but I cannot find a way to override it (here). Notice that the splitting happens before the orthographic profile is applied.
An alternative could be to use spaces in the graphemes of the orthographic profile, such as mapping both `"ta-"` and `"ta- "` (with a trailing space) to `t a +` (as the second is longer, it would take precedence). However, we cannot use spaces in the graphemes, also due to the splitting above.
An additional alternative would be to call `self.tokenizer({}, form)` ourselves, removing multiple subsequent `+` (and, for that matter, leading and trailing ones), but we don't have an `.add_value_with_segments()` method (which, personally, I think would be a bad thing to have), meaning that we would need to reproduce the `FormSpec` output as well, passing all three of `value`, `form`, and `segments` to `.add_form_with_segments()`.
My suggestion, to keep as much backwards compatibility as possible and leave room for solving other problems, would be to patch `pylexibank.LexibankWriter.tokenize()` (here), modifying the list of segments that is returned to (a) strip leading and trailing markers and (b) replace multiple subsequent markers with a single one.
If you all agree, I can quickly prepare a PR for that.
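The post-processing in (a) and (b) could look roughly like the sketch below. It is a standalone helper, not the actual patch to pylexibank; the name `clean_segments` and the default marker `+` are assumptions made for illustration.

```python
def clean_segments(segments, marker="+"):
    """Strip leading/trailing markers and collapse runs of markers."""
    cleaned = []
    for segment in segments:
        # Skip a marker at the start or directly after another marker.
        if segment == marker and (not cleaned or cleaned[-1] == marker):
            continue
        cleaned.append(segment)
    # Drop a marker left at the end of the sequence.
    while cleaned and cleaned[-1] == marker:
        cleaned.pop()
    return cleaned


# Segment strings taken from the table above.
print(clean_segments("k ə + + s u a ŋ".split()))
print(clean_segments("+ i".split()))
print(clean_segments("t ə +".split()))
```

Applied to the listed rows, this would turn `k ə + + s u a ŋ` into `k ə + s u a ŋ` and reduce `+ i` to `i`.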
The tokenizer adds a `+` for each whitespace. To avoid this, you first need to add `replacements=[(" ", "_")]` to your FormSpec. If you check datasets which I corrected (TNG, PNY, etc.), you will see that I used this solution everywhere. Then you can identify the problematic cases and delete the `_`.
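To see why the replacement helps, here is a toy model of the behaviour described above. It only imitates whitespace splitting with a `+` separator and does not call pylexibank; `toy_tokenize` and `apply_replacements` are hypothetical helpers, and splitting each word into single characters is a simplification.

```python
def apply_replacements(form, replacements):
    """Apply (source, target) string replacements in order."""
    for source, target in replacements:
        form = form.replace(source, target)
    return form


def toy_tokenize(form, separator="+"):
    """Split on whitespace and insert a separator between the words."""
    segments = []
    for index, word in enumerate(form.split()):
        if index:
            segments.append(separator)
        segments.extend(word)  # one segment per character (simplification)
    return segments


form = "ta- sa"
print(toy_tokenize(form))
print(toy_tokenize(apply_replacements(form, [(" ", "_")])))
```

With the space replaced, no `+` is inserted automatically; the `_` reaches the profile as an ordinary character, so each case can be mapped to a marker or dropped there.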