lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Check for two consecutive tones #215

Closed tresoldi closed 3 years ago

tresoldi commented 4 years ago

If a profile does not properly account for tonal annotations that use multiple numbers (such as ³⁵ or ²¹⁴, i.e., mostly in the case of Chao numbers vs. tone numbers), we might end up with multiple subsequent tones in segments (like m a ³ ⁵ instead of m a ³⁵). I'm guilty of this, particularly for datasets that mix Chao and tone numbers.

This should be a phonotactics check. I can implement this, but we should decide how. Without being too clever, one option would be to join the SCA sound classes of the segments into a single string (something like " ".join([sca[token] for token in segments]), pseudocode) and look for T T.
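The check proposed above can be sketched roughly as follows. This is only an illustration: the `TONE` pattern is an assumed stand-in for the SCA sound class "T" (it treats any segment made up solely of digits or superscript digits as a tone), and the helper names `to_classes` and `has_consecutive_tones` are hypothetical, not part of pylexibank, lingpy, or sinopy.

```python
import re

# Assumption: a segment counts as a tone if it consists only of plain or
# superscript digits. This stands in for the SCA sound class "T"; it is
# not how pylexibank or lingpy actually classify segments.
TONE = re.compile(r"^[⁰¹²³⁴⁵⁶⁷⁸⁹0-9]+$")


def to_classes(segments):
    """Map each segment to "T" (tone) or "X" (anything else)."""
    return ["T" if TONE.match(seg) else "X" for seg in segments]


def has_consecutive_tones(segments):
    """Return True if two tone segments are directly adjacent."""
    classes = to_classes(segments)
    # Equivalent to searching for "T T" in " ".join(classes).
    return any(a == "T" and b == "T" for a, b in zip(classes, classes[1:]))
```

With this sketch, `has_consecutive_tones("m a ³ ⁵".split())` flags the split tone, while `"m a ³⁵".split()` passes. A real implementation would presumably take the classes from the existing sound-class machinery instead of a regular expression.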

LinguList commented 4 years ago

This check is part of the structure test (automatically), which we will try to implement in sinopy. I see two possibilities:

  1. implement sinopy checks in lexibank (maybe too much, because sinopy is still heavy and untested)
  2. recommend to test with sinopy in specific situations
  3. half-way testing: arrange simple phonotactic sanity tests based on sound classes (which is all in place with our dependencies) and encourage testing with sinopy in pure SEA language datasets with morphological segmentation and the typical one-morpheme=one-syllable structure

LinguList commented 4 years ago

Okay, three possibilities ;)

tresoldi commented 4 years ago

Maybe sinopy checking could be a per-dataset flag?

xrotwang commented 4 years ago

I think I'd prefer checking with sinopy, iff this can be considered an area or documentation-tradition specific issue. I.e. sinopy would be considered a documentation-tradition specific support package.

Possibly, sinopy might at some point provide other functionality used in the dataset's makecldf command as well?

xrotwang commented 4 years ago

I can have a go at adding more tests to sinopy, if that's ok. Is the API somewhat stable by now?

LinguList commented 4 years ago

> Possibly, sinopy might at some point provide other functionality used in the dataset's makecldf command as well?

Yes, I have a preliminary profile-creation routine that would also provide structural data in the forms.csv, but it has so far not been used.

tresoldi commented 4 years ago

That makes sense; sinopy could even incorporate other EA (i.e., non-Sino) checks. For sidwellpalaungic, for example, the (C1ă).ˈCi(Cm)V(ː)(Cf)T pattern.

But just fixing my mistakes is enough. :wink: I will fix them by just grepping the forms.csv for the time being.
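As a rough alternative to grepping, a small script could scan the forms.csv directly. The sketch below assumes the file has `ID` and `Segments` columns with space-separated segments, and that tones are written with superscript digits only; the function name `rows_with_split_tones` is hypothetical.

```python
import csv
import re

# Assumption: tones are written with superscript digits only.
TONE = re.compile(r"^[⁰¹²³⁴⁵⁶⁷⁸⁹]+$")


def rows_with_split_tones(lines, column="Segments"):
    """Yield (ID, segments) for rows where two tone segments are adjacent.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    """
    for row in csv.DictReader(lines):
        segments = (row.get(column) or "").split()
        flags = [bool(TONE.match(seg)) for seg in segments]
        if any(a and b for a, b in zip(flags, flags[1:])):
            yield row.get("ID"), segments
```

It could be run over a dataset by opening the file with `open(path, encoding="utf-8", newline="")` and passing the handle to the function, where `path` points to wherever the dataset's forms.csv lives.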

LinguList commented 4 years ago

And yes, the API is somewhat stable, but the problem is that there is no separation of data and code, so the code loads a lot of data. I think this would require a framework where data is placed somewhere else (e.g., in cldf). But if you have a quick look and give me your impression, we can see how much time it would cost to enhance it (or break out a sub-part of sinopy to start with).

LinguList commented 4 years ago

Major functionality is this simple function:

https://github.com/lingpy/sinopy/blob/e8a2fa57fc6940d9c27777c85dcaaf3a4dd11164/src/sinopy/sinopy.py#L402

It guesses the phonotactics of a morpheme (initial, medial, nucleus, coda, tone).

xrotwang commented 4 years ago

Ah, yes. I think refactoring sinopy would be a good idea. I just tried running the tests and they fail because pyburmish is imported, too. So better to refactor before this becomes a big tangled mess.

LinguList commented 4 years ago

Okay, let me try and have a go at this today. I'll try to extract core functions and kick out the stuff we don't need.

LinguList commented 3 years ago

I think we can really outsource this issue to linse or sinopy, so we can ignore it here.