Closed: tresoldi closed 3 years ago
This check is part of the structure test (automatically), which we will try to implement in sinopy. I see two possibilities:
Okay, three possibilities ;)
Maybe `sinopy` checking could be a per-dataset flag?
I think I'd prefer checking with `sinopy`, iff this can be considered an area or documentation-tradition specific issue. I.e. `sinopy` would be considered a documentation-tradition specific support package.
Possibly, `sinopy` might at some point provide other functionality used in the dataset's `makecldf` command as well?
I can have a go at adding more tests to `sinopy`, if that's ok. Is the API somewhat stable by now?
> Possibly, sinopy might at some point provide other functionality used in the dataset's makecldf command as well?
Yes, I have a preliminary profile creation that would also provide structural data in the forms.csv, but has so far not been used.
It makes sense, `sinopy` could even incorporate other EA (i.e., non-Sino) checks. For `sidwellpalaungic`, for example, the `(C1ă).ˈCi(Cm)V(ː)(Cf)T` pattern.
But just fixing my mistakes is enough. :wink: I will fix them by just grepping the `forms.csv` for the time being.
And yes, the API is somewhat stable, but the problem is that there is no separation of data and code, so the code loads a lot of data. I think this would require a framework where data is placed somewhere else (e.g., in cldf). But if you have a quick look and give me your impression, we can see how much time it would cost to enhance it (or break out a sub-part of sinopy to start with).
Major functionality is this simple function:
It guesses the phonotactics of a morpheme (initial, medial, nucleus, coda, tone).
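To make the idea concrete, here is a toy sketch of what such a guesser could look like. It is not the sinopy function referred to above (that code is not shown in this thread); the name `guess_structure`, the small vowel and glide inventories, and the `i`/`m`/`n`/`c`/`t` labels are assumptions made purely for illustration.

```python
# Hypothetical toy guesser, NOT sinopy's implementation.
# Inventories are deliberately tiny and are assumptions for this sketch.
TONES = set("⁰¹²³⁴⁵⁶⁷⁸⁹")
VOWELS = set("aeiouyɐɑɒæɔəɛɤɨɪɯʊʉʌøœ")
GLIDES = {"j", "w", "ɥ"}


def guess_structure(segments):
    """Label each non-empty segment of one morpheme as
    i(nitial), m(edial), n(ucleus), c(oda) or t(one)."""
    labels = []
    seen_nucleus = False
    for position, seg in enumerate(segments):
        if all(ch in TONES for ch in seg):
            labels.append("t")  # only superscript digits: a tone
        elif seg[0] in VOWELS:
            labels.append("n")  # starts with a vowel: nucleus
            seen_nucleus = True
        elif seen_nucleus:
            labels.append("c")  # consonant after the nucleus: coda
        elif seg in GLIDES and position > 0:
            labels.append("m")  # glide between initial and nucleus
        else:
            labels.append("i")  # anything else before the nucleus
    return labels


print(guess_structure(["m", "j", "a", "ŋ", "³⁵"]))  # ['i', 'm', 'n', 'c', 't']
```

A real implementation would need a proper segment inventory and handling of syllabic consonants, which this sketch ignores.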
Ah, yes. I think refactoring `sinopy` would be a good idea. Just tried running the tests and they fail because `pyburmish` is imported, too. So better refactor before this becomes a big tangled mess.
Okay, let me try and have a go at this today. I'll try to extract core functions and kick out the stuff we don't need.
I think we can really outsource this issue to linse or sinopy, so we can ignore it here.
If a profile does not properly account for tonal annotations that use multiple numbers (such as `³⁵` or `²¹⁴`, i.e., mostly in the case of Chao numbers vs. tone numbers), we might end up with multiple consecutive tones in segments (like `m a ³ ⁵` instead of `m a ³⁵`). I'm guilty of this, particularly for datasets that mix Chao and tone numbers.

This should be a phonotactics check. I can implement this, but we should decide how. Without being too clever, one alternative would be to join the SCA for the segments in a single string (something like `" ".join([sca[token] for token in segments])` -- pseudo code) and look for `T T`.
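A minimal sketch of that check, to help decide on the approach. It assumes a toy stand-in for the SCA lookup: `sound_class` and `has_split_tone` are hypothetical names, and any token made up only of superscript digits is treated as class `T`, everything else as `X`. A real version would use an actual sound-class model instead.

```python
# Superscript digits, as used for Chao tone numbers and tone numbers.
TONE_CHARS = set("⁰¹²³⁴⁵⁶⁷⁸⁹")


def sound_class(token):
    """Toy stand-in for an SCA lookup (an assumption, not a real model):
    'T' for tokens made only of superscript digits, 'X' otherwise."""
    return "T" if all(ch in TONE_CHARS for ch in token) else "X"


def has_split_tone(segments):
    """Return True if a space-separated segments string contains
    two adjacent tone tokens, e.g. 'm a ³ ⁵'."""
    classes = " ".join(sound_class(token) for token in segments.split())
    # Every class is a single character, so a plain substring test is enough.
    return "T T" in classes


print(has_split_tone("m a ³ ⁵"))  # True
print(has_split_tone("m a ³⁵"))   # False
```

Since a morpheme separator such as `+` maps to `X` here, tones on either side of a boundary (`m a ³⁵ + t a ⁵¹`) are not flagged.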