Open eroux opened 3 years ago
@eroux do you want to have write access to the repo so that you can make changes yourself if you and @FChrispz are on the same page?
hmm, the main problem is that I don't know how to test the changes I would make, so I'd be uncomfortable changing things I can't test. I'm just copying the rules to the Lucene analyzer, where I can test them
For me it is the same, I can make the changes or @eroux can make them, once we agree on the modification of the rule. @eroux same with གནའི་ - is that correct? In any case I think you are right.
ah indeed, this is also a good example!
sorry, reading the rule again, གནའི works with the current rule, because it needs 3 characters before the final consonant with a vowel and there are only 2 in this case
@eroux I restricted the cases for the general rule. I am excluding some (maybe we need to exclude all?) subjoined letters in Unicode: in the Tibetan script there are only yata, rata and lata, but in Unicode every consonant has a subjoined version, and Tibetan Unicode stacks letters from top to bottom regardless of whether a letter is a root consonant or not. So far I have excluded "tsa", "ga" and "da", according to the cases found in the OTA, OTC and Ramayana texts. Here is the improved rule:
Merged Syllables (Generic Rule)
Background: Traditionally in Classical Tibetan, syllables are separated by a tsheg. In Old Tibetan texts, syllable boundaries are not so clear, and often a syllable (verb, noun and so on) is merged with the following case marker or converb (for example: སྟགི > སྟག་གི, དུསུ > དུས་སུ, བཀུམོ > བཀུམ་མོ). It is possible to split the syllables using three regular expressions: we have to apply the most specific ones first and, at the end, the generic rule.
Rule: Split merged syllables by applying the following regular expression: ([^aeiouI\s]+[aeiouI][^aeiouI\s])([^aeiouI\s'])([aeiouI][^aeiouI\s']) > $1$2 $2$3
In general the merged syllable will be: {2-6}C + C (not yata/rata/lata/wa, subjoined "tsa", subjoined "ga" or subjoined "da") + V. Example: bsgrubso > bsgrubs so
Examples:
no subjoined tsa \u0fa9: we don't want བརྩེ to be split;
no subjoined ga \u0f92: we don't want བསྒོ་ to be split;
no subjoined la \u0fb3: we don't want བཟླུ to be split.
We need to exclude "" because it can occur at the beginning of a syllable to keep track of editorial changes from the original text (only needed for the Ramayana text so far).
SPLITCOHORT ( "<$1>"v "$1$3་"v σ "<$3$4>"v "$3$4"v σ )("<(.{2,6}^)(([^\u0FB2\u0FB1\u0fa9\u0f92\u0fb3\u0FA1])([\u0F7C\u0F7A\u0F74\u0F72\u0F80]་?))>"r)(NOT 0 (split) or (genitive) or (diphthongs));
@eroux At the moment I exclude only subjoined tsa/ga/da + (yata/rata/lata). It might be correct to exclude all the subjoined letters that can occur in Unicode, but I prefer to be conservative in this regard to avoid capturing unwanted cases.
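For anyone who wants to try the generic rule outside the grammar (for instance before porting it to another analyzer), here is a minimal Python sketch of the Unicode pattern quoted above. It is only an approximation under stated assumptions: the names `split_merged` and `MERGED` are made up for illustration, the pattern is applied to one whole token at a time, and the contextual condition (NOT 0 (split) or (genitive) or (diphthongs)) is not reproduced, so genitive forms are not filtered out here.

```python
import re

# Subjoined letters excluded by the rule quoted in this thread:
# ra-ta, ya-ta, subjoined tsa, subjoined ga, la-ta, subjoined da.
EXCLUDED = "\u0FB2\u0FB1\u0FA9\u0F92\u0FB3\u0FA1"
# Vowel signs listed in the rule: o, e, u, i, reversed i.
VOWELS = "\u0F7C\u0F7A\u0F74\u0F72\u0F80"
TSHEG = "\u0F0B"

# 2 to 6 leading characters, one consonant that is not an excluded
# subjoined letter, then a vowel sign with an optional trailing tsheg.
MERGED = re.compile(
    "(.{2,6})" + "([^" + EXCLUDED + "])" + "([" + VOWELS + "]" + TSHEG + "?)"
)


def split_merged(token):
    """Split one merged token into two syllables, or return it unchanged.

    Hypothetical helper: it ignores the (split)/(genitive)/(diphthongs)
    tags that the original rule checks, so it can over-match.
    """
    m = MERGED.fullmatch(token)
    if m is None:
        return [token]
    stem, final, ending = m.groups()
    # The final consonant is doubled: it closes the first syllable
    # and opens the second, as in the "$1$3" + tsheg / "$3$4" output.
    return [stem + final + TSHEG, final + ending]


if __name__ == "__main__":
    # Merged examples and protected examples taken from this thread.
    for token in ["སྟགི", "དུསུ", "བཀུམོ", "བརྩེ", "བསྒོ་", "བཟླུ"]:
        print(token, split_merged(token))
```

Because the tag-based condition is left out, this sketch is best used to check which tokens the character pattern alone would capture, not as a replacement for the rule itself.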
I think it matches a bit too much, for instance it will transform བསྒའི into བསྒའ་འི which is not great... I think changing it to
will make it safer