tibetan-nlp / old-tibetan-corpus

Linguistically analyzed Old Tibetan documents and some tools for processing Old Tibetan text
MIT License

issue with rule 4 of merged syllables #6

Open eroux opened 3 years ago

eroux commented 3 years ago
SPLITCOHORT (
  "<$1>"v "$1$3་"v σ
  "<$3$4>"v "$3$4"v σ
)("<(.{2,6})(([^\\u0FB2\\u0FB1])([\\u0F7C\\u0F7A\\u0F74\\u0F72\\u0F80]་?))>"r)(NOT 0 (split) or (genitive) or (diphthongs));

I think it matches a bit too much, for instance it will transform བསྒའི into བསྒའ་འི which is not great... I think changing it to

SPLITCOHORT (
  "<$1>"v "$1$3་"v σ
  "<$3$4>"v "$3$4"v σ
)("<(.{2,6})(([\\u0F40-\\u0F5F\\u0F61-\\u0F6A])([\\u0F7C\\u0F7A\\u0F74\\u0F72\\u0F80]་?))>"r)(NOT 0 (split) or (genitive) or (diphthongs));

will make it safer
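
For anyone who wants to try the two patterns outside CG-3, here is a minimal Python sketch of the difference. It is only an approximation: it assumes Python's `re` treats these patterns the same way the CG-3 regex engine does, it assumes the pattern has to cover the whole token, and it ignores the `NOT 0 (split) or (genitive) or (diphthongs)` condition. The names `CURRENT`, `PROPOSED` and `split_merged` are made up for the illustration.

```python
import re

TSHEG = "\u0F0B"
VOWELS = "\u0F7C\u0F7A\u0F74\u0F72\u0F80"

# Current rule: the doubled consonant may be anything except subjoined ra/ya.
CURRENT = re.compile(
    "(.{2,6})(([^\u0FB2\u0FB1])([" + VOWELS + "]" + TSHEG + "?))"
)
# Proposed rule: the doubled consonant must be a base letter other than U+0F60.
PROPOSED = re.compile(
    "(.{2,6})(([\u0F40-\u0F5F\u0F61-\u0F6A])([" + VOWELS + "]" + TSHEG + "?))"
)

def split_merged(pattern, token):
    """Return the two cohorts the rule would create, or the token unchanged."""
    m = pattern.fullmatch(token)
    if m is None:
        return [token]
    # Mirrors "$1$3<tsheg>" and "$3$4" from the SPLITCOHORT rule.
    return [m.group(1) + m.group(3) + TSHEG, m.group(3) + m.group(4)]

bsgai = "\u0F56\u0F66\u0F92\u0F60\u0F72"  # བསྒའི
stagi = "\u0F66\u0F9F\u0F42\u0F72"        # སྟགི

print(split_merged(CURRENT, bsgai))   # split into two parts (the unwanted case)
print(split_merged(PROPOSED, bsgai))  # left as a single token
print(split_merged(PROPOSED, stagi))  # still split
```

In this sketch the proposed character class leaves བསྒའི alone while a genuinely merged form such as སྟགི is still split.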

heacu commented 3 years ago

@eroux do you want to have write access to the repo so that you can make changes yourself if you and @FChrispz are on the same page?

eroux commented 3 years ago

hmm, the main problem is that I don't know how to test the changes I would make, so I'd be uncomfortable changing things I can't test. I'm just copying the rules to the Lucene analyzer, where I can test them.

FChrispz commented 3 years ago

Either way is fine for me: I can make the changes or @eroux can make them, once we agree on how to modify the rule. @eroux is it the same with གནའི་? In any case I think you are right.

eroux commented 3 years ago

ah indeed, this is also a good example!

eroux commented 3 years ago

sorry, reading the rule again, གནའི works with the current rule, because it needs 3 characters before the final consonant with a vowel and there are only 2 in this case

FChrispz commented 2 years ago

@eroux I restricted the cases for the general rule. I am now excluding some subjoined letters (maybe we need to exclude all of them?). In the Tibetan script the only subjoined letters are yata, rata and lata, but in Unicode every consonant has a subjoined version, and Tibetan Unicode encodes the letters of a stack from top to bottom regardless of which one is the root consonant. So far I have excluded subjoined "tsa", "ga" and "da", based on the cases found in the OTA, OTC and Ramayana texts. Here is the improved rule:

Merged Syllables (Generic Rule)

Background: Traditionally in Classical Tibetan, syllables are separated by a tsheg. In Old Tibetan texts, syllable boundaries are less clear, and a syllable (verb, noun and so on) is often merged with the following case marker or converb (for example: སྟགི > སྟག་གི, དུསུ > དུས་སུ, བཀུམོ > བཀུམ་མོ). The syllables can be split using three regular expressions: the most specific ones have to be applied first and the generic rule last.

Rule: Split merged syllables by applying the following regular expression:

([^aeiouI\s]+[aeiouI][^aeiouI\s])([^aeiouI\s'])([aeiouI][^aeiouI\s']) > $1$2 $2$3
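
The generic rule can be tried on transliterated text with a short Python sketch. One assumption needs flagging: as printed, the expression requires a consonant after each vowel, so it cannot match examples such as stagi or bsgrubso. The sketch therefore assumes that the two trailing consonant classes were meant to carry a `*` quantifier. The name `split_generic` is hypothetical, and only the generic rule is shown, not the more specific ones that should run before it.

```python
import re

# Assumption: the trailing consonant classes in $1 and $3 are optional (`*`),
# which the expression as printed does not show.
GENERIC = re.compile(
    r"([^aeiouI\s]+[aeiouI][^aeiouI\s]*)"  # $1: onset, vowel, any coda before the shared consonant
    r"([^aeiouI\s'])"                      # $2: consonant written once but belonging to both syllables
    r"([aeiouI][^aeiouI\s']*)"             # $3: vowel (and any coda) of the merged particle
)

def split_generic(text):
    """Insert a space at the merge point and repeat the shared consonant."""
    return GENERIC.sub(r"\1\2 \2\3", text)

for merged in ("stagi", "dusu", "bkumo", "bsgrubso"):
    print(merged, ">", split_generic(merged))
```

Under that assumption the four transliterated examples from this thread come out as stag gi, dus su, bkum mo and bsgrubs so, and a plain syllable such as stag is left untouched.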

In general the syllable will be: {2-6}C + C (not yata/rata/lata/wa, subj. "tsa", subj. "ga" or subj. "da") + V. Example: bsgrubso > bsgrubs so

Examples:
no subjoined tsa \u0fa9: we don't want བརྩེ to be split;
no subjoined ga \u0f92: we don't want བསྒོ་ to be split;
no subjoined la \u0fb3: we don't want བཟླུ to be split.

We also need to exclude "" because it can occur at the beginning of a syllable to keep track of editorial changes from the original text (only needed for the Ramayana text so far).

SPLITCOHORT (
  "<$1>"v "$1$3་"v σ
  "<$3$4>"v "$3$4"v σ
)("<(.{2,6}^)(([^\u0FB2\u0FB1\\u0fa9\u0f92\u0fb3\u0FA1])([\u0F7C\u0F7A\u0F74\u0F72\u0F80]་?))>"r)(NOT 0 (split) or (genitive) or (diphthongs));
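
The three "do not split" examples can be checked with a small Python sketch of the improved character class. This is not the CG-3 rule itself. It leaves out the `^` after `{2,6}`, whose purpose is not clear from the rule as printed, assumes the pattern must cover the whole token, and ignores the `NOT 0 ...` condition. `IMPROVED` and `split_merged` are illustrative names.

```python
import re

TSHEG = "\u0F0B"
VOWELS = "\u0F7C\u0F7A\u0F74\u0F72\u0F80"
# Subjoined ra, ya, tsa, ga, la, da: never treated as the doubled consonant.
EXCLUDED = "\u0FB2\u0FB1\u0FA9\u0F92\u0FB3\u0FA1"

IMPROVED = re.compile(
    "(.{2,6})(([^" + EXCLUDED + "])([" + VOWELS + "]" + TSHEG + "?))"
)

def split_merged(token):
    """Return the two cohorts the rule would create, or the token unchanged."""
    m = IMPROVED.fullmatch(token)
    if m is None:
        return [token]
    return [m.group(1) + m.group(3) + TSHEG, m.group(3) + m.group(4)]

keep = [
    "\u0F56\u0F62\u0FA9\u0F7A",        # བརྩེ  (subjoined tsa)
    "\u0F56\u0F66\u0F92\u0F7C\u0F0B",  # བསྒོ་ (subjoined ga)
    "\u0F56\u0F5F\u0FB3\u0F74",        # བཟླུ  (subjoined la)
]
for token in keep:
    print(split_merged(token) == [token])  # True: left unsplit

bkumo = "\u0F56\u0F40\u0F74\u0F58\u0F7C"  # བཀུམོ, a merged form
print(split_merged(bkumo))                 # two parts
```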

FChrispz commented 2 years ago

@eroux At the moment I exclude only subjoined tsa/ga/da + (yata/rata/lata). It might be correct to exclude all the subjoined letters that can occur in Unicode, but I prefer to be conservative in this regard to avoid capturing unwanted cases.
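
One way to see which further subjoined letters might need excluding is to scan the corpus for subjoined code points that are not yet on the list. The following is a rough Python sketch, assuming that the Unicode character names are a good enough test for "subjoined". The helper names are hypothetical and nothing here is part of the repository's tooling.

```python
import unicodedata

# Subjoined letters the rule excludes so far: ra, ya, tsa, ga, la, da.
EXCLUDED = set("\u0FB2\u0FB1\u0FA9\u0F92\u0FB3\u0FA1")

def subjoined_letters(text):
    """Characters in text whose Unicode name marks them as subjoined letters."""
    return [
        ch for ch in text
        if unicodedata.name(ch, "").startswith("TIBETAN SUBJOINED LETTER")
    ]

def not_yet_excluded(text):
    """Subjoined letters in text that the current exclusion list does not cover."""
    return [ch for ch in subjoined_letters(text) if ch not in EXCLUDED]

bsgo = "\u0F56\u0F66\u0F92\u0F7C"  # བསྒོ: subjoined ga, already excluded
stag = "\u0F66\u0F9F\u0F42"        # སྟག: subjoined ta, not on the list

print([unicodedata.name(ch) for ch in subjoined_letters(bsgo)])
print([f"U+{ord(ch):04X}" for ch in not_yet_excluded(stag)])
```

Running `not_yet_excluded` over the tokens of the OTA, OTC and Ramayana texts would list the candidates without committing to excluding the whole subjoined range up front.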