Old Tibetan gi -> g gi issue

tibetan-nlp / old-tibetan-corpus

Linguistically analyzed Old Tibetan documents and some tools for processing Old Tibetan text

MIT License


Old Tibetan gi -> g gi issue #4

Closed: eroux closed this issue 2 years ago

eroux commented 3 years ago

in:

SPLITCOHORT (
  "<$1>"v "$1ག་"v σ
  "<$2>"v "ག$4"v σ
)("<(.)((\\u0F42)([\\u0F72\\u0F80]་?))>"r)

that means you're also going to split the following:

དགི* -> དག་གི*
བགི* -> བག་གི*
མགི* -> མག་གི*
འགི* -> འག་གི*

but in these 4 cases it can be analyzed as

prefix (ད, བ, མ or འ)
main letter ga
gigu

instead of the contraction... so maybe

SPLITCOHORT (
  "<$1>"v "$1ག་"v σ
  "<$2>"v "ག$4"v σ
)("<([^དབམའ])((\\u0F42)([\\u0F72\\u0F80]་?))>"r)

could be more accurate? Or maybe in some of these cases a contraction is still the more likely reading? wdyt?
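To make the difference concrete, here is a rough Python sketch that compares the two patterns. It is not part of the grammar and only approximates what CG-3 does: it assumes the regex is matched against the whole wordform (the `<` and `>` delimiters are dropped and `fullmatch` is used instead), and `split_contraction` is a hypothetical helper that returns just the two new baseforms.

```python
import re

# Pattern from the existing rule: any single letter, then ga (U+0F42),
# then gigu or reversed gigu (U+0F72 / U+0F80), then an optional tsheg (U+0F0B).
ORIGINAL = re.compile("(.)((\u0F42)([\u0F72\u0F80]\u0F0B?))")

# Proposed pattern: same, but the first letter must not be one of the
# prefix letters da (U+0F51), ba (U+0F56), ma (U+0F58) or 'a (U+0F60).
PROPOSED = re.compile(
    "([^\u0F51\u0F56\u0F58\u0F60])((\u0F42)([\u0F72\u0F80]\u0F0B?))"
)


def split_contraction(token, pattern):
    """Return the two baseforms the rule would produce, or None if no match.

    Assumption: the CG-3 regex has to cover the whole wordform, so
    fullmatch() stands in for the "<...>" delimiters.
    """
    m = pattern.fullmatch(token)
    if m is None:
        return None
    first = m.group(1) + "\u0F42\u0F0B"   # "$1ག་"
    second = "\u0F42" + m.group(4)        # "ག$4"
    return (first, second)


if __name__ == "__main__":
    for token in ["པགི་", "དགི", "བགི", "མགི", "འགི"]:
        print(token,
              split_contraction(token, ORIGINAL),
              split_contraction(token, PROPOSED))
```

With the original pattern every token in the demo list is split. With the proposed one only the first is, because its leading letter is not one of the four prefix letters.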

FChrispz commented 3 years ago

@eroux what do you mean by "contraction"?

eroux commented 3 years ago

I mean the case the rule was written for, like པགི་ > པག་གི་

eroux commented 3 years ago

to be a bit more explicit: པགི་ cannot be "normal" Tibetan, it's necessarily པག་གི་, so the rule works in that case. OTOH, དགི can be regular Tibetan and doesn't necessarily represent དག་གི, and I'm not sure it's a good idea to apply the rule in that case. (I'm also not sure it's a bad idea)

FChrispz commented 3 years ago

both དགི and འགི can be regular Tibetan. Let me have a deeper look at it and I will let you know.

eroux commented 3 years ago

yes, དགི and འགི (and བགི and མགི) can be regular Tibetan, that's why I think the rule should be adjusted. The rule works on པགི་ though, which cannot be regular Tibetan

FChrispz commented 2 years ago

Hi @eroux, I am back to finalizing the normalisation grammar. At the moment I am working on the OT Ramayana, and I have added new rules to handle some cases in the new text. I want to resolve the issues you raised as well. Regarding these contractions with the genitive: བགི and མགི cannot be regular Tibetan. དགི and འགི can. In all of OTDO there are no cases of དགི and only a few cases of འགི, which I don't think are contractions (and which do not appear in our texts). I would modify the rule as follows:

SPLITCOHORT (
  "<$1>"v "$1ག་"v σ
  "<$2>"v "ག$4"v σ
)("<([^དའ])((\u0F42)([\u0F72\u0F80]་?))>"r)

What do you think?
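As a quick sanity check of the modified pattern, here is a minimal Python sketch. It is only an approximation of the CG-3 behaviour: it assumes the regex must cover the whole wordform, it produces baseforms only, and `normalize_tokens` is a hypothetical helper that is not part of the grammar.

```python
import re

# Modified pattern without the CG-3 "<...>" wordform delimiters.
# Excluded first letters: da (U+0F51) and 'a (U+0F60).
# U+0F42 is ga, U+0F72 / U+0F80 are gigu / reversed gigu, U+0F0B is the tsheg.
FINAL = re.compile("([^\u0F51\u0F60])\u0F42([\u0F72\u0F80]\u0F0B?)")


def normalize_tokens(tokens):
    """Split contracted genitives; leave every other token untouched."""
    out = []
    for token in tokens:
        m = FINAL.fullmatch(token)
        if m is None:
            out.append(token)
        else:
            out.append(m.group(1) + "\u0F42\u0F0B")  # e.g. བག་
            out.append("\u0F42" + m.group(2))        # e.g. གི
    return out


if __name__ == "__main__":
    print(normalize_tokens(["བགི", "མགི", "དགི", "འགི"]))
```

Under these assumptions བགི and མགི are split into two tokens each, while དགི and འགི pass through unchanged.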

eroux commented 2 years ago

looks good, thanks!