tibetan-nlp / old-tibetan-corpus

Linguistically analyzed Old Tibetan documents and some tools for processing Old Tibetan text
MIT License

substitute rule in OT normalization grammar that will never apply #8

Closed: heacu closed this issue 2 years ago

heacu commented 3 years ago

In the OT normalization grammar, the following rule will never apply:

SUBSTITUTE ("ཡི་གེའ(་?)"r) ("ཡི་གེ$1"v) TARGET (σ);

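The reason is that each token in the input texts is a single tsheg bar (hence σ), whereas this pattern spans two tsheg bars (ཡི་ and གེའ་), so no single token can ever match it. The rule should be altered to target the second syllable and test the preceding syllable as context.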
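To illustrate the point outside the grammar, here is a minimal Python sketch. It is not CG-3; it only assumes that the tokenizer emits one token per tsheg bar and keeps the trailing tsheg on each token. The helper `split_tsheg_bars` is hypothetical and stands in for whatever tokenizer the pipeline actually uses.

```python
import re

def split_tsheg_bars(text):
    """Hypothetical tokenizer: one token per tsheg bar, trailing tsheg kept."""
    return re.findall(r"[^\u0f0b]+\u0f0b?", text)

# Pattern taken from the original rule; it contains a tsheg in the middle,
# so it describes two syllables rather than one.
pattern = re.compile("ཡི་གེའ(་?)")

text = "ཡི་གེའ་"
tokens = split_tsheg_bars(text)

# Test the pattern against each token separately, as a per-token rule would.
matches = [tok for tok in tokens if pattern.search(tok)]

print(len(tokens))  # 2
print(matches)      # []
```

Because the input is split into two tokens before any rule runs, the two-syllable pattern finds nothing to match in either of them.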

heacu commented 3 years ago

The same is true of this rule:

SUBSTITUTE ("ཆེད་པོ((འི|ར|ས)?་)"r) ("ཆེན་པོ$1"v) TARGET (σ);

FChrispz commented 2 years ago

@heacu I tested the grammar with this new rule and now it works:

ད / ན suffix variation

Background: The ད / ན suffix variation is another feature of Old Tibetan. Common forms are ཆེད་པོ་ and ཅེད་པོ་.

Rule: Normalize ཆེད་པོ་(པོའི་/པོར་/པོས་) and ཅེད་པོ་(པོའི་/པོར་/པོས་) as ཆེན་པོ་(པོའི་/པོར་/པོས་).

SUBSTITUTE ("(ཅེ|ཆེ)(ད|ན)་"r) ("ཆེན་") TARGET (σ) (1 ("((པོ|ཕོ)(འི|ར|ས)?(་?))"r));

FChrispz commented 2 years ago

I also fixed and tested the SUBSTITUTE rule for ཡི་གེའ:

SUBSTITUTE ("གེའ(་?)"r) ("གེ$1"v) TARGET (σ) (-1 ("ཡི་"));
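For readers who want to check the intended behaviour of the two context-based rules without running the grammar, the following Python sketch simulates them. It is an approximation under stated assumptions, not the CG-3 implementation: tokens are assumed to be single tsheg bars with the trailing tsheg attached, patterns are assumed to match the whole token, and `substitute_with_context` is a hypothetical helper, not part of any library.

```python
import re

def substitute_with_context(tokens, target, replacement, offset, context):
    """Rewrite a token that fully matches `target` when the token at
    relative position `offset` fully matches `context` (hypothetical helper)."""
    target_re = re.compile(target)
    context_re = re.compile(context)
    out = []
    for i, tok in enumerate(tokens):
        j = i + offset
        m = target_re.fullmatch(tok)
        # The context is checked against the original, unmodified tokens.
        if m and 0 <= j < len(tokens) and context_re.fullmatch(tokens[j]):
            out.append(m.expand(replacement))
        else:
            out.append(tok)
    return out

# ད / ན rule: rewrite the first syllable when the NEXT token is པོ / ཕོ (+ case).
chen_po = substitute_with_context(
    ["ཆེད་", "པོ་"], "(ཅེ|ཆེ)(ད|ན)་", "ཆེན་", 1, "(པོ|ཕོ)(འི|ར|ས)?་?"
)

# ཡི་གེའ rule: rewrite གེའ when the PREVIOUS token is ཡི་.
yi_ge = substitute_with_context(
    ["ཡི་", "གེའ་"], "གེའ(་?)", r"གེ\1", -1, "ཡི་"
)

# Without the required neighbour, the token is left untouched.
untouched = substitute_with_context(
    ["ཆེད་", "དུ་"], "(ཅེ|ཆེ)(ད|ན)་", "ཆེན་", 1, "(པོ|ཕོ)(འི|ར|ས)?་?"
)

print(chen_po)    # ['ཆེན་', 'པོ་']
print(yi_ge)      # ['ཡི་', 'གེ་']
print(untouched)  # ['ཆེད་', 'དུ་']
```

Each rule only inspects one token at a time and reaches for its neighbour through the offset, which is why these versions can fire on tsheg-bar tokens where the original two-syllable patterns could not.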