uhh-lt / amharicprocessor

Amharic Segmenter and tokenizer
MIT License

To review amharicNormalizer.py #2

Open meleayi opened 1 year ago

Hi there,

I was reviewing amharicNormalizer.py and noticed that it has no rule for normalizing spellings such as በልቱዋል or በልቱአል to በልቷል. I believe we should add one. Specifically, I propose adding the following substitutions:

Normalize words written with a second-order character followed by ዋ or አ, such as በልቱዋል or በልቱአል, to the labialized form በልቷል:

    norm=re.sub('(ሉ[ዋአ])','ሏ',norm)
    norm=re.sub('(ሙ[ዋአ])','ሟ',norm)
    norm=re.sub('(ቱ[ዋአ])','ቷ',norm)
    norm=re.sub('(ሩ[ዋአ])','ሯ',norm)
    norm=re.sub('(ሱ[ዋአ])','ሷ',norm)
    norm=re.sub('(ሹ[ዋአ])','ሿ',norm)
    norm=re.sub('(ቁ[ዋአ])','ቋ',norm)
    norm=re.sub('(ቡ[ዋአ])','ቧ',norm)
    norm=re.sub('(ቹ[ዋአ])','ቿ',norm)
    norm=re.sub('(ሁ[ዋአ])','ኋ',norm)
    norm=re.sub('(ኑ[ዋአ])','ኗ',norm)
    norm=re.sub('(ኙ[ዋአ])','ኟ',norm)
    norm=re.sub('(ኩ[ዋአ])','ኳ',norm)
    norm=re.sub('(ዙ[ዋአ])','ዟ',norm)
    norm=re.sub('(ጉ[ዋአ])','ጓ',norm)
    norm=re.sub('(ዱ[ዋአ])','ዷ',norm)
    norm=re.sub('(ጡ[ዋአ])','ጧ',norm)
    norm=re.sub('(ጩ[ዋአ])','ጯ',norm)
    norm=re.sub('(ጹ[ዋአ])','ጿ',norm)
    norm=re.sub('(ፉ[ዋአ])','ፏ',norm)
    norm=re.sub('[ቊ]','ቁ',norm) #ቁ can also be written as ቊ
    norm=re.sub('[ኵ]','ኩ',norm) #ኩ can also be written as ኵ

This should help the normalizer map these spelling variants to a single consistent form.
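For illustration, the same substitutions could also be written as a lookup table with two compiled patterns. This is only a minimal sketch: the names `LABIALIZED`, `VARIANTS`, and `normalize_labialized` are hypothetical and do not exist in amharicNormalizer.py, and every key in the table is assumed to be the second-order (-u) form of its consonant. The order (labialize first, then fold ቊ and ኵ) follows the snippet above.

```python
import re

# Hypothetical table: second-order (-u) character -> labialized (-wa) character.
LABIALIZED = {
    "ሉ": "ሏ", "ሙ": "ሟ", "ቱ": "ቷ", "ሩ": "ሯ", "ሱ": "ሷ",
    "ሹ": "ሿ", "ቁ": "ቋ", "ቡ": "ቧ", "ቹ": "ቿ", "ሁ": "ኋ",
    "ኑ": "ኗ", "ኙ": "ኟ", "ኩ": "ኳ", "ዙ": "ዟ", "ጉ": "ጓ",
    "ዱ": "ዷ", "ጡ": "ጧ", "ጩ": "ጯ", "ጹ": "ጿ", "ፉ": "ፏ",
}

# Alternative spellings folded to their common form.
VARIANTS = {"ቊ": "ቁ", "ኵ": "ኩ"}

# One character class per table; group 1 captures the -u character.
_LABIAL_RE = re.compile("([" + "".join(LABIALIZED) + "])[ዋአ]")
_VARIANT_RE = re.compile("[" + "".join(VARIANTS) + "]")


def normalize_labialized(text):
    """Replace '-u character + ዋ/አ' with the labialized character,
    then fold the ቊ and ኵ variants."""
    text = _LABIAL_RE.sub(lambda m: LABIALIZED[m.group(1)], text)
    return _VARIANT_RE.sub(lambda m: VARIANTS[m.group(0)], text)


if __name__ == "__main__":
    for word in ("በልቱዋል", "በልቱአል"):
        print(word, "->", normalize_labialized(word))
```

Keeping the pairs in a dictionary makes a mismatched row easier to spot than in a long run of near-identical `re.sub` calls, though whether that fits the existing style of the file is a judgment call.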

Regards, Melese.