mesolitica / malaya

Natural Language Toolkit for Malaysian language, https://malaya.readthedocs.io/
MIT License

IndexError: string index out of range in Spelling Correction #136

Closed YuHengKit closed 1 year ago

YuHengKit commented 1 year ago

Hi, I am facing the error below when running the following text and code:

```python
import malaya
text='Apa kena tah shopee problem'

lm = malaya.language_model.kenlm(model = 'bahasa-wiki-news')
corrector = malaya.spelling_correction.probability.load(language_model = lm)
#stemmer = malaya.stem.deep_model('noisy')
#normalizer = malaya.normalize.normalizer(corrector, stemmer)
normalizer = malaya.normalize.normalizer(corrector)
#https://github.com/huseinzol05/malaya/blob/master/example/normalizer/load-normalizer.ipynb
'''
normalize_elongated: bool, optional (default=True)
        if True, `betuii` -> `betui`.
normalize_text: bool, optional (default=True)
        if True, will try to replace shortforms with internal corpus.
'''
final_string=normalizer.normalize(text, normalize_elongated = True, normalize_text = True)
print(final_string['normalize'])
```

```
IndexError Traceback (most recent call last)

in
     13         if True, will try to replace shortforms with internal corpus.
     14 '''
---> 15 final_string=normalizer.normalize(text, normalize_elongated = True, normalize_text = True)
     16 print(final_string['normalize'])
     17 result=final_string['normalize']

~\AppData\Roaming\Python\Python37\site-packages\herpetologist\__init__.py in check(*args, **kwargs)
     98 nested_check(v, p)
     99
--> 100 return func(*args, **kwargs)
    101
    102 return check

~\AppData\Roaming\Python\Python37\site-packages\malaya\normalize.py in normalize(self, string, normalize_text, normalize_url, normalize_email, normalize_year, normalize_telephone, normalize_date, normalize_time, normalize_emoji, normalize_elongated, normalize_hingga, normalize_pada_hari_bulan, normalize_fraction, normalize_money, normalize_units, normalize_percent, normalize_ic, normalize_number, normalize_x_kali, normalize_cardinal, normalize_ordinal, normalize_entity, expand_contractions, check_english_func, check_malay_func, translator, language_detection_word, acceptable_language_detection, segmenter, stemmer, **kwargs)
    883 word, end_result_string = _remove_postfix(word, stemmer=stemmer)
    884 if normalize_text:
--> 885 word, repeat = check_repeat(word)
    886 else:
    887 repeat = 1

~\AppData\Roaming\Python\Python37\site-packages\malaya\normalize.py in check_repeat(word)
    154
    155 def check_repeat(word):
--> 156 if word[-1].isdigit() and not word[-2].isdigit():
    157 repeat = int(word[-1])
    158 word = word[:-1]

IndexError: string index out of range
```

YuHengKit commented 1 year ago

```python
text='Apa kena tah shopee problem'
for word in text.split():
    print(word)
    #print(word[-1])
    if len(word) < 2:
        print('ok')

    if word[-1].isdigit() and not word[-2].isdigit():
        repeat = int(word[-1])
        word = word[:-1]
    else:
        repeat = 1

    if repeat < 1:
        repeat = 1
    print(repeat)
```

I tried to reproduce the function manually, but it does not seem to raise any error.
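
One possible reason the manual loop does not fail: `text.split()` never yields an empty string, and every word in this sentence has at least two characters, so `word[-1]` and `word[-2]` always exist. Inside `normalize`, the traceback shows the word passing through `_remove_postfix` (line 883) right before `check_repeat`, so the word may already be empty or a single character by then. That is an assumption, not something confirmed in this thread. Below is a minimal sketch of a length-guarded check; `check_repeat_guarded` is a hypothetical name, not part of malaya, and the body is reconstructed only from the lines visible in the traceback and the loop above.

```python
def check_repeat_guarded(word):
    """Hypothetical guarded variant of the check_repeat seen in the traceback."""
    # Only read word[-1] and word[-2] when both positions exist.
    if len(word) >= 2 and word[-1].isdigit() and not word[-2].isdigit():
        repeat = int(word[-1])
        word = word[:-1]
    else:
        repeat = 1

    # Same clamp as in the manual loop: never repeat fewer than once.
    if repeat < 1:
        repeat = 1
    return word, repeat


# An empty or one-character string is what breaks the unguarded check.
for candidate in ['', '2', 'kata2', 'shopee']:
    try:
        candidate[-1].isdigit() and not candidate[-2].isdigit()
        unguarded = 'ok'
    except IndexError as error:
        unguarded = 'IndexError: ' + str(error)
    print(repr(candidate), unguarded, check_repeat_guarded(candidate))
```

With the guard in place, empty and single-character inputs fall through to `repeat = 1` instead of raising, while a word such as `kata2` is still split into the word and its repeat count.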

huseinzol05 commented 1 year ago

What is your malaya version?

YuHengKit commented 1 year ago

It is 5.0.

huseinzol05 commented 1 year ago

I will look into this ASAP; I have some other things that need to be done first.

huseinzol05 commented 1 year ago

Have you tried abstractive normalization? https://malaya.readthedocs.io/en/stable/load-normalizer-abstractive.html

YuHengKit commented 1 year ago

I will look into this. Just a side question: do you know of any English version of this abstractive normalization? I need to do some comparisons.

Thank you

YuHengKit commented 1 year ago

Just a quick update: the issue seems to be resolved now.