undertheseanlp / underthesea

Underthesea - Vietnamese NLP Toolkit
http://undertheseanlp.com
GNU General Public License v3.0

Panic in word_tokenize #684

yc-lim-boop closed this issue 1 year ago

yc-lim-boop commented 1 year ago

underthesea version: 6.2.0

text_normalize inserts a space for "Đaị", which causes word_tokenize to panic:

>>> underthesea.text_normalize('đaị')
'đại '
>>> underthesea.word_tokenize('đaị')
['đại ']
>>> underthesea.text_normalize('Đaị')
'Đại '
>>> underthesea.text_normalize('Đaị Việt')
'Đại  Việt'
>>> underthesea.word_tokenize('Đaị')
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', src/featurizers.rs:151:63
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/envs/underthesea/lib/python3.8/site-packages/underthesea/pipeline/word_tokenize/__init__.py", line 36, in word_tokenize
    output = crf_model.predict(tokens, format)
  File "/home/user/envs/underthesea/lib/python3.8/site-packages/underthesea/pipeline/word_tokenize/model.py", line 50, in predict
    x = crf_featurizer.process([tokens])[0]
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

The space seems to be inserted by the token_map in text_normalizer. Is this the expected behaviour?

>>> underthesea.pipeline.text_normalize.text_normalizer.token_map['Đaị']
'Đại '
>>> underthesea.pipeline.text_normalize.text_normalizer.token_map['đaị']
'đại '
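
Patching the map values in place would presumably also avoid the extra space, though I went with a tokenizer-level fix below. A rough, untested sketch, assuming token_map is an ordinary dict of strings:

from underthesea.pipeline.text_normalize import text_normalizer

# strip stray leading/trailing whitespace from every normalized form in the map
for key, value in list(text_normalizer.token_map.items()):
    text_normalizer.token_map[key] = value.strip()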

As a workaround, if whitespace is stripped from the tokens returned by text_normalize, word_tokenize works:

from underthesea.pipeline.word_tokenize.regex_tokenize import tokenize
from underthesea.pipeline.word_tokenize.model import CRFModel

def word_tokenize_fixed(sentence, format=None, use_token_normalize=True, fixed_words=[]):
    # modified from `underthesea.word_tokenize`
    tokens = tokenize(sentence, use_token_normalize=use_token_normalize, fixed_words=fixed_words)
    tokens = [token.strip() for token in tokens]   # Added line
    crf_model = CRFModel.instance()
    output = crf_model.predict(tokens, format)
    tokens = [token[0] for token in output]
    tags = [token[1] for token in output]
    output = []
    num_words = 0
    for tag, token in zip(tags, tokens):
        if tag == "I-W" and num_words > 0:
            output[-1] = output[-1] + u" " + token
        else:
            output.append(token)
        num_words += 1
    if format == "text":
        output = u" ".join([item.replace(" ", "_") for item in output])
    return output

>>> word_tokenize_fixed('đaị')
['đại']
>>> word_tokenize_fixed('Đaị')
['Đại']
qhungbui7 commented 1 year ago

Getting the same error, have you found a way to solve it?

yc-lim-boop commented 1 year ago

For now I'm just using my own version of word_tokenize, as shown above, which strips whitespace from the initial tokens.

qhungbui7 commented 1 year ago

My case is a little different from yours: the error shows up when I pass a fixed vocabulary. I'm still looking for a "legal" solution, but yes, the error goes away once I strip whitespace from my vocabulary.
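
For completeness, this is roughly what I mean by stripping the vocabulary (the word list and sentence below are just made-up examples):

import underthesea

# hypothetical fixed-vocabulary entries that happen to carry stray whitespace
fixed_words = ["Đại Việt ", " học sinh"]

# stripping each entry before passing it in is what avoids the panic for me
fixed_words = [w.strip() for w in fixed_words]

print(underthesea.word_tokenize("Đại Việt có nhiều học sinh", fixed_words=fixed_words))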

rain1024 commented 1 year ago

@qhungbui7 @yc-lim-boop I've just rolled out version 6.5.0, which addresses this error.

Thank you guys so much for your feedback and suggestions.
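
After upgrading to 6.5.0, the snippet from the original report should run without the panic (expected result shown, based on the normalization discussed above):

>>> import underthesea
>>> underthesea.word_tokenize('Đaị')   # raised PanicException on 6.2.0
['Đại']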
