stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Vietnamese tokenizer fails on sentences beginning with punctuation. #217

Closed aryamccarthy closed 4 years ago

aryamccarthy commented 4 years ago

If there's leading punctuation in the string, the Vietnamese tokenizer raises an AssertionError.

To Reproduce

Steps to reproduce the behavior:

import stanza
stanza.download(lang="vi")
s = stanza.Pipeline(lang="vi")
s("- tuyệt vời!")

This gives:

Traceback (most recent call last):
    ...
    doc = self.nlp(s)
  File "/Users/arya/anaconda3/envs/staple/lib/python3.7/site-packages/stanza/pipeline/core.py", line 173, in __call__
    doc = self.process(doc)
  File "/Users/arya/anaconda3/envs/staple/lib/python3.7/site-packages/stanza/pipeline/core.py", line 167, in process
    doc = self.processors[processor_name].process(doc)
  File "/Users/arya/anaconda3/envs/staple/lib/python3.7/site-packages/stanza/pipeline/tokenize_processor.py", line 81, in process
    data = paras_to_chunks(text, dummy_labels)
  File "/Users/arya/anaconda3/envs/staple/lib/python3.7/site-packages/stanza/utils/postprocess_vietnamese_tokenizer_data.py", line 38, in paras_to_chunks
    return [para_to_chunks(re.sub('\s', ' ', pt.rstrip()), pc) for pt, pc in zip(text.split('\n\n'), char_level_pred.split('\n\n'))]
  File "/Users/arya/anaconda3/envs/staple/lib/python3.7/site-packages/stanza/utils/postprocess_vietnamese_tokenizer_data.py", line 38, in <listcomp>
    return [para_to_chunks(re.sub('\s', ' ', pt.rstrip()), pc) for pt, pc in zip(text.split('\n\n'), char_level_pred.split('\n\n'))]
  File "/Users/arya/anaconda3/envs/staple/lib/python3.7/site-packages/stanza/utils/postprocess_vietnamese_tokenizer_data.py", line 24, in para_to_chunks
    assert len(lastpred) > 0
AssertionError

Expected behavior

The pipeline should tokenize the string instead of crashing.


Additional context

The issue stems from lastpred being empty, because this line is never reached: https://github.com/stanfordnlp/stanza/blob/b73fb996b1cc2ea339acf4668f484a9c3e298434/stanza/utils/postprocess_vietnamese_tokenizer_data.py#L29
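For illustration, here is a simplified, hypothetical sketch of the failure pattern (this is not the actual stanza code; the function name and logic are my own reconstruction): the last prediction label is only recorded for word characters, so a paragraph that opens with punctuation reaches the space-boundary assertion while lastpred is still empty.

```python
def chunk_preds_sketch(text, preds):
    """Hypothetical simplification of the failing logic, for illustration only."""
    lastpred = ''  # last character-level prediction seen for a word character
    chunks = []
    for ch, pred in zip(text, preds):
        if ch == ' ':
            # At a space boundary the code assumes a word character was
            # already seen -- false when the paragraph starts with punctuation.
            assert len(lastpred) > 0
            chunks.append(lastpred)
            lastpred = ''
        elif ch.isalpha():
            lastpred = pred
    return chunks

# chunk_preds_sketch("- tuyệt vời!", "000100000010") raises AssertionError,
# while the same text without leading punctuation chunks normally.
```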

This issue also affects StanfordNLP.

qipeng commented 4 years ago

Thank you for reporting this bug, @aryamccarthy! I can reproduce it locally.

This fix, 5e2d0ef (on the dev branch), seems to resolve the issue. The assertion is probably legacy code that checked the consistency of the CoNLL Shared Task data, which happened never to trigger this case.

For now, you can check out the code, apply this patch on master (we don't recommend using dev unless you're developing, as it could be unstable and/or contain undocumented changes), and then run pip install -e . to use it. We'll include this bugfix in a release in the near future!
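Concretely, the suggested workaround might look roughly like this (the cherry-pick workflow is an assumption; only the commit id 5e2d0ef comes from the thread, so adapt the steps to your setup):

```shell
# Rough sketch of the workaround: apply the dev-branch fix onto master
git clone https://github.com/stanfordnlp/stanza.git
cd stanza                     # stay on master
git cherry-pick 5e2d0ef       # apply the fix commit from dev
pip install -e .              # install the patched checkout in editable mode
```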

aryamccarthy commented 4 years ago

Haha, deleting the assert seems straightforward enough. Thanks for wrapping this up! Hope all is well since our meeting at Disneyland.
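In that spirit, a minimal, hypothetical sketch (not the actual patch) of what replacing the assertion with a guard looks like: a leading-punctuation paragraph now simply produces no chunk at the first space boundary instead of crashing.

```python
def chunk_preds_fixed_sketch(text, preds):
    """Hypothetical sketch: the space-boundary assert becomes a guard."""
    lastpred = ''  # last character-level prediction seen for a word character
    chunks = []
    for ch, pred in zip(text, preds):
        if ch == ' ':
            if lastpred:  # was: assert len(lastpred) > 0
                chunks.append(lastpred)
            lastpred = ''
        elif ch.isalpha():
            lastpred = pred
    return chunks
```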

yuhaozhang commented 4 years ago

This fix is now in our v1.0.1 release. Closing this issue now.