Closed aryamccarthy closed 4 years ago
Thank you for reporting this bug, @aryamccarthy! I can reproduce it locally.
This fix, 5e2d0ef (on the dev branch), seems to resolve this issue. The failing assertion is probably legacy code that checked the consistency of the CoNLL Shared Task data, which never happened to trigger this issue.
For now, you can check out the code and apply this patch on master (we don't recommend using dev unless you're developing, as it could be unstable and/or contain undocumented changes), then run pip install -e . to use it. We'll include this bugfix in a release in the near future!
Haha, deleting the assert seems straightforward enough. Thanks for wrapping this up! Hope all is well since our meeting at Disneyland.
This fix is now in our v1.0.1 release. Closing this issue now.
If there's leading punctuation in the string, the Vietnamese tokenizer raises an AssertionError.
To Reproduce
Steps to reproduce the behavior:
This gives:
Expected behavior
Not crashing.
Environment (please complete the following information):
Additional context
The issue stems from lastpred being empty because execution never reaches this line: https://github.com/stanfordnlp/stanza/blob/b73fb996b1cc2ea339acf4668f484a9c3e298434/stanza/utils/postprocess_vietnamese_tokenizer_data.py#L29
This issue also affects StanfordNLP.
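To illustrate the kind of failure described above, here is a simplified, hypothetical sketch. It is not the stanza source: the function name chunk_with_labels, the legacy_check flag, and the label strings are all assumptions made up for illustration. It only shows how a state variable like lastpred can still be empty when an assertion runs if the input starts with punctuation, and why dropping the assertion lets the input through.

```python
def chunk_with_labels(text, char_labels, legacy_check=True):
    """Toy chunker (hypothetical, NOT the stanza implementation).

    Groups alphanumeric characters into word chunks and emits each
    punctuation character as its own chunk, pairing every chunk with
    an integer label taken from char_labels.
    """
    chunks, labels = [], []
    word = ""
    lastpred = ""  # label of the most recent word character; empty until one is seen
    for ch, label in zip(text, char_labels):
        if ch.isalnum():
            word += ch
            lastpred = label  # the only place lastpred is ever set
            continue
        if word:
            # A non-word character ends the pending word.
            chunks.append(word)
            labels.append(int(lastpred))
            word = ""
        if not ch.isspace():
            if legacy_check:
                # Assumes punctuation always follows some word character.
                # Leading punctuation breaks that assumption.
                assert len(lastpred) > 0, "lastpred is still empty"
            chunks.append(ch)
            labels.append(int(label))
    if word:
        chunks.append(word)
        labels.append(int(lastpred))
    return list(zip(chunks, labels))


# Punctuation after a word passes the check.
print(chunk_with_labels("hi!", "001"))

# Leading punctuation trips the assertion in this toy version.
try:
    chunk_with_labels('"hi"', "0001")
except AssertionError as err:
    print("AssertionError:", err)

# Without the check, the same input is chunked normally.
print(chunk_with_labels('"hi"', "0001", legacy_check=False))
```

In this sketch the assertion encodes an assumption about the input (a word always precedes punctuation) rather than a real invariant of the algorithm, so removing it does not change the output for inputs that already passed.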