A bug when using word_tokenize

baominhlt commented 11 months ago

Hi undertheseanlp,

Your work and passion are great. I'm a big fan, and I have used the underthesea package extensively in my NLP projects. Recently I ran into a bug in the word_tokenize module. The details:

>>> word_tokenize("Phổ là bang lớn nhất và mạnh nhất trong Liên bang Đức (chiếm 61% dân số và 64% lãnh thổ).")

The result is:

['Phổ', 'là', 'bang', 'lớn', 'nhất', 'và', 'mạnh', 'nhất', 'trong', 'Liên bang', 'Đức', '(', 'chiếm', '61 %', 'dân số', 'và', '64', '%', 'lãnh thổ', ')', '.']

Here '61 %' is returned as a single token, while '64' and '%' are split into two. When I change 61% to another number, the bug does not occur.

Another case:

>>> word_tokenize("Thời Trần, những người đứng đầu xã được gọi là Xã quan.")

The result is:

['Thời', 'Trần ,', 'những', 'người', 'đứng', 'đầu', 'xã', 'được', 'gọi là', 'Xã quan', '.']

The version of underthesea:

Name: underthesea
Version: 6.5.0
Summary: Vietnamese NLP Toolkit
Home-page: https://github.com/undertheseanlp/underthesea
Author: Vu Anh
Author-email: anhv.ict91@gmail.com
License: GNU General Public License v3
Location: /home/minhltb/miniconda3/envs/nlp/lib/python3.9/site-packages
Requires: Click, joblib, nltk, python-crfsuite, PyYAML, requests, scikit-learn, tqdm, underthesea-core
Required-by:

I will try my best to support your work. Sincerely, baominhlt
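For anyone triaging similar reports, the two outputs above can be scanned automatically. The following is a minimal sketch, not part of underthesea: the helper name `glued_punctuation` is hypothetical, and the rule (flag any token that contains both a punctuation character and a letter or digit) is an assumed heuristic that will also flag legitimate tokens such as abbreviations or decimal numbers.

```python
import unicodedata


def glued_punctuation(tokens):
    """Return tokens that mix punctuation with letters or digits.

    Hypothetical helper for illustration; it is a heuristic, not an
    underthesea API.
    """
    flagged = []
    for token in tokens:
        # Unicode general categories starting with "P" are punctuation.
        has_punct = any(unicodedata.category(ch).startswith("P") for ch in token)
        has_alnum = any(ch.isalnum() for ch in token)
        if has_punct and has_alnum:
            flagged.append(token)
    return flagged


# Token lists copied from the report above (underthesea 6.5.0).
first = ['Phổ', 'là', 'bang', 'lớn', 'nhất', 'và', 'mạnh', 'nhất', 'trong',
         'Liên bang', 'Đức', '(', 'chiếm', '61 %', 'dân số', 'và', '64', '%',
         'lãnh thổ', ')', '.']
second = ['Thời', 'Trần ,', 'những', 'người', 'đứng', 'đầu', 'xã', 'được',
          'gọi là', 'Xã quan', '.']

print(glued_punctuation(first))   # ['61 %']
print(glued_punctuation(second))  # ['Trần ,']
```

Multi-syllable words such as 'Liên bang' are not flagged because they contain a space but no punctuation.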

rain1024 commented 11 months ago

@baominhlt Thank you for pointing that out. I've reviewed the training data and discovered inconsistencies with punctuation labels. I'll correct these errors, retrain the model, and release an updated version to address the issue.

baominhlt commented 11 months ago

I am glad to be able to help you. I am looking forward to using the latest update of underthesea.

rain1024 commented 11 months ago

@baominhlt I'm updating underthesea with a newly trained model, still using the VLSP2013 Word Tokenize dataset. I've swapped out wt_crf_2018_09_13.bin for ws_crf_vlsp2013_20230727. It's incredible to see the progress over 5 years. I'll be releasing underthesea version 6.6.0 shortly. Stay tuned!

Update 2023-07-23: I've released version 6.6.0 of Underthesea. It seems to work well for your two sentences.

However, I'm not certain if it resolves all issues related to punctuation. Please test it out and provide feedback on any problems you encounter.

Thank you very much for your feedback.
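Before retesting, it may help to confirm that the environment actually has 6.6.0 or later installed. The sketch below assumes plain numeric version strings; the helper names `version_tuple` and `has_fix` are hypothetical. It reads the installed version with the standard library's `importlib.metadata` and, only if underthesea can be imported, reruns the two sentences from the report.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn a string like '6.5.0' into (6, 5, 0); assumes numeric parts."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def has_fix(installed, fixed="6.6.0"):
    """True if the installed version is at least the release with the new model."""
    return version_tuple(installed) >= version_tuple(fixed)


if __name__ == "__main__":
    try:
        installed = metadata.version("underthesea")
    except metadata.PackageNotFoundError:
        installed = None

    if installed is None:
        print("underthesea is not installed in this environment")
    else:
        print(f"underthesea {installed}, has fix: {has_fix(installed)}")
        try:
            from underthesea import word_tokenize
        except ImportError:
            word_tokenize = None
        if word_tokenize is not None:
            # The two sentences from the original report.
            sentences = [
                "Phổ là bang lớn nhất và mạnh nhất trong Liên bang Đức "
                "(chiếm 61% dân số và 64% lãnh thổ).",
                "Thời Trần, những người đứng đầu xã được gọi là Xã quan.",
            ]
            for sentence in sentences:
                print(word_tokenize(sentence))
```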

A bug when using word_tokenize #696