Issue when tokenize string contains single quote character.

vncorenlp / VnCoreNLP

A Vietnamese natural language processing toolkit (NAACL 2018)


Issue when tokenize string contains single quote character. #42

Closed hoathienvu8x closed 2 years ago

hoathienvu8x commented 2 years ago

There is an issue when tokenizing a string that contains a single quote character. I have this string: "Nghiên cứu Đại học King's College London và Đại học Leiceste đăng trên Tạp chí Di truyền Con người Mỹ năm 2021 cũng cho thấy"

When it is run through VnCoreNLP, the result is:

[
    "Nghiên",
    "cứu",
    "Đại_học",
    "King",
    "'",
    "s",
    "College_London",
    "và",
    "Đại_học",
    "Leiceste",
    "đăng",
    "trên",
    "Tạp_chí",
    "Di_truyền",
    "Con_người",
    "Mỹ",
    "năm",
    "2021",
    "cũng",
    "cho",
    "thấy"
]

Another sample string: "H'Hen Niê là một hoa hậu và người mẫu người Việt Nam."

Result:

[
    "H",
    "'",
    "Hen_Niê",
    "là",
    "một",
    "hoa_hậu",
    "và",
    "người_mẫu",
    "người",
    "Việt_Nam",
    "."
]

datquocnguyen commented 2 years ago

Thanks. The tokenizer is not perfect.

hoathienvu8x commented 2 years ago

@datquocnguyen I think the single quote character is being split off by the PUNCTUATION regex. It should be treated as punctuation only when it is preceded or followed by a space.

hoathienvu8x commented 2 years ago

--- src/main/java/vn/corenlp/tokenizer/Tokenizer.java   2022-06-29 15:25:05.204820143 +0700
+++ src/main/java/vn/corenlp/tokenizer/Tokenizer.java   2022-06-29 15:25:18.268773361 +0700
@@ -315,6 +315,8 @@
     public static final String NUMBERS_EXPRESSION = NUMBER + "([\\+\\-\\*\\/]" + NUMBER + ")*";

     public static final String SHORT_NAME = "([\\p{L}]+([\\.\\-][\\p{L}]+)+)|([\\p{L}]+-\\d+)";
+    
+    public static final String NAME_HAS_PUNCTUATION = "([\\p{L}]+(['’][\\p{L}]+)+)";

     public static final String WORD_WITH_HYPHEN = "\\p{L}+-\\p{L}+(-\\p{L}+)*";

@@ -360,6 +362,9 @@
             regexes.add(SHORT_NAME);
             regexIndex.add("SHORT_NAME");

+            regexes.add(NAME_HAS_PUNCTUATION);
+            regexIndex.add("NAME_HAS_PUNCTUATION");
+
             regexes.add(NUMBERS_EXPRESSION);
             regexIndex.add("NUMBERS_EXPRESSION");
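
To show what the proposed NAME_HAS_PUNCTUATION pattern matches, here is a minimal standalone sketch using only java.util.regex. It does not run VnCoreNLP's Tokenizer; the class name ApostropheNameDemo and the helper findNames are hypothetical and exist only for this illustration. The pattern is the one from the patch, with the curly apostrophe written as the escape \u2019 so the file does not depend on source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of VnCoreNLP.
class ApostropheNameDemo {
    // Pattern from the patch: letters, then one or more groups of
    // (straight or curly apostrophe followed by letters).
    static final Pattern NAME_HAS_PUNCTUATION =
            Pattern.compile("([\\p{L}]+(['\u2019][\\p{L}]+)+)");

    // Hypothetical helper: collects every substring the pattern matches.
    static List<String> findNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME_HAS_PUNCTUATION.matcher(text);
        while (m.find()) {
            names.add(m.group());
        }
        return names;
    }

    public static void main(String[] args) {
        // Apostrophe between letters: matched as one unit.
        System.out.println(findNames("King's College London"));
        System.out.println(findNames("H'Hen"));
        // Apostrophes next to a space or at the string edge: no match,
        // so they would still be left to the punctuation rule.
        System.out.println(findNames("'quoted' text"));
    }
}
```

In this sketch an apostrophe is kept only when it has letters on both sides, which matches the behaviour suggested above: a quote next to a space is still handled as punctuation. Whether the tokenizer then emits "King's" and "H'Hen" as single tokens depends on where the new regex sits in the regexes list, which this sketch does not model.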