undertheseanlp / underthesea

Underthesea - Vietnamese NLP Toolkit
http://undertheseanlp.com
GNU General Public License v3.0

Regex inside underthesea does not match entire words only #686

Closed qhungbui7 closed 1 year ago

qhungbui7 commented 1 year ago

I have this script:

word_tokenize("chien thanggggg chien thang", use_token_normalize=True, fixed_words=['chien thang'])

The result:

['chien thang', 'gggg', 'chien thang']

I want the tokenizer to match only entire words. My expected output:

['chien', 'thanggggg', 'chien thang']
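
To illustrate the root cause with plain `re` (independent of underthesea's internals): the fixed-word pattern matches a prefix of the longer word unless word boundaries are added. A minimal sketch:

import re

text = "chien thanggggg chien thang"
without_boundary = re.compile(r"chien\ thang", re.VERBOSE)
with_boundary = re.compile(r"\bchien\ thang\b", re.VERBOSE)

print([m.group() for m in without_boundary.finditer(text)])  # ['chien thang', 'chien thang'] -- first match is a prefix of "chien thanggggg"
print([m.group() for m in with_boundary.finditer(text)])     # ['chien thang'] -- only the standalone phrase matches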

I have made a change in the source code of the tokenize function, adding word boundaries to the regex logic:

def tokenize(text, format=None, tag=False, use_character_normalize=True, use_token_normalize=True, fixed_words=[]):
    """
    tokenize text for word segmentation

    Args:
        text: input text to tokenize
        use_token_normalize: use token normalize or not
        use_character_normalize: use character normalize or not
        tag: return token with tag or not
        format: format of result, default is None
        fixed_words: list of multi-word phrases that should be kept as single tokens
    """
    global recompile_regex_patterns
    global patterns
    if len(fixed_words) > 0:
        compiled_fixed_words = [re.sub(' ', r'\ ', fixed_word) for fixed_word in fixed_words]  # escape spaces so re.VERBOSE does not drop them
        fixed_words_pattern = "(?P<fixed_words>\\b" + "\\b|\\b".join(compiled_fixed_words) + "\\b)" # fix here, add the \b
        merged_regex_patterns = [fixed_words_pattern] + regex_patterns
        regex_patterns_combine = r"(" + "|".join(merged_regex_patterns) + ")"
        patterns = re.compile(regex_patterns_combine, re.VERBOSE | re.UNICODE)
        recompile_regex_patterns = True
    if use_character_normalize:
        text = normalize_characters_in_text(text)
    matches = [m for m in re.finditer(patterns, text)]
    tokens = [extract_match(m) for m in matches]

    if tag:
        return tokens

    tokens = [token[0] for token in tokens]
    if use_token_normalize:
        tokens = [token_normalize(_, use_character_normalize=use_character_normalize) for _ in tokens]

    if format == "text":
        return " ".join(tokens)

    return tokens

With this change it works correctly in my cases, but I still wonder whether it breaks the regex logic in other cases.
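
In case it helps, here is a sketch of the test cases I have in mind (the expected outputs are the ones from this issue, and I am assuming the patched regex behaves as shown above):

from underthesea import word_tokenize

def test_fixed_words_match_whole_words_only():
    # the phrase embedded in a longer word must not be merged
    tokens = word_tokenize(
        "chien thanggggg chien thang",
        use_token_normalize=True,
        fixed_words=["chien thang"],
    )
    assert tokens == ["chien", "thanggggg", "chien thang"]

def test_fixed_words_still_merge_standalone_phrase():
    # the standalone phrase is still kept as a single token
    assert word_tokenize("chien thang", fixed_words=["chien thang"]) == ["chien thang"]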

Could you enable this behaviour for the word tokenizer? Thank you, I really appreciate your work.

rain1024 commented 1 year ago

@qhungbui7 thanks for your feedback. You're absolutely right.

Don't hesitate to create a fresh pull request featuring your updated code along with the appropriate test cases.

rain1024 commented 1 year ago

@qhungbui7 I've just rolled out version 6.4.0, which addresses this issue.

Your feedback and suggestions are greatly appreciated, thank you!
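
With 6.4.0 installed, the example from this issue should now give the expected segmentation. A usage sketch (the expected output is the one requested in this issue, assuming the released fix behaves as described):

# pip install "underthesea>=6.4.0"
from underthesea import word_tokenize

tokens = word_tokenize(
    "chien thanggggg chien thang",
    use_token_normalize=True,
    fixed_words=["chien thang"],
)
print(tokens)  # expected: ['chien', 'thanggggg', 'chien thang']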

[Screenshot attached: 2023-07-14 at 10:16:29]