taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License

Improving the handling of numerals of nagisa's word tokenizer #9

Closed BLKSerene closed 5 years ago

BLKSerene commented 5 years ago

I'm using nagisa v0.1.1. There are some problems with the tokenizer's handling of numerals: numbers and decimals are split into single characters, each tagged as "名詞":

357 -> 3_名詞 5_名詞 7_名詞 # Numbers
1.48 -> 1_名詞 ._名詞 4_名詞 8_名詞 # Decimals
$5.5 -> $_補助記号 5_名詞 ._補助記号 5_名詞 # Numbers with currency symbols (and other symbols)
133-1111-2222 -> 1_名詞 3_名詞 3_名詞 -_補助記号 1_名詞 1_名詞 1_名詞 1_名詞 -_補助記号 2_名詞 2_名詞 2_名詞 2_名詞 # Phone numbers

and so on. Is it possible to improve this?

taishi-i commented 5 years ago

Hi @BLKSerene

The over-tokenization happens because many numerals appear in the training data as single characters tagged "名詞". The word segmentation and POS-tagging model in nagisa learns this pattern, so numerals in text are split into single characters, each tagged "名詞".

Since it is difficult to modify the training data, I recommend using the following post-processing function, concat_numeric_chars. It concatenates consecutive numerals and symbols into a single word tagged "数詞" ("数詞" means "numeral" in Japanese). To avoid the over-tokenization problem, please try this approach.

import nagisa

def concat_numeric_chars(words, postags, num_postag="数詞"):
    out_words = []                                                      
    out_postags = []                                                    
    substring = []                                                      
    for word, postag in zip(words, postags):                            
        # Digits, symbols tagged "補助記号", and "." all extend the current run
        if word.isnumeric() or postag == "補助記号" or word == ".":
            substring.append(word)                                      
        else:                                                           
            if len(substring) > 0:                                      
                out_words.append("".join(substring))                    
                out_postags.append(num_postag)                          
                substring = []                                          
            out_words.append(word)                                      
            out_postags.append(postag)                                  

    if len(substring) > 0:                                              
        out_words.append("".join(substring))                            
        out_postags.append(num_postag)                                  

    return out_words, out_postags                                       

def main():                                                             
    # Numbers                                                           
    text = "357"                                                        
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['357'] ['数詞']                          

    # Decimals                                                          
    text = "1.48"                                                       
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['1.48'] ['数詞']                         

    # Numbers with currency symbols (and other symbols)                 
    text = "$5.5"                                                       
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['$5.5'] ['数詞']                         

    # Phone numbers                                                     
    text = "133-1111-2222"                                              
    tokens = nagisa.tagging(text)                                       
    words, postags = concat_numeric_chars(tokens.words, tokens.postags) 
    print(words, postags) #=> ['133-1111-2222'] ['数詞']                

if __name__ == "__main__":                                              
    main()
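
One caveat with the function above: it treats every token tagged "補助記号" as part of a numeral, so a run of symbols with no digit in it (for example, a sentence-final "。" if the tagger labels it "補助記号") would also come out tagged "数詞". The following is a minimal sketch of a stricter variant, not part of nagisa itself. The name concat_numeric_runs and the symbol_postag parameter are hypothetical. It merges a run only up to its last digit token and leaves trailing symbols with their original tags. Leading symbols such as "$" are still merged, which may or may not be what you want.

```python
def concat_numeric_runs(words, postags, num_postag="数詞", symbol_postag="補助記号"):
    """Merge digit/symbol runs into one token, but only up to the last digit."""
    out_words, out_postags = [], []
    run = []  # pending (word, postag) pairs that may form a numeral

    def flush():
        # Peel off trailing non-digit tokens; they are not part of the numeral.
        tail = []
        while run and not run[-1][0].isnumeric():
            tail.append(run.pop())
        # Whatever remains ends in a digit token, so emit it as one numeral.
        if run:
            out_words.append("".join(w for w, _ in run))
            out_postags.append(num_postag)
        # Re-emit the peeled tokens in their original order and with original tags.
        for w, p in reversed(tail):
            out_words.append(w)
            out_postags.append(p)
        run.clear()

    for word, postag in zip(words, postags):
        if word.isnumeric() or postag == symbol_postag or word == ".":
            run.append((word, postag))
        else:
            flush()
            out_words.append(word)
            out_postags.append(postag)
    flush()
    return out_words, out_postags
```

It takes the same inputs as concat_numeric_chars, so it can be called as concat_numeric_runs(tokens.words, tokens.postags). Note that str.isnumeric() is also true for kanji numerals such as "三", so those are merged as well, just as in the original function.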

BLKSerene commented 5 years ago

Thanks, it works.

taishi-i commented 5 years ago

Thanks for the feedback.

Subrata15 commented 3 years ago

Thank you, that works for me!