Closed BLKSerene closed 5 years ago
Hi @BLKSerene
The over-tokenization happens because many numerals appear in the training data as single characters tagged "名詞". The word-segmentation and POS-tagging model in nagisa learns that pattern, so numerals in text are tagged as single characters with "名詞".
Since it is difficult to modify the training data, I recommend the following post-processing function, concat_numeric_chars. It concatenates consecutive numerals and symbols into a single word tagged "数詞" ("数詞" means numeral in Japanese). To avoid the over-tokenization problem, please try this approach.
```python
import nagisa


def concat_numeric_chars(words, postags, num_postag="数詞"):
    out_words = []
    out_postags = []
    substring = []
    for word, postag in zip(words, postags):
        # Collect digits, symbols ("補助記号") and decimal points into a run.
        if word.isnumeric() or postag == "補助記号" or word == ".":
            substring.append(word)
        else:
            # A non-numeric token ends the run: flush it as one "数詞" word.
            if len(substring) > 0:
                out_words.append("".join(substring))
                out_postags.append(num_postag)
                substring = []
            out_words.append(word)
            out_postags.append(postag)
    # Flush a trailing numeric run at the end of the sentence.
    if len(substring) > 0:
        out_words.append("".join(substring))
        out_postags.append(num_postag)
    return out_words, out_postags


def main():
    # Numbers
    text = "357"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['357'] ['数詞']

    # Decimals
    text = "1.48"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['1.48'] ['数詞']

    # Numbers with currency symbols (and other symbols)
    text = "$5.5"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['$5.5'] ['数詞']

    # Phone numbers
    text = "133-1111-2222"
    tokens = nagisa.tagging(text)
    words, postags = concat_numeric_chars(tokens.words, tokens.postags)
    print(words, postags)  #=> ['133-1111-2222'] ['数詞']


if __name__ == "__main__":
    main()
```
Thanks, it works.
Thanks for the feedback.
Thank you, that works for me!
I'm using nagisa v0.1.1. There are some problems with the tokenizer's handling of numerals: numbers and decimals are split into single characters and tagged as "名詞":

```
357           -> 3_名詞 5_名詞 7_名詞                                                # Numbers
1.48          -> 1_名詞 ._名詞 4_名詞 8_名詞                                          # Decimals
$5.5          -> $_補助記号 5_名詞 ._補助記号 5_名詞                                   # Numbers with currency symbols (and other symbols)
133-1111-2222 -> 1_名詞 3_名詞 3_名詞 -_補助記号 1_名詞 1_名詞 1_名詞 1_名詞 -_補助記号 2_名詞 2_名詞 2_名詞 2_名詞  # Phone numbers
```

etc. Is it possible to improve this?