taishi-i / nagisa

A Japanese tokenizer based on recurrent neural networks
https://huggingface.co/spaces/taishi-i/nagisa-demo
MIT License

Nagisa changes Japanese zenkaku to hankaku #36

Closed OishiUnagi closed 2 months ago

OishiUnagi commented 2 months ago

Hi Taishi-i. When I use Nagisa to tokenize Japanese text, it automatically converts zenkaku (full-width) symbols to hankaku (half-width). For example: "(" → "(", ")" → ")", "〜" → "~". Can you tell me how to keep the zenkaku symbols after tokenizing Japanese text?

taishi-i commented 2 months ago

Hi @OishiUnagi ! Thank you for using nagisa. The conversion of zenkaku symbols to hankaku is due to the use of unicodedata.normalize('NFKC', text), and this is a hardcoded feature. Therefore, it is not currently possible to retain specific zenkaku symbols. I apologize for the inconvenience.

nagisa applies this kind of preprocessing internally, as the following snippet illustrates.

import unicodedata

text = "("
normalized_text = unicodedata.normalize('NFKC', text)
print(normalized_text)
# (
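
To see which characters this normalization affects, a minimal sketch using only the standard library may help. The code points are written as escapes so the full-width forms are unambiguous; the choice of characters is illustrative and is not taken from nagisa's source.

```python
import unicodedata

# Full-width (zenkaku) characters, written as escapes to avoid ambiguity.
samples = {
    "FULLWIDTH LEFT PARENTHESIS": "\uff08",
    "FULLWIDTH RIGHT PARENTHESIS": "\uff09",
    "FULLWIDTH TILDE": "\uff5e",
    "IDEOGRAPHIC SPACE": "\u3000",
}

# NFKC replaces each compatibility character with its plain equivalent.
normalized = {
    name: unicodedata.normalize("NFKC", ch) for name, ch in samples.items()
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> U+{ord(normalized[name]):04X}")
```

Each of these maps to its ASCII counterpart, which is why full-width symbols do not survive a tokenizer that normalizes with NFKC first.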

However, by combining preprocessing and postprocessing, it is possible to retain zenkaku symbols.

import nagisa

text = "これは(サンプル)です〜。これは(サンプル)です~。"

chars_to_placeholders = {"(": "ZENKAKUA", ")": "ZENKAKUB"}

for char, placeholder in chars_to_placeholders.items():
    text = text.replace(char, placeholder)

words = nagisa.tagging(text)
words_list = words.words

processed_words = list(words_list)
for char, placeholder in chars_to_placeholders.items():
    processed_words = [w.replace(placeholder, char) for w in processed_words]

print(processed_words)
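
The same replace-then-restore idea can be sketched as two small helpers that work with any tokenizer. This is only an illustration: `protect`, `restore`, `tokenize_preserving`, and `stand_in_tokenizer` are hypothetical names, and the stand-in tokenizer merely imitates the NFKC step so the example runs without nagisa installed.

```python
import re
import unicodedata

# Hypothetical mapping: full-width characters to protect -> ASCII placeholders.
PLACEHOLDERS = {"\uff08": "ZENKAKUA", "\uff09": "ZENKAKUB"}


def protect(text, mapping):
    """Replace each protected character with its placeholder."""
    for char, placeholder in mapping.items():
        text = text.replace(char, placeholder)
    return text


def restore(words, mapping):
    """Put the protected characters back into every token."""
    restored = list(words)
    for char, placeholder in mapping.items():
        restored = [w.replace(placeholder, char) for w in restored]
    return restored


def stand_in_tokenizer(text):
    """Not nagisa: NFKC-normalize, keep ASCII runs whole, split the rest per character."""
    text = unicodedata.normalize("NFKC", text)
    return re.findall(r"[A-Za-z0-9_]+|.", text)


def tokenize_preserving(text, tokenizer, mapping=PLACEHOLDERS):
    """Tokenize with placeholders in place, then restore the originals."""
    return restore(tokenizer(protect(text, mapping)), mapping)


sample = "これは\uff08サンプル\uff09です"
tokens = tokenize_preserving(sample, stand_in_tokenizer)
print(tokens)
```

With nagisa, the tokenizer argument could presumably be something like `lambda t: nagisa.tagging(t).words`; that only works if each placeholder comes back as a single token, which should be checked on real input.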

I will share a generalized version of the above code, so please wait a moment.

taishi-i commented 2 months ago

Hi @OishiUnagi! You can use the following class (WrappedNagisa) to output characters that you want to keep unchanged.

To use it, first define the characters you want to keep unchanged, for example targets = ["(", ")", "〜"]. Then create wrapped_nagisa = WrappedNagisa(targets) and run the tokenization and tagging through it.

Please note that this process may slow down the speed of tokenization, so if you have only a few targets, consider using the replacement code provided above. If the code does not work, please feel free to contact me. Thank you.
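
One thing that may be worth checking before relying on placeholder substitution is that the placeholders do not already occur in the input. The sketch below builds placeholders in the same `_{i}_` style as the class that follows and raises an error on a collision; `build_placeholders` is a hypothetical helper, not part of nagisa.

```python
def build_placeholders(preserve_chars, text):
    """Map each character to a `_{i}_` placeholder, refusing ambiguous input."""
    mapping = {}
    for i, char in enumerate(preserve_chars):
        placeholder = f"_{i}_"
        if placeholder in text:
            # Restoring would also rewrite this pre-existing substring.
            raise ValueError(f"text already contains placeholder {placeholder!r}")
        mapping[char] = placeholder
    return mapping


mapping = build_placeholders(["\uff08", "\uff09"], "これは\uff08サンプル\uff09です")
print(mapping)
```

Input that happens to contain a string such as `_0_` would otherwise be turned into a preserved character during restoration.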

import nagisa

class WrappedNagisa:
    def __init__(self, preserve_chars):
        self.preserve_char_to_placeholder = {}
        for i, char in enumerate(preserve_chars):
            placeholder = f"_{i}_"
            self.preserve_char_to_placeholder[char] = placeholder

        self.placeholder_to_preserve_char = {
            v: k for k, v in self.preserve_char_to_placeholder.items()
        }

        self.tagger = nagisa.Tagger(
            single_word_list=self.preserve_char_to_placeholder.values()
        )

    def wakati(self, text):
        text = "".join(
            [
                (
                    self.preserve_char_to_placeholder[char]
                    if char in self.preserve_char_to_placeholder
                    else char
                )
                for char in text
            ]
        )
        words = self.tagger.wakati(text)
        words = [
            (
                self.placeholder_to_preserve_char[word]
                if word in self.placeholder_to_preserve_char
                else word
            )
            for word in words
        ]
        return words

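    def tagging(self, text):
        # Tokenize with placeholders so the preserved characters survive.
        words = self.wakati(text)
        # Tag the raw text, then swap in the preserved words. This assumes
        # both calls split the text into the same number of tokens.
        tokens = self.tagger.tagging(text)
        tokens.words.clear()
        tokens.words.extend(words)
        return tokens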

if __name__ == "__main__":
    targets = ["(", ")", "〜"]
    wrapped_nagisa = WrappedNagisa(targets)

    text = "これは(サンプル)です〜。これは(サンプル)です~。"
    print(wrapped_nagisa.wakati(text))
    # ['これ', 'は', '(', 'サンプル', ')', 'です', '〜', '。', 'これ', 'は', '(', 'サンプル', ')', 'です', '~', '。']
    print(wrapped_nagisa.tagging(text))
    # これ/代名詞 は/助詞 (/補助記号 サンプル/名詞 )/補助記号 です/助動詞 〜/補助記号 。/補助記号 これ/代名詞 は/助詞 (/補助記号 サンプル/名詞 )/補助記号 です/助動詞 ~/助詞 。/補助記号

OishiUnagi commented 2 months ago

Hi Taishi-i. Thank you for your answer. This is very helpful.

OishiUnagi commented 2 months ago

Hi @taishi-i, I have another issue: Nagisa changes the hankaku space to the zenkaku space \u3000. I tried the method above, but it does not work. Can you guide me on this issue? Thank you.

taishi-i commented 2 months ago

Hi @OishiUnagi! Thank you for your comment. When I added " " to targets as shown below, I was able to retain spaces in my environment. I apologize for the lack of explanation. Could you please add the characters you want to retain to targets and execute the code?

targets = ["(", ")", "〜", " "]
wrapped_nagisa = WrappedNagisa(targets)

text = "これは(サンプル)です〜。これは ( sample ) です~。"
print(wrapped_nagisa.wakati(text))
# ['これ', 'は', '(', 'サンプル', ')', 'です', '〜', '。', 'これ', 'は', ' ', '(', ' ', 'sample', ' ', ')', ' ', 'です', '~', '。']

print(wrapped_nagisa.tagging(text))
# これ/代名詞 は/助詞 (/補助記号 サンプル/名詞 )/補助記号 です/助動詞 〜/補助記号 。/補助記号 これ/代名詞 は/助詞  /名詞 (/補助記号  /補助記号 sample/英単語  /補助記号 )/補助記号  /補助記号 です/助動詞 ~/助詞 。/補助記号

If it still doesn't work, could you share your code? I'll check the issue.
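
When it is unclear which characters still need to go into targets, comparing the tokenizer output with the input character by character can help. The following is a rough diagnostic sketch: `changed_chars` is a hypothetical helper, and the token list is made up to stand in for real tokenizer output.

```python
def changed_chars(text, tokens):
    """Return sorted (original, tokenized) pairs for characters that differ."""
    joined = "".join(tokens)
    if len(joined) != len(text):
        # Some normalizations change the length, so positions no longer align.
        raise ValueError("token text and input differ in length")
    return sorted({(a, b) for a, b in zip(text, joined) if a != b})


# Hypothetical tokens, standing in for a tokenizer's output.
text = "a\uff08b c"
tokens = ["a", "(", "b", "\u3000", "c"]
print(changed_chars(text, tokens))
```

Every character that appears as the first element of a pair is a candidate for targets. The helper gives up when the lengths differ, since a positional comparison is then no longer meaningful.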

OishiUnagi commented 2 months ago

Hi @taishi-i. It works now, thank you for all your help.

taishi-i commented 2 months ago

Hi @OishiUnagi! Thank you as well. If there are any unclear points with the program or if the problem remains unresolved, please feel free to reopen this issue and comment again. As this issue has been resolved, I will now close it. Thank you again for your cooperation.