polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License

Inconsistent KeyErrors in server code #59

Open YumYummity opened 4 months ago

YumYummity commented 4 months ago
Traceback (most recent call last):
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\YumYummity\Desktop\bot\twitch\pjsk.py", line 2201, in refresh_data
    self._event_maps[self.katsu.romaji(title)] = data["id"]
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 299, in romaji
    tokens = self.romaji_tokens(words, capitalize, title)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 222, in romaji_tokens
    roma = self.romaji_word(word)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 322, in romaji_word
    return self.map_kana(kana)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 367, in map_kana
    out += self.get_single_mapping(pk, char, nk)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 420, in get_single_mapping
    return self.table[kk]
KeyError: '彼'

More characters (same traceback, just not including the full thing):

KeyError: '★'
KeyError: '大'
KeyError: '繋'
KeyError: '森'

I may find more in the future.

polm commented 4 months ago

Sorry you're having trouble with this, that's a very strange error. Those are all extremely common characters (except ★ I guess), so it seems unlikely cutlet would generally be unable to handle them. In particular, it shouldn't be possible for kanji to be passed to map_kana.

What version of cutlet are you using? How are you initializing the katsu object?

This code worked for me, for example - can you see if it works for you, outside of your application?

import cutlet

katsu = cutlet.Cutlet()

for char in "彼★大繋森":
    print(katsu.romaji(char))
YumYummity commented 4 months ago

I am using 0.4.0 of cutlet.

C:\Users\YumYummity>pip freeze | findstr "cutlet"
cutlet==0.4.0

I initialize katsu inside a class:

import cutlet

class SOMECLASS:
    def __init__(self):
        # some code
        pass

    def some_func(self):
        self.katsu = cutlet.Cutlet()
        self.katsu.romaji("some japanese")

Your sample code works fine when I run it outside of my application. Hmm

C:\Users\YumYummity>python
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>>
>>> katsu = cutlet.Cutlet()
>>>
>>> for char in "彼★大繋森":
...     print(katsu.romaji(char))
...
Kare

Oo
Tsunagi
Mori
>>>
polm commented 4 months ago

OK, thanks for confirming details and that the code sample ran. Nothing looks wrong with your initialization, so I'm not really sure what could cause this.

I'll try a few more things, but if you can come up with an example I can run that has the same issue I could debug it.

There is one thing I understood - since map_kana is being called at line 322 in v0.4.0, that means that the token is being detected as katakana or hiragana, which is obviously wrong. Are you using a custom dictionary or MeCab build in your actual code that isn't in the short example I gave?

YumYummity commented 4 months ago

I believe I'm using unidic-lite, with no modifications.

Also, another thing I've noticed is that the KeyError doesn't happen every time, just rarely on some characters.

polm commented 4 months ago

Hm, that's weird. If you're using a standard setup then I'm not sure how I can reproduce it.

The best thing would be if you could find a snippet of code that reproduces it for me to try, though it sounds like that may be hard.

When you say it doesn't happen every time, do you mean that given the same input, sometimes it happens and sometimes it doesn't? If that's the case I guess it could be a threading issue, though I think that shouldn't happen...

If you do have a reproducible case locally, what you can do is drop a breakpoint or print statement to see why the token is being detected as kana. The place to do that would be this line. The main things to check would be word.surface and word.char_type. If those values look weird then other things should be considered.
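A minimal sketch of that check, assuming the tagger behaves like a fugashi Tagger: calling it with a string returns nodes that expose surface and char_type. describe_tokens is a hypothetical helper name for illustration, not part of Cutlet or fugashi.

```python
def describe_tokens(tagger, text):
    """Return a (surface, char_type) pair for every token in text.

    `tagger` is assumed to be callable like a fugashi Tagger.
    """
    # Copy the values into plain tuples right away, so they are recorded
    # as they were at parse time.
    return [(word.surface, word.char_type) for word in tagger(text)]
```

With fugashi installed, printing describe_tokens(Tagger(), title) for a failing title would show whether a kanji surface is being reported with a kana character type.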

YumYummity commented 4 months ago

Same input, yeah. The data that usually errors is from this file: https://github.com/Sekai-World/sekai-master-db-diff/blob/main/events.json

I don't have it reproducible sadly as it appears random, and has only happened a few times. If you want, I can send you the class where Cutlet is used, but it's pretty long.

Also, speaking of threading, the function that calls Cutlet is being threaded:

threading.Thread(target=self.refresh_data).start()
# Cutlet is initialized and called in self.refresh_data()
YumYummity commented 4 months ago

Update: it happened again.

(this is a mess of an error, as it's technically two errors being printed at the same time as the thread ran twice in a short time)

Exception in thread Thread-21 (refresh_data):
Traceback (most recent call last):
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-23 (refresh_data):
Traceback (most recent call last):
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 953, in run
    self.run()
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\YumYummity\Desktop\bot\twitch\pjsk.py", line 2210, in refresh_data
    self._target(*self._args, **self._kwargs)
  File "C:\Users\YumYummity\Desktop\bot\twitch\pjsk.py", line 2209, in refresh_data
    self._title_maps[self.katsu.romaji(title)] = data["id"]
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 299, in romaji
    self._title_maps[self.katsu_foreignless.romaji(title)] = data["id"]
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 299, in romaji
    tokens = self.romaji_tokens(words, capitalize, title)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 222, in romaji_tokens
    tokens = self.romaji_tokens(words, capitalize, title)    roma = self.romaji_word(word)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 222, in romaji_tokens

  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 322, in romaji_word
    roma = self.romaji_word(word)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 322, in romaji_word
        return self.map_kana(kana)return self.map_kana(kana)
  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 367, in map_kana

    out += self.get_single_mapping(pk, char, nk)  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 367, in map_kana

  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 420, in get_single_mapping
    out += self.get_single_mapping(pk, char, nk)
    return self.table[kk]
KeyError  File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 420, in get_single_mapping
: '灰'
    return self.table[kk]
KeyError: '幸'
polm commented 4 months ago

Hm, it seems very likely it's a threading issue.

Can you try making a single Cutlet object per thread and see if that resolves the issue? That won't help resolve the root cause, but I would expect it to fix your problem.
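One way to get an object per thread is threading.local from the standard library. This is only a sketch: get_converter and its name argument are hypothetical, and factory stands in for something like cutlet.Cutlet.

```python
import threading

# Each thread sees its own independent attributes on this object.
_local = threading.local()


def get_converter(name, factory):
    """Return the calling thread's converter for `name`, creating it once."""
    converters = getattr(_local, "converters", None)
    if converters is None:
        converters = {}
        _local.converters = converters
    if name not in converters:
        # First use in this thread: build a fresh object with the factory.
        converters[name] = factory()
    return converters[name]
```

In the refresh code this could be used as get_converter("default", cutlet.Cutlet).romaji(title), so each thread builds and keeps its own objects instead of sharing one stored on self.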

YumYummity commented 4 months ago

I don't think it's a threading issue, as I've checked the first error I sent. The error was from before I implemented threading.

polm commented 4 months ago

Oh, do you have example code that causes the error without threading? If you do, definitely send it to me. Reviewing things here, your first example has references to threading.py, so I assumed all your code was using threading.

I assume threading is involved because your error seems to be running through this code:

if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
    kana = jaconv.kata2hira(word.surface)
    return self.map_kana(kana)

With your error, characters that are obviously not hiragana or katakana are going inside this if block anyway. The char_type here is set by MeCab based on specific lists of characters, so there's normally no way it could go wrong. It also shouldn't normally be an issue with threaded code, but MeCab has some complicated internal memory management, which is why I think that might be a possible source of the issue.

If it's not threading related... I don't have many ideas about what could cause an issue with the character type. I'll have to look at the implementation again.

YumYummity commented 4 months ago

I believe the first traceback I sent didn't use threading (it also doesn't say exception in Thread-NUM:).

Here's what my implementation currently looks like (cutting some irrelevant variables and code):

import re
import threading
import time

import cutlet
import requests


class pjsk_data:
    def __init__(self):
        self._refreshed_at = 0
        self.refresh_data()

    def refresh_data(self):
        # URLS
        url_jp = "https://sekai-world.github.io/sekai-master-db-diff/musics.json"
        url2_jp = "https://sekai-world.github.io/sekai-master-db-diff/musicDifficulties.json"
        url = "https://sekai-world.github.io/sekai-master-db-en-diff/musics.json"
        url2 = "https://sekai-world.github.io/sekai-master-db-en-diff/musicDifficulties.json"
        url3_jp = "https://sekai-world.github.io/sekai-master-db-diff/musicTags.json"
        url3 = "https://sekai-world.github.io/sekai-master-db-en-diff/musicTags.json"
        event_url = "https://sekai-world.github.io/sekai-master-db-en-diff/events.json"
        event_url_jp = "https://sekai-world.github.io/sekai-master-db-diff/events.json"

        # Functions and Tools
        def simplify_title(title):
            # Remove all non-alphanumeric characters and convert to lowercase
            simplified_title = re.sub(r'[^a-zA-Z0-9\s]', '', title).lower()
            # Remove extra whitespace
            simplified_title = re.sub(r'\s+', ' ', simplified_title).strip()
            return simplified_title
        self.katsu = cutlet.Cutlet()
        self.katsu_foreignless = cutlet.Cutlet()
        self.katsu_foreignless.use_foreign_spelling = False

        # Requests
        songs = requests.get(url).json()
        songs_difficulties = requests.get(url2).json()
        jp_songs_difficulties = requests.get(url2_jp).json()
        jp_songs = requests.get(url_jp).json()
        jp_tag_data = requests.get(url3_jp).json()
        tag_data = requests.get(url3).json()
        events = requests.get(event_url).json()
        events_jp = requests.get(event_url_jp).json()

        # Maps
        self.tag_map = {
            "all": None,
            "other": "Other",
            "none": "No Main Unit",
            "vocaloid": "VIRTUAL SINGER",
            "piapro": "VIRTUAL SINGER",
            "school_refusal": "Nightcord at 25:00",
            "light_sound": "Leo/need",
            "light_music_club": "Leo/need",
            "idol": "MORE MORE JUMP!",
            "street": "Vivid BAD SQUAD",
            "theme_park": "Wonderlands×Showtime"
        }
        self.event_type_map = {
            "marathon": "Marathon",
            "cheerful_carnival": "Cheerful Carnival",
            "world_bloom": "World Link"
        }
        self.custom_title_definitions = {
            99: [ # MORE! JUMP! MORE!
                "mjm",
                "dakara motto"
            ],
            135: [ # Six Trillion Years and Overnight Story
                "six trillion"
            ],
            162: [ # End Mark ni Kibou to Namida wo soete
                "endmark"
            ],
            164: [ # Don't Fight The Music
                "dftm"
            ],
            176: [ # Machinegun Poem Doll
                "mgpd"
            ],
            186: [ # Hatsune Creation Myth
                "hcm"
            ],
            226: [ # Lost and Found
                "lnf",
                "kimino"
            ],
            250: [ # Kusare-Gedou and Chocolate
                "kusare-gedou",
                "chocolate boss song"
            ],
            251: [ # Fräulein=библиотека
                "Fraulein"
            ],
            315: [ # What's Up? Pop!
                "wup"
            ],
            328: [ # Sekai-Chan and Kafu-Chan's Otsukai Gassoukyoku
                "sekai-chan",
                "kafu-chan"
            ],
            396: [ # 東京テディベア
                "ttb",
                "tokyo teddy bear"
            ],
            503: [ # 超最終鬼畜妹フランドール・S
                "flandre s"
            ],
        }

        # Title Maps
        self._title_maps = {}
        for data in songs:
            title = data["title"].lower().strip()
            simplified_title = simplify_title(title)
            self._title_maps[title] = data["id"]
            if simplified_title != title:
                self._title_maps[simplified_title] = data["id"]
        for data in jp_songs:
            title = data["title"].strip()
            self._title_maps[title] = data["id"]
            self._title_maps[self.katsu.romaji(title)] = data["id"]
            self._title_maps[self.katsu_foreignless.romaji(title)] = data["id"]
            # Check if there are custom titles defined for this data["id"]
            if data["id"] in self.custom_title_definitions:
                custom_titles = self.custom_title_definitions[data["id"]]
                for custom_title in custom_titles:
                    self._title_maps[custom_title.lower().strip()] = data["id"]

        # Event Maps
        self._event_maps = {}
        for data in events:
            title = data["name"].lower().strip()
            simplified_title = simplify_title(title)
            self._event_maps[title] = data["id"]
            if simplified_title != title:
                self._event_maps[simplified_title] = data["id"]
        for data in events_jp:
            title = data["name"].strip()
            self._event_maps[title] = data["id"]
            self._event_maps[self.katsu.romaji(title)] = data["id"]
            self._event_maps[self.katsu_foreignless.romaji(title)] = data["id"]
            # Check if there are custom titles defined for this data["id"]
            # if data["id"] in self.custom_title_definitions:
            #     custom_titles = self.custom_title_definitions[data["id"]]
            #     for custom_title in custom_titles:
            #         self._event_maps[custom_title.lower().strip()] = data["id"]

and how it's called:

    def _check_refresh(self):
        if self._refreshed_at < time.time() - 3600:  # 3600 seconds = 1 hour
            threading.Thread(target=self.refresh_data).start()

    @property
    def title_maps(self):
        self._check_refresh()
        return self._title_maps
polm commented 1 week ago

Apologies for the delayed reply. There is an actual bug here, but it is not in Cutlet, and your issue must be threading related.

Your example code is not doing anything that would cause a problem. However, despite what you said, your first traceback is definitely using threading, see the first lines referencing threading.py:

Traceback (most recent call last):
  File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()

Given the file path of your code, is this a twitch bot that takes callbacks or something? Is some of the code async?

The actual bug is that the character class is being invalidated / overwritten when parse is called in MeCab. Here is a minimal example of the bug:

from fugashi import Tagger
tagger = Tagger()
xx = tagger("日本語")
print(xx[0].char_type) # => 2
tagger("にほんご") # note this is not assigned anywhere
print(xx[0].char_type) # => 6, this is wrong

However, because Cutlet manages the nodes and tagger internally, it is not possible for this to happen unless something is happening inside a single Cutlet call, which I can only imagine being caused by threading or async code. You should be able to resolve your issue by using one Cutlet object per thread, and making sure threads are not suspended while Cutlet is making individual calls.
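Separately from per-thread objects, it may help to make sure two refreshes never run at the same time, since the interleaved traceback shows two refresh threads active at once. The following is a rough sketch under that assumption. Refresher and its parameters are hypothetical names, and refresh stands in for a function like refresh_data.

```python
import threading
import time


class Refresher:
    """Hypothetical wrapper that allows only one background refresh at a time."""

    def __init__(self, refresh, max_age=3600):
        self._refresh = refresh        # callable that does the actual work
        self._max_age = max_age        # seconds before the data counts as stale
        self._refreshed_at = 0.0
        self._lock = threading.Lock()  # held for the duration of a refresh

    def _run(self):
        try:
            self._refresh()
            self._refreshed_at = time.time()
        finally:
            # A plain Lock may be released from a different thread
            # than the one that acquired it.
            self._lock.release()

    def check_refresh(self):
        """Start a refresh thread unless the data is fresh or one is running."""
        if self._refreshed_at >= time.time() - self._max_age:
            return None
        if not self._lock.acquire(blocking=False):
            return None  # another refresh is still in progress
        thread = threading.Thread(target=self._run)
        try:
            thread.start()
        except RuntimeError:
            self._lock.release()
            raise
        return thread
```

The non-blocking acquire means callers never wait. They keep using the existing data until the running refresh finishes and records its completion time.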

YumYummity commented 1 week ago

Hi, thanks for the help! Yeah I didn't see the threading reference in my first error.

The code is meant to fetch some data from a repository and use it in a Twitch bot, refreshing the data every N seconds. Each refresh starts a new thread that runs the fetching and romaji conversions.

The Twitch bot is async. The function that calls Cutlet is not async, but it does run in a separate thread.

I'll try your suggestion, thanks!