Open YumYummity opened 4 months ago
Sorry you're having trouble with this, that's a very strange error. Those are all extremely common characters (except ★ I guess), so it seems unlikely cutlet would generally be unable to handle them. In particular, it shouldn't be possible for kanji to be passed to map_kana.
What version of cutlet are you using? How are you initializing the katsu object?
This code worked for me, for example - can you see if it works for you, outside of your application?
import cutlet
katsu = cutlet.Cutlet()
for char in "彼★大繋森":
    print(katsu.romaji(char))
I am using 0.4.0 of cutlet.
C:\Users\YumYummity>pip freeze | findstr "cutlet"
cutlet==0.4.0
I initialize katsu inside a class:
class SOMECLASS:
    def __init__(self):
        # some code
        pass

    def some_func(self):
        self.katsu = cutlet.Cutlet()
        self.katsu.romaji("some japanese")
Your sample code works fine when I run it outside my application. Hmm
C:\Users\YumYummity>python
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug 1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>>
>>> katsu = cutlet.Cutlet()
>>>
>>> for char in "彼★大繋森":
...     print(katsu.romaji(char))
...
Kare
Oo
Tsunagi
Mori
>>>
OK, thanks for confirming details and that the code sample ran. Nothing looks wrong with your initialization, so I'm not really sure what could cause this.
I'll try a few more things, but if you can come up with an example I can run that has the same issue I could debug it.
There is one thing I understood - since map_kana is being called at line 322 in v0.4.0, that means that the token is being detected as katakana or hiragana, which is obviously wrong. Are you using a custom dictionary or MeCab build in your actual code that isn't in the short example I gave?
I believe I'm using unidic-lite, with no modifications.
Also, another thing I've noticed is that the KeyError doesn't happen every time, just rarely on some characters.
Hm, that's weird. If you're using a standard setup then I'm not sure how I can reproduce it.
The best thing would be if you could find a snippet of code that reproduces it for me to try, though it sounds like that may be hard.
When you say it doesn't happen every time, do you mean that given the same input, sometimes it happens and sometimes it doesn't? If that's the case I guess it could be a threading issue, though I think that shouldn't happen...
If you do have a reproducible case locally, what you can do is drop a breakpoint or print statement to see why the token is being detected as kana. The place to do that would be this line. The main things to check would be word.surface and word.char_type. If those values look weird then other things should be considered.
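A minimal sketch of that check, assuming the tokens expose surface and char_type the way fugashi nodes do (cutlet tokenizes with fugashi). The helper name describe_tokens is hypothetical and not part of cutlet or fugashi; it accepts any callable that returns token objects.

```python
def describe_tokens(tagger, text):
    """Return (surface, char_type) pairs for every token in text.

    `tagger` is any callable returning token objects that have `surface`
    and `char_type` attributes, e.g. a fugashi Tagger (an assumption here).
    """
    # Read both attributes right away, while the parse result is still fresh.
    return [(word.surface, word.char_type) for word in tagger(text)]
```

With fugashi installed, printing describe_tokens(Tagger(), title) just before the failing romaji call would show whether a kanji token is reporting a kana char_type.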
Same input, yeah. The data that usually errors is from this file: https://github.com/Sekai-World/sekai-master-db-diff/blob/main/events.json
I don't have it reproducible sadly as it appears random, and has only happened a few times. If you want, I can send you the class where Cutlet is used, but it's pretty long.
Also, speaking of threading, the function that calls Cutlet is being threaded:
threading.Thread(target=self.refresh_data).start()
# Cutlet is initialized and called in self.refresh_data()
Update: it happened again.
(this is a mess of an error, as it's technically two errors being printed at the same time as the thread ran twice in a short time)
Exception in thread Thread-21 (refresh_data):
Traceback (most recent call last):
File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
Exception in thread Thread-23 (refresh_data):
Traceback (most recent call last):
File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\Development\Python\Python3.10.6\lib\threading.py", line 953, in run
self.run()
File "C:\Development\Python\Python3.10.6\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\YumYummity\Desktop\bot\twitch\pjsk.py", line 2210, in refresh_data
self._target(*self._args, **self._kwargs)
File "C:\Users\YumYummity\Desktop\bot\twitch\pjsk.py", line 2209, in refresh_data
self._title_maps[self.katsu.romaji(title)] = data["id"]
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 299, in romaji
self._title_maps[self.katsu_foreignless.romaji(title)] = data["id"]
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 299, in romaji
tokens = self.romaji_tokens(words, capitalize, title)
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 222, in romaji_tokens
tokens = self.romaji_tokens(words, capitalize, title) roma = self.romaji_word(word)
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 222, in romaji_tokens
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 322, in romaji_word
roma = self.romaji_word(word)
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 322, in romaji_word
return self.map_kana(kana)return self.map_kana(kana)
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 367, in map_kana
out += self.get_single_mapping(pk, char, nk) File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 367, in map_kana
File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 420, in get_single_mapping
out += self.get_single_mapping(pk, char, nk)
return self.table[kk]
KeyError File "C:\Development\Python\Python3.10.6\lib\site-packages\cutlet\cutlet.py", line 420, in get_single_mapping
: '灰'
return self.table[kk]
KeyError: '幸'
Hm, it seems very likely it's a threading issue.
Can you try making a single cutlet object per thread and see if that resolves the issue? That won't help resolve the root cause, but I would expect it to fix your problem.
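One way to get one object per thread is thread-local storage. This is only a sketch: get_converter and _thread_state are hypothetical names, and the factory argument is assumed to be something like cutlet.Cutlet.

```python
import threading

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def get_converter(factory):
    """Return the converter owned by the current thread, creating it on first use."""
    converter = getattr(_thread_state, "converter", None)
    if converter is None:
        converter = factory()  # e.g. cutlet.Cutlet (assumption)
        _thread_state.converter = converter
    return converter
```

Inside the threaded function this would be used as get_converter(cutlet.Cutlet).romaji(title). The sketch keeps a single slot per thread, so a second, differently configured converter would need its own slot.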
I don't think it's a threading issue, as I've checked the first error I sent. The error was from before I implemented threading.
Oh, do you have example code that causes the error without threading? If you do, definitely send it to me. Reviewing things here, your first example has references to threading.py, so I assumed all your code was using threading.
I assume threading is involved because your error seems to be running through this code:
if word.char_type in (CHAR_HIRAGANA, CHAR_KATAKANA):
    kana = jaconv.kata2hira(word.surface)
    return self.map_kana(kana)
With your error, characters that are obviously not hiragana or katakana are going inside this if block anyway. The char_type here is set by MeCab based on specific lists of characters, so there's normally no way it could go wrong. It also shouldn't normally be an issue with threaded code, but MeCab has some internal memory management that's complicated and so that's why I think that might be a possible source of the issue.
If it's not threading related... I don't have many ideas about what could cause an issue with the character type. I'll have to look at the implementation again.
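To illustrate why a wrong character type ends in a KeyError, here is a heavily simplified model of a kana table lookup. KANA_TABLE and map_kana_sketch are hypothetical stand-ins, not cutlet's actual table or implementation.

```python
# Hypothetical, heavily simplified stand-in for a kana-to-romaji table.
KANA_TABLE = {"か": "ka", "れ": "re", "も": "mo", "り": "ri"}


def map_kana_sketch(kana):
    """Romanize a string by looking up every character in the kana table."""
    return "".join(KANA_TABLE[char] for char in kana)


print(map_kana_sketch("かれ"))  # kana only, so every lookup succeeds

try:
    map_kana_sketch("灰")  # a kanji wrongly routed into the kana branch
except KeyError as err:
    print("KeyError:", err)
```

The table only has kana keys, so any kanji that reaches this branch fails on the first lookup, which matches the shape of the tracebacks above.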
I believe the first traceback I sent didn't use threading (it also doesn't say exception in Thread-NUM:).
Here's what my implementation currently looks like (cutting some irrelevant variables and code)
class pjsk_data:
    def __init__(self):
        self._refreshed_at = 0
        self.refresh_data()

    def refresh_data(self):
        # URLS
        url_jp = "https://sekai-world.github.io/sekai-master-db-diff/musics.json"
        url2_jp = "https://sekai-world.github.io/sekai-master-db-diff/musicDifficulties.json"
        url = "https://sekai-world.github.io/sekai-master-db-en-diff/musics.json"
        url2 = "https://sekai-world.github.io/sekai-master-db-en-diff/musicDifficulties.json"
        url3_jp = "https://sekai-world.github.io/sekai-master-db-diff/musicTags.json"
        url3 = "https://sekai-world.github.io/sekai-master-db-en-diff/musicTags.json"
        event_url = "https://sekai-world.github.io/sekai-master-db-en-diff/events.json"
        event_url_jp = "https://sekai-world.github.io/sekai-master-db-diff/events.json"

        # Functions and Tools
        def simplify_title(title):
            # Remove all non-alphanumeric characters and convert to lowercase
            simplified_title = re.sub(r'[^a-zA-Z0-9\s]', '', title).lower()
            # Remove extra whitespace
            simplified_title = re.sub(r'\s+', ' ', simplified_title).strip()
            return simplified_title

        self.katsu = cutlet.Cutlet()
        self.katsu_foreignless = cutlet.Cutlet()
        self.katsu_foreignless.use_foreign_spelling = False

        # Requests
        songs = requests.get(url).json()
        songs_difficulties = requests.get(url2).json()
        jp_songs_difficulties = requests.get(url2_jp).json()
        jp_songs = requests.get(url_jp).json()
        jp_tag_data = requests.get(url3_jp).json()
        tag_data = requests.get(url3).json()
        events = requests.get(event_url).json()
        events_jp = requests.get(event_url_jp).json()

        # Maps
        self.tag_map = {
            "all": None,
            "other": "Other",
            "none": "No Main Unit",
            "vocaloid": "VIRTUAL SINGER",
            "piapro": "VIRTUAL SINGER",
            "school_refusal": "Nightcord at 25:00",
            "light_sound": "Leo/need",
            "light_music_club": "Leo/need",
            "idol": "MORE MORE JUMP!",
            "street": "Vivid BAD SQUAD",
            "theme_park": "Wonderlands×Showtime"
        }
        self.event_type_map = {
            "marathon": "Marathon",
            "cheerful_carnival": "Cheerful Carnival",
            "world_bloom": "World Link"
        }
        self.custom_title_definitions = {
            99: [  # MORE! JUMP! MORE!
                "mjm",
                "dakara motto"
            ],
            135: [  # Six Trillion Years and Overnight Story
                "six trillion"
            ],
            162: [  # End Mark ni Kibou to Namida wo soete
                "endmark"
            ],
            164: [  # Don't Fight The Music
                "dftm"
            ],
            176: [  # Machinegun Poem DOll
                "mgpd"
            ],
            186: [  # Hatsune Creation Myth
                "hcm"
            ],
            226: [  # Lost and Found
                "lnf",
                "kimino"
            ],
            250: [  # Kusare-Gedou and Chocolate
                "kusare-gedou",
                "chocolate boss song"
            ],
            251: [  # Fräulein=библиотека
                "Fraulein"
            ],
            315: [  # What's Up? Pop!
                "wup"
            ],
            328: [  # Sekai-Chan and Kafu-Chan's Otsukai Gassoukyoku
                "sekai-chan",
                "kafu-chan"
            ],
            396: [  # 東京テディベア
                "ttb",
                "tokyo teddy bear"
            ],
            503: [  # 超最終鬼畜妹フランドール・S
                "flandre s"
            ],
        }

        # Title Maps
        self._title_maps = {}
        for data in songs:
            title = data["title"].lower().strip()
            simplified_title = simplify_title(title)
            self._title_maps[title] = data["id"]
            if simplified_title != title:
                self._title_maps[simplified_title] = data["id"]
        for data in jp_songs:
            title = data["title"].strip()
            self._title_maps[title] = data["id"]
            self._title_maps[self.katsu.romaji(title)] = data["id"]
            self._title_maps[self.katsu_foreignless.romaji(title)] = data["id"]
            # Check if there are custom titles defined for this data["id"]
            if data["id"] in self.custom_title_definitions:
                custom_titles = self.custom_title_definitions[data["id"]]
                for custom_title in custom_titles:
                    self._title_maps[custom_title.lower().strip()] = data["id"]

        # Event Maps
        self._event_maps = {}
        for data in events:
            title = data["name"].lower().strip()
            simplified_title = simplify_title(title)
            self._event_maps[title] = data["id"]
            if simplified_title != title:
                self._event_maps[simplified_title] = data["id"]
        for data in events_jp:
            title = data["name"].strip()
            self._event_maps[title] = data["id"]
            self._event_maps[self.katsu.romaji(title)] = data["id"]
            self._event_maps[self.katsu_foreignless.romaji(title)] = data["id"]
            # Check if there are custom titles defined for this data["id"]
            # if data["id"] in self.custom_title_definitions:
            #     custom_titles = self.custom_title_definitions[data["id"]]
            #     for custom_title in custom_titles:
            #         self._event_maps[custom_title.lower().strip()] = data["id"]
and how it's called:
def _check_refresh(self):
    if self._refreshed_at < time.time() - 3600:  # 3600 seconds = 1 hour
        threading.Thread(target=self.refresh_data).start()

@property
def title_maps(self):
    self._check_refresh()
    return self._title_maps
Apologies for the delayed reply. There is an actual bug here, but it is not in Cutlet, and your issue must be threading related.
Your example code is not doing anything that would cause a problem. However, despite what you said, your first traceback is definitely using threading; see the first lines referencing threading.py:
Traceback (most recent call last):
File "C:\Development\Python\Python3.10.6\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
Given the file path of your code, is this a twitch bot that takes callbacks or something? Is some of the code async?
The actual bug is that the character class is being invalidated / overwritten when parse is called in MeCab. Here is a minimal example of the bug:
from fugashi import Tagger
tagger = Tagger()
xx = tagger("日本語")
print(xx[0].char_type) # => 2
tagger("にほんご") # note this is not assigned anywhere
print(xx[0].char_type) # => 6, this is wrong
However, because Cutlet manages the nodes and tagger internally, it is not possible for this to happen unless something is happening inside a single Cutlet call, which I can only imagine being caused by threading or async code. You should be able to resolve your issue by using one Cutlet object per thread, and making sure threads are not suspended while Cutlet is making individual calls.
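If one object per thread is awkward, another way to keep two threads from interleaving inside a single call is to guard a shared converter with a lock. A minimal sketch: LockedConverter is a hypothetical helper, and it assumes only that the wrapped object has a romaji(text) method, as cutlet.Cutlet does. Whether this alone avoids the MeCab behaviour above has not been verified here.

```python
import threading


class LockedConverter:
    """Serialize romaji() calls on one shared converter object."""

    def __init__(self, converter):
        self._converter = converter
        self._lock = threading.Lock()

    def romaji(self, text):
        # Only one thread at a time may be inside the wrapped converter.
        with self._lock:
            return self._converter.romaji(text)
```

Usage would look like katsu = LockedConverter(cutlet.Cutlet()), created once and shared, with every caller going through katsu.romaji(title).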
Hi, thanks for the help! Yeah I didn't see the threading reference in my first error.
The code fetches some data from a repository for use in a Twitch bot, and refreshes the data every N seconds. Each refresh starts a new thread that runs the fetching and romaji conversions.
The Twitch bot is async. The function in which Cutlet is called is not; however, it runs in a thread.
I'll try your suggestion, thanks!
More characters (same traceback, just not including the full thing):
KeyError: '★'
KeyError: '大'
KeyError: '繋'
KeyError: '森'
may find more in the future