Closed by Lili1228 3 years ago
This is an issue with the way the underlying library, MeCab, works. All words need costs, and lower-cost words are preferred. If a word is not in the dictionary, MeCab calculates its cost based on the length and the type of character (Latin, hiragana, kanji, etc.). Because costs are calculated for the sequence as a whole, at certain lengths it's cheaper to break the sequence up than to treat it as a single word.
In my tests with the latest version of unidic-lite this only happens if the unknown input is more than 25 characters. Are you actually processing the names of long Welsh towns or is there something else you're doing where this comes up a lot? It may be possible to change this behavior by modifying the dictionary settings but it seems like it's not an issue for normal usage.
That's the only case I've found so far, and I assume the only other times I'd trigger it would be if I input random garbage (which can sometimes happen). While I don't think it's worth changing that for everyone, can you tell me how to change that setting?
On investigation, this isn't actually due to cost calculations. It happens because MeCab has a hard cap on the length of unknown words. You can change this value by passing -M [number] to the Tagger, so for you the fix would look like this:
import cutlet
import fugashi
cut = cutlet.Cutlet()
tagger = fugashi.Tagger('-M 100')
cut.tagger = tagger
# now you can get unknown words up to length 101
The maximum length specification seems to have an off-by-one error, so the effective limit is actually one longer than the number you specify.
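If you want to think in terms of the longest unknown word you need rather than the raw flag value, a small helper can do the subtraction. This is only a sketch: `max_grouping_flag` is a hypothetical name, not part of fugashi or cutlet, and it assumes the off-by-one behavior observed above, which may not hold for every MeCab version.

```python
def max_grouping_flag(longest_unknown_word: int) -> str:
    """Build the MeCab -M argument for a desired unknown-word length.

    Assumes the off-by-one observed in this thread: the effective limit
    is one more than the number passed to -M.
    """
    if longest_unknown_word < 2:
        raise ValueError("longest_unknown_word must be at least 2")
    return f"-M {longest_unknown_word - 1}"


# Example: ask for unknown words of up to 101 characters.
flag = max_grouping_flag(101)
print(flag)
```

The resulting string can be passed to the Tagger in place of the hard-coded '-M 100' in the snippet above.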
Let me know if that fixes it for you.
It fixed my problem, thank you!
Reopening because while the previous thing was rather insignificant and fixable, this is definitely bad:
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald ' s"
I use Cutlet combined with Text-to-Speech and no English voice synthesizer is able to pronounce this correctly with spacing like that.
Ah, good catch, I'll look at fixing that.
This should be fixed in the latest version, please confirm if it works for you.
I also looked at handling quotes in general, for sentences like It's 'delicious.' but that ended up being much more complicated, partly because MeCab sticks punctuation together. Since cutlet isn't really designed to take already-English input like that anyway, I'm treating it as out of scope.
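For already-English input that cutlet leaves with spaced-out apostrophes, one possible workaround is to post-process the output before handing it to a speech synthesizer. The following is a minimal sketch, not something cutlet provides: `rejoin_apostrophes` is a hypothetical helper, and the list of suffixes is an assumption that only covers common English possessives and contractions.

```python
import re

# A word character, then " ' ", then a common English suffix
# (possessive or contraction) that ends at a word boundary.
_SPLIT_APOSTROPHE = re.compile(
    r"(?<=\w) ' (?=(?:s|t|d|m|ll|re|ve)\b)", re.IGNORECASE
)


def rejoin_apostrophes(text: str) -> str:
    """Collapse "word ' s" back into "word's"; leave other quotes alone."""
    return _SPLIT_APOSTROPHE.sub("'", text)


print(rejoin_apostrophes("Text McDonald ' s text"))
```

Quoted phrases such as 'delicious.' are deliberately left untouched, since telling an opening quote from an apostrophe needs more context than a regular expression has.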
It works only if it's the only word:
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald's"
>>> x.romaji("text McDonald's text")
"Text McDonald ' s text"
Ah, that's embarrassing, but I think I found the issue. Should be fixed in master, I'll release a test alpha tomorrow.
Should be fixed in alpha now, please confirm.
pip install cutlet==0.1.17a2
Works well, thank you!
Great, thanks for the confirmation, I'll make a release shortly.
I don't know if it's cutlet's or cutlet's dependency's fault, but I'm trying here.