Cutlet creates additional spaces in some words written in Latin alphabet

polm / cutlet

Japanese to romaji converter in Python

https://polm.github.io/cutlet/

MIT License

Cutlet creates additional spaces in some words written in Latin alphabet #21

Closed Lili1228 closed 3 years ago

Lili1228 commented 3 years ago

I don't know whether this is the fault of cutlet or of one of its dependencies, but I'm reporting it here.

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> katsu = cutlet.Cutlet()
>>> text = '私は Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch にい ま す'
>>> katsu.romaji(text)
'Watakushi wa L l a n f a i r p w l l g w y n g y l l g o g e r y c h w y r n d robwllllantysiliogogogoch ni ima su'

polm commented 3 years ago

This is an issue with the way the underlying library, MeCab, works. All words need costs, and lower-cost words are preferred. If a word is not in the dictionary, MeCab calculates its cost based on the length and the type of character (Latin, hiragana, kanji, etc.). Because costs are calculated for the sequence as a whole, at certain lengths it's cheaper to break the sequence up than to treat it as a single word.

In my tests with the latest version of unidic-lite this only happens if the unknown input is more than 25 characters. Are you actually processing the names of long Welsh towns or is there something else you're doing where this comes up a lot? It may be possible to change this behavior by modifying the dictionary settings but it seems like it's not an issue for normal usage.

Lili1228 commented 3 years ago

That's the only case I've found so far, and I assume the only other times I'd trigger it would be if I input random garbage (which can sometimes happen). While I don't think it's worth changing that for everyone, can you tell me how to change that setting?

polm commented 3 years ago

On investigation, this isn't actually due to cost calculations. It happens because MeCab has a hard cap on the length of unknown words (the -M / --max-grouping-size option, which defaults to 24). You can change this value by passing -M [number] to the Tagger, so for you the fix would look like this:

import cutlet
import fugashi

cut = cutlet.Cutlet()
tagger = fugashi.Tagger('-M 100')
cut.tagger = tagger
# now you can get unknown words up to length 101

The maximum length specification seems to have an off-by-one error, so the actual limit is one longer than the number you specify.
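To make the cap concrete, here is a toy model in plain Python that reproduces the output from the original report. It is only a sketch of the observed behavior, not MeCab's actual algorithm: the name `toy_split` and the rule "split off single characters until the remainder fits" are assumptions chosen to match what was seen, with an effective limit of the -M value plus one.

```python
from typing import List


def toy_split(word: str, max_grouping: int = 24) -> List[str]:
    """Toy model (hypothetical, not MeCab code) of the unknown-word length cap."""
    # Assumption: the effective limit is one more than the configured value,
    # matching the off-by-one behavior described above.
    limit = max_grouping + 1
    tokens = []
    rest = word
    # While the remaining run is too long to be a single unknown word,
    # peel off one character as its own token.
    while len(rest) > limit:
        tokens.append(rest[0])
        rest = rest[1:]
    if rest:
        tokens.append(rest)
    return tokens


word = "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"
print(" ".join(toy_split(word)))       # default cap: leading letters are split off
print(" ".join(toy_split(word, 100)))  # raised cap: the word stays whole
```

With the default value the model leaves a 25-character tail after the single letters, which is the same shape as the reported output. With 100 the whole word fits in one token.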

Let me know if that fixes it for you.

Lili1228 commented 3 years ago

It fixed my problem, thank you!

Lili1228 commented 3 years ago

Reopening because while the previous thing was rather insignificant and fixable, this is definitely bad:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald ' s"

I use Cutlet combined with Text-to-Speech and no English voice synthesizer is able to pronounce this correctly with spacing like that.
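For anyone stuck on a cutlet version with this behavior, a post-processing step along these lines might be enough as a stopgap before handing the text to a synthesizer. This is a rough sketch, not how cutlet itself handles apostrophes: `rejoin_possessive` is a hypothetical helper, and it only covers the exact `word ' s` pattern shown above.

```python
import re

# Matches a word character, then " ' s" as a separate token (end of word).
_POSSESSIVE = re.compile(r"(\w) ' s\b")


def rejoin_possessive(text: str) -> str:
    """Hypothetical helper: glue a split-off possessive back onto its word."""
    return _POSSESSIVE.sub(r"\1's", text)


print(rejoin_possessive("McDonald ' s"))
```

It leaves text that already has a correctly attached apostrophe untouched, but it does nothing for other contractions or for quoted phrases.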

polm commented 3 years ago

Ah, good catch, I'll look at fixing that.

polm commented 3 years ago

This should be fixed in the latest version, please confirm if it works for you.

I also looked at handling quotes in general, for sentences like "It's 'delicious.'", but that ended up being much more complicated, partly because MeCab sticks punctuation together. Since cutlet isn't really designed to take already-English input like that anyway, I'm treating it as out of scope.

Lili1228 commented 3 years ago

It works only if it's the only word:

Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cutlet
>>> x=cutlet.Cutlet()
>>> x.romaji("McDonald's")
"McDonald's"
>>> x.romaji("text McDonald's text")
"Text McDonald ' s text"

polm commented 3 years ago

Ah, that's embarrassing, but I think I found the issue. Should be fixed in master, I'll release a test alpha tomorrow.

polm commented 3 years ago

Should be fixed in alpha now, please confirm.

pip install cutlet==0.1.17a2

Lili1228 commented 3 years ago

Works well, thank you!

polm commented 3 years ago

Great, thanks for the confirmation, I'll make a release shortly.