polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License

How to use Exceptions properly? #52

Closed Shun-Ibiki closed 4 months ago

Shun-Ibiki commented 5 months ago

Hey. First, thank you for your great work. I created an exceptions.csv like this:

"転生";"tensei"
"だって";"datte"
"ラノベ";"light novel"
"んですか";"ndesuka"
"でも";"demo"
"美少女";"bishoujo"
"んだが";"ndaga"

Only 転生 and ラノベ were applied. The others were not: for example, んだが was converted to "n-daga" and 美少女 became "bi-shoujo". Can you tell me what's wrong here?

FYI, I use the code below to apply the exceptions:

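import csv

import cutlet

katsu = cutlet.Cutlet()

# Each row is "japanese";"romaji"
with open("_exceptions.csv", encoding="utf-8", newline="") as fd:
    rd = csv.reader(fd, delimiter=";", quotechar='"')
    for row in rd:
        katsu.add_exception(row[0], row[1])

polm commented 5 months ago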

You're not using the exceptions API wrong, but it's very limited in what it can do. Exceptions are very primitive and only work for words that correspond to single tokens.

You can debug this by checking the tokenization of a sentence, like so:

import fugashi
tagger = fugashi.Tagger()
for tok in tagger("転生する美少女の小説を探しているんですが"):
  print(tok)

Each line of the output is one token. If your word appears on a line by itself, it is a single token and you can use an exception for it. If it is split across several lines, you have other options, but they get a little complicated.
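
The same check can be wrapped in a small helper that sorts exception entries into those that tokenize as one token and those that do not. This is only a sketch: partition_exceptions and its tokenize parameter are hypothetical names, not part of cutlet or fugashi. With fugashi you could pass something like lambda s: [tok.surface for tok in tagger(s)]. The stand-in tokenizer below only mimics a split for illustration, and its hard-coded splits are assumptions rather than real UniDic output.

```python
from typing import Callable, Iterable


def partition_exceptions(
    entries: Iterable[tuple[str, str]],
    tokenize: Callable[[str], list[str]],
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Split (japanese, romaji) pairs by whether the Japanese side is one token."""
    single, multi = [], []
    for japanese, romaji in entries:
        if len(tokenize(japanese)) == 1:
            single.append((japanese, romaji))
        else:
            multi.append((japanese, romaji))
    return single, multi


# Stand-in tokenizer for illustration only; these splits are assumptions,
# not verified tokenizer output. Swap in a real tokenizer to get real answers.
FAKE_SPLITS = {"美少女": ["美", "少女"], "んだが": ["ん", "だ", "が"]}


def fake_tokenize(text: str) -> list[str]:
    return FAKE_SPLITS.get(text, [text])


entries = [("転生", "tensei"), ("美少女", "bishoujo"), ("んだが", "ndaga")]
single, multi = partition_exceptions(entries, fake_tokenize)
print("usable as exceptions:", single)
print("need another approach:", multi)
```

Entries in the first list can go straight to add_exception; entries in the second need one of the workarounds described next.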

The simplest thing would be to do string replace on the resulting sentence, something like res.replace(" n daga", "ndaga"). Depending on the sequence you have in mind this can be pretty reliable, but sometimes it can be easy to affect the wrong words.
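
One way to lower the risk of touching the wrong words is to anchor each replacement at word boundaries with a regular expression instead of using a plain str.replace. A minimal sketch, assuming the romaji is already in a string: fix_romaji and the FIXES table are hypothetical, and the left-hand strings are taken from the outputs reported earlier in this thread, so adjust them to whatever cutlet actually produces for you.

```python
import re

# Hypothetical table: unwanted romaji -> desired romaji.
FIXES = {"n-daga": "ndaga", "bi-shoujo": "bishoujo"}


def fix_romaji(text: str, fixes: dict[str, str] = FIXES) -> str:
    """Replace each unwanted sequence only where it stands as a whole word."""
    # Longest keys first, so a short key cannot break up a longer match.
    for wrong, right in sorted(fixes.items(), key=lambda kv: -len(kv[0])):
        # Refuse to match when a letter, digit, underscore, or hyphen
        # touches either side of the sequence.
        pattern = r"(?<![\w-])" + re.escape(wrong) + r"(?![\w-])"
        text = re.sub(pattern, right, text)
    return text


print(fix_romaji("Tensei shita bi-shoujo na n-daga"))
```

Because of the boundary checks, a longer word that merely contains one of the sequences, such as "kan-daga", is left alone.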

The other thing you can do is start creating a custom tokenizer dictionary. If you have just nouns like 美少女 this is fine, but if you start messing with verbs or particles it might interact with existing dictionary entries strangely, so I wouldn't recommend it. For a guide on creating a custom dictionary, please refer to the MeCab documentation.

Note that the reason these entries are not single words has to do with the UniDic definition of a word, which is also complicated, unfortunately.

Shun-Ibiki commented 4 months ago

Thanks for your quick answer. So basically, exceptions only work for single tokens.