polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License

Why are some entries transliterated in this way? #55

Closed: shimizurei closed this 2 months ago

shimizurei commented 2 months ago

I wrote a simple Python function to transliterate given words/phrases from a specific source. These are some of the results:

I'm not too savvy with the stuff under the hood, but why is it having difficulty figuring out the context of the character usage? I'm probably going to have to add some exceptions, aren't I?

polm commented 2 months ago

It looks like these are due to a couple of different reasons.

As a general rule, the best way to debug things is to check what the dictionary entries are. You can do this by just running fugashi in a terminal and checking the output for that - the second and fourth fields will be of particular interest.
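
For example, a quick check with fugashi's Python API might look like this (a sketch; the lemma/pron field names assume a UniDic-flavored dictionary like unidic-lite):

import fugashi

# Print the dictionary fields for each token so you can see
# what cutlet is working from
tagger = fugashi.Tagger()
for word in tagger("お兄ちゃん気質"):
    print(word.surface, word.feature.lemma, word.feature.pron)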

To get to the point though, I would fix all of these with an exception, as that's the fastest workaround. For お兄さん and お母さん that won't work very well unfortunately, as the components are still tokenized correctly (お/兄/さん), so if you make an exception for the main word it'll be wrong in other circumstances. You could also add a final postprocessing step on the output romaji (Oanichan → Oniichan). This needs a better solution but I think anything robust is going to require some work somewhere, like building a custom dictionary or better model.

お兄ちゃん気質 = Oanichan Kishitsu => should be "Oniichan"
みんなのお母さん = Minna no Ohahasan => should be "Okaasan"

UniDic has the right entries for these, but they are not what comes out in normal analysis, so I think this is a problem with the MeCab model distributed with the dictionary.

頑張るユーリ! = Ganbaru Yurii! => should be "Yuuri"

In this case, "Yurii" is registered as a foreign spelling in the dictionary, like "card" for カード. It is one spelling of the name rendered as ユーリ, though the choice here seems arbitrary.

アツい8月 = Achii 8 Tsuki => should be "atsui 8-gatsu"

This is weird. It seems that for forms that use katakana this way, the pron entry adopts an exaggerated pronunciation. So サムい is サミー. It just looks like a mistake to me, and it is not present in the full UniDic.

オクト、植物園へ。

This one was a surprise. The dictionary lemma is オクト-oct-, with a trailing -. I hadn't seen this before, but it seems to be used for prefixes, like "semi-" ("semi-truck"), "bio", "intra", "entero", etc. Cutlet is not processing this correctly and turns the word into the empty string, so that's a bug here.


Sorry I don't have a great solution for any of that, but thanks for reporting these! I'll work on fixing the prefix thing at least.

shimizurei commented 2 months ago

Thank you for the in-depth write-up!

My script is part of a larger project to romanize some titles for a thing I'm doing. I've been reviewing the results and building a custom post-processing dictionary by checking each output individually (since it doesn't get the character names right 100% of the time).

Here are a few more that cutlet seems to struggle with:

ŹOOĻで寮生活? = ???? de Ryou Seikatsu? (might be outside its limits?)
○○パワー! = Power! (lost the marumaru)
那々緒の珍道中 = Na Itoguchi no Chin Douchuu (should be Nana, but it did a Hibi so idk what happened)
ロボットドールの“役割” = Robot D'Or no Yakuwari " (it seems to have gotten pretty confused)

Regarding your points:

UniDic has the right entries for these, but they are not what comes out in normal analysis, so I think this is a problem with the MeCab model distributed with the dictionary.

Should I create an issue on mecab-python3 then?

This is weird. It seems that for forms that use katakana this way, the pron entry adopts an exaggerated pronuncation. So サムい is サミー. It just looks like a mistake to me, and it is not present in the full UniDic.

Should I install the full UniDic for the best accuracy? (the size...-_-)

In order to make the exceptions and post-processing dictionaries properly, the exceptions should use the kanji/kana, while the post-processing acts on the romanized result, correct?

Example:

EXCEPTIONS = {
    'ユーリ': 'Yuuri',
    '環': 'Tamaki', # Character name that keeps being romanized incorrectly
    '虎於': 'Torao',
    'ナギ': 'Nagi', # Keeps being romanized as "Nagy"
    'ラーメン': 'ramen', # Was romanized as "rahmen"???
    '日本語': 'nihongo', # Was romanized as "nippon go" -- or should this go into PP_REPLACEMENTS, since it's "nippon" + "go"?
}

# Used in a function with replace/re.sub (a sketch follows after setup_cutlet below)
PP_REPLACEMENTS = {
    'oanichan': 'oniichan',
    'oanisan': 'oniisan',
    'ohahasan': 'okaasan',
    r'~ ': ' ~',
    r' \?': '?', # Sometimes a space is added before the "?" idk why
}

Function:

import cutlet

def setup_cutlet():
    """Set up the Cutlet romanization system."""
    katsu = cutlet.Cutlet()
    katsu.use_foreign_spelling = True
    # Build "konnichiwa" by hand so the final particle は comes out as "wa"
    hello = katsu.romaji("こんにち", capitalize=False) + "wa"
    katsu.add_exception("こんにちは", hello)

    for jp, rom in EXCEPTIONS.items():
        katsu.add_exception(jp, rom)
    return katsu
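
A minimal sketch of the replace/re.sub post-processing function referenced above (it isn't shown here, so this is a guess at its shape):

import re

def postprocess(romaji):
    """Apply PP_REPLACEMENTS to cutlet's output."""
    for pattern, repl in PP_REPLACEMENTS.items():
        # Plain keys like 'oanichan' are valid regexes too, so re.sub
        # covers both cases. Matching is case-sensitive as written, so
        # capitalized outputs like "Oanichan" need their own entries.
        romaji = re.sub(pattern, repl, romaji)
    return romaji
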
polm commented 2 months ago

UniDic has the right entries for these, but they are not what comes out in normal analysis, so I think this is a problem with the MeCab model distributed with the dictionary.

Should I create an issue on mecab-python3 then?

Unfortunately no - I don't train the dictionary, I just distribute it. The models are trained by NINJAL, so technically they should fix it, but I don't think they have any particular public suggested method for bug reports.

This is weird. It seems that for forms that use katakana this way, the pron entry adopts an exaggerated pronuncation. So サムい is サミー. It just looks like a mistake to me, and it is not present in the full UniDic.

Should I install the full UniDic for the best accuracy? (the size...-_-)

Yes, the larger and more recent version will be more accurate. The version I distribute via the unidic package on PyPI is also a little out of date - the latest version is on the official page.


About the specific errors...

ŹOOĻで寮生活? = ???? de Ryou Seikatsu? (might be outside its limits?)

This is detected as a non-ASCII, non-Japanese token. Usually that covers things like Cyrillic or other non-Latin scripts, but I hadn't considered that it covers non-ASCII Latin. This isn't exactly a bug, but it may be possible to improve it; there are existing methods for stripping things to ASCII.
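
For reference, one common method is Unicode decomposition (a sketch, not something cutlet currently does; it would only make sense applied to the non-Japanese token):

import unicodedata

def strip_to_ascii(text):
    # Decompose accented Latin letters (Ź -> Z + combining acute),
    # then drop anything that doesn't fit in ASCII
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_to_ascii("ŹOOĻ"))  # -> ZOOL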

○○パワー! = Power! (lost the marumaru)

In UniDic it looks like ◯ is treated as punctuation and has no reading. You could use a MeCab user dictionary to override this.
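
As a lighter-weight hack than a user dictionary (my own workaround, not a cutlet feature), you could also substitute a reading into the text before romanizing:

# Hand a reading to cutlet before MeCab ever sees the symbol;
# the exact output still depends on tokenization
text = "○○パワー!"
romaji = katsu.romaji(text.replace("○○", "まるまる"))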

那々緒の珍道中 = Na Itoguchi no Chin Douchuu (should be Nana, but it did a Hibi so idk what happened)

Common words with odoriji like 日々 are treated as a single word in UniDic, so cutlet doesn't have to figure out how to interpret them. I put some work into handling odoriji a while ago but didn't have many examples of them not being picked up, so this may just be a bug.

ロボットドールの“役割” = Robot D'Or no Yakuwari " (it seems to have gotten pretty confused)

UniDic has multiple entries for ドール. It looks like for this particular one it picks "d'Or", which is plausible in a generic sense, if not here. I'm not sure why the first quote mark disappears, that's probably a bug.

In order to make the exceptions and post-processing dictionaries properly, the exceptions should use the kanji/kana, while the post-processing acts on the romanized result, correct?

Exceptions in cutlet use the raw form you see in the document (see the included exceptions file). Post-processing is not a cutlet feature, so it's up to you, but it would be easiest to work on the romaji.
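
Putting the thread's pieces together, the overall flow would be (a sketch using the EXCEPTIONS, PP_REPLACEMENTS, setup_cutlet, and postprocess definitions above):

katsu = setup_cutlet()            # exceptions match raw kanji/kana tokens
romaji = katsu.romaji("頑張るユーリ!")
romaji = postprocess(romaji)      # fixes applied to the romanized string
print(romaji)                     # should come out as "Ganbaru Yuuri!"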