polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License
309 stars 21 forks

Support Title Case #15

Closed polm closed 3 years ago

polm commented 4 years ago

It should be possible to support title case, so that all words except particles are capitalized. So この世界の片隅に would be "Kono Sekai no Katasumi ni".

garfieldnate commented 4 years ago

This would require tokenizing the input Japanese text into words, which is a large problem requiring lots of data and a machine learning algorithm to do well. I would recommend that you not pursue the issue as written, as it would significantly increase the scope and dependencies of this project, and a major advantage of your current project is its focus and lightweight/pure-Python nature.

Perhaps it would still be useful to have a method that takes a romanized, space-separated string as input and outputs a title-cased version of the same. You might still consider it out of scope, but it would at least not require any heavy library additions.

polm commented 4 years ago

@garfieldnate This library already relies on MeCab and a dictionary to do Japanese tokenization, not sure what you're talking about.

garfieldnate commented 4 years ago

Just went to the demo and realized this T_T. Sorry, you can ignore my comment. Looks like a pretty simple feature to implement.

krackers commented 4 years ago

I don't think this strictly needs to be a part of the library. One could always just post-process the output to have it capitalize everything except the particles.

particles = ["no", "wa", "ga", "mo", "to", "ka", "ni"]
title_case = lambda s: " ".join([x.capitalize() if x not in particles else x.lower() for x in s.split()])
print(title_case("kino no tabi"))

Then just feed the output of cutlet to title_case. But I suppose it could be added as a convenience function.

I'm also not sure if there are any edge cases where you'd get more accurate results by doing the particle detection on the hiragana at the token level instead of on the resulting romaji (I couldn't think of any cases at first). Ah, I suppose some words like 荷 and 和 are such an edge case, where you can't just post-process the romaji. So it is indeed better to integrate this as part of the library.
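To make the edge case concrete, here is a small sketch (the phrase and romanization are my own illustration, not cutlet output) showing how the romaji-level post-processor from the comment above mishandles 荷が重い ("the load is heavy"), where "ni" is the noun 荷, not the particle に:

```python
# Romaji-level post-processing can't tell the particle に from the noun 荷:
# both come out as "ni". Using the particle list from the comment above:
particles = ["no", "wa", "ga", "mo", "to", "ka", "ni"]

def title_case(s):
    return " ".join(
        w.capitalize() if w not in particles else w.lower()
        for w in s.split()
    )

# 荷が重い romanizes to "ni ga omoi"; here "ni" is a noun,
# but the post-processor lowercases it anyway.
print(title_case("ni ga omoi"))   # 'ni ga Omoi' -- 'Ni ga Omoi' was wanted
```

Doing the check on the Japanese token (or its part of speech) instead avoids this ambiguity, since 荷 and に are distinct at that level.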

krackers commented 4 years ago

One other idea that might make adding similar features cleaner.

Instead of directly parsing the MeCab output and returning the romaji string, you could instead have an intermediate representation that is an array of (JP, romaji) pairs. E.g. right now we have

katsu.romaji("カツカレーは美味しい")   # 'Cutlet curry wa oishii'

but consider if you instead rewrote the parser to return

 katsu.map_romaji_word("カツカレーは美味しい")   # [('カツ', 'cutlet'), ('カレー', 'curry'), ('は', 'wa'), ('美味しい', 'oishii')]

Now the final output can be represented as a transformation of the above intermediary. For the current behavior, we just take the second item of each pair and concat them with a space. For title-case behavior, you can check whether the corresponding JP token is a particle or not. (Or maybe you could add additional metadata obtained from MeCab, such as part of speech, to the intermediary?) You could also add an option to capitalize proper nouns, etc. So essentially you decouple the parsing itself from the final user-facing representation.
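A minimal sketch of that transformation, assuming the proposed (JP, romaji) intermediary (the function name and particle list are illustrative, not cutlet's API):

```python
# Title-casing as a transformation over (JP, romaji) pairs. Particle
# detection happens on the Japanese token, so homophones like 荷 vs に
# are distinguished for free.
PARTICLE_SURFACES = {"は", "が", "の", "も", "と", "か", "に", "を", "で", "へ"}

def to_title_case(pairs):
    words = []
    for jp, roma in pairs:
        if jp in PARTICLE_SURFACES:
            words.append(roma.lower())
        else:
            words.append(roma.capitalize())
    return " ".join(words)

pairs = [('カツ', 'cutlet'), ('カレー', 'curry'), ('は', 'wa'), ('美味しい', 'oishii')]
print(to_title_case(pairs))   # 'Cutlet Curry wa Oishii'
```

In practice you would key this on the part-of-speech tag from MeCab (助詞) rather than a hand-maintained surface list, which is exactly the metadata the intermediary could carry.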

An end-user may also want to make use of the intermediate array directly if they're displaying some sort of furigana-type thing.
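For example, the same pairs could drive an HTML ruby rendering (a hypothetical helper, just to illustrate the furigana-style use case):

```python
# Render (JP, romaji) pairs as HTML <ruby> annotations, placing the
# romaji above each Japanese token the way furigana readings are shown.
def to_ruby(pairs):
    return "".join(f"<ruby>{jp}<rt>{roma}</rt></ruby>" for jp, roma in pairs)

pairs = [('カツ', 'cutlet'), ('は', 'wa')]
print(to_ruby(pairs))
# <ruby>カツ<rt>cutlet</rt></ruby><ruby>は<rt>wa</rt></ruby>
```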

I made a quick and ugly proof-of-concept of this (based on an older version of Cutlet though, so you may need to rebase):

 def map_romaji_word(self, text):
        """Return an intermediary array of (JP, romaji) tokens
        """
        if not text:
            # return an empty list to match the declared return type
            return []

        # convert all full-width alphanum to half-width, since it can go out as-is
        text = mojimoji.zen_to_han(text, kana=False)
        # replace half-width katakana with full-width
        text = mojimoji.han_to_zen(text, digit=False, ascii=False)

        words = self.tagger(text)

        tempeng = ''
        tempjp = ''
        out = []

        for wi, word in enumerate(words):
            pw = words[wi - 1] if wi > 0 else None
            nw = words[wi + 1] if wi < len(words) - 1 else None

            # resolve split verbs / adjectives
            roma = self.romaji_word(word)
            if roma and tempeng and tempeng[-1] == 'っ':
                tempeng = tempeng[:-1] + roma[0]
            if word.feature.pos2 == '固有名詞':
                roma = roma.title()
            # handle punctuation with atypical spacing
            if word.surface in '「『':
                tempeng += ' ' + roma
                tempjp += word.surface
                continue
            if roma in '([':
                tempeng += ' ' + roma
                tempjp += roma
                continue
            if roma == '/':
                tempeng += '/'
                tempjp += '/'
                continue
            tempeng += roma
            tempjp += word.surface

            # no space sometimes
            # お酒 -> osake
            if word.feature.pos1 == '接頭辞': continue
            # 今日、 -> kyou, ; 図書館 -> toshokan
            if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
            # special case for half-width commas
            if nw and nw.surface == ',': continue
            # 思えば -> omoeba
            if nw and nw.feature.pos2 == '接続助詞': continue
            # 333 -> 333 ; this should probably be handled in mecab
            if (word.surface.isdigit() and 
                    nw and nw.surface.isdigit()):
                continue
            # そうでした -> sou deshita
            if (nw and word.feature.pos1 in ('動詞', '助動詞','形容詞')
                   and nw.feature.pos1 == '助動詞'
                   and nw.surface != 'です'):
                continue
            out.append((tempjp, tempeng))
            tempjp = ''
            tempeng = ''
        # flush any token still buffered by a trailing `continue`
        if tempjp or tempeng:
            out.append((tempjp, tempeng))
        return out

polm commented 4 years ago

@krackers Thanks for working on this. Returning tuples rather than a fully formed string is a good idea, and would make integration with other tools easier, but it requires thorough refactoring and testing.

For now I've implemented and pushed a small change to get this working, though it needs more examples to test on.

polm commented 3 years ago

This is supported in the latest release so closing for now.