polm closed this 3 years ago

It should be possible to support title case, so that all words except particles are capitalized. So この世界の片隅に would be "Kono Sekai no Katasumi ni".
This would require tokenizing the input Japanese text into words, which is a large problem requiring lots of data and a machine learning algorithm to do well. I would recommend not pursuing the issue as written, since it would significantly increase the scope and dependencies of this project, and a major advantage of your current project is its focus and lightweight, pure-Python nature.
Perhaps it would still be useful to have a method that takes a romanized, space-separated string as input and outputs a title-cased version of it. You might still consider that out of scope, but it would at least not require any heavy library additions.
@garfieldnate This library already relies on MeCab and a dictionary to do Japanese tokenization; I'm not sure what you're talking about.
Just went to the demo and realized this T_T. Sorry, you can ignore my comment. Looks like a pretty simple feature to implement.
I don't think this strictly needs to be a part of the library. One could always just post-process the output to have it capitalize everything except the particles.
particles = ["no", "wa", "ga", "mo", "to", "ka", "ni"]
title_case = lambda s: " ".join([x.capitalize() if x not in particles else x.lower() for x in s.split()])
print(title_case("kino no tabi"))
Then just feed the output of cutlet to title_case. But I suppose it could be added as a convenience function.
I'm also not sure if there are any edge cases where you'd get more accurate results by doing the particle detection on the hiragana at the token level instead of on the resulting romaji (I can't think of any cases at the moment). Ah, I suppose some words like 荷 and 和 are such an edge case, where you can't just post-process the romaji. So it is indeed better to integrate this as part of the library.
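To make that edge case concrete, here is a minimal sketch (the romanization is hand-written for illustration, not actual cutlet output) of how the romaji-only post-processor above goes wrong on the noun 荷:

particles = ["no", "wa", "ga", "mo", "to", "ka", "ni"]
title_case = lambda s: " ".join([x.capitalize() if x not in particles else x.lower() for x in s.split()])

# 荷は重い ("the load is heavy"): this "ni" is the noun 荷, not the particle に,
# but a romaji-only post-processor has no way to tell the two apart.
print(title_case("ni wa omoi"))  # 'ni wa Omoi' -- should be 'Ni wa Omoi'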
One other idea that might make adding similar features cleaner: instead of directly parsing the MeCab output and returning the romaji string, you could have an intermediate representation, an array of (JP, romaji) pairs. E.g. right now we have
katsu.romaji("カツカレーは美味しい") # 'Cutlet curry wa oishii'
but consider if you instead rewrote the parser to return
katsu.map_romaji_word("カツカレーは美味しい") # [('カツ', 'cutlet'), ('カレー', 'curry'), ('は', 'wa'), ('美味しい', 'oishii')]
Now the final output can be represented as a transformation of the above intermediary. For the current behavior, we just take the second item of each pair and concatenate them with spaces. For title-case behavior, you can check whether the corresponding JP token is a particle or not. (Or maybe you could add additional metadata obtained from MeCab, such as part of speech, to the intermediary?) You could also add an option to capitalize proper nouns, etc. Essentially, this decouples the parsing itself from the final user-facing representation.
An end user may also want to use the intermediate array directly, e.g. if they're displaying some sort of furigana-type thing.
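To illustrate, here is a rough sketch of the title-case transformation over such an intermediary, assuming the pairs are extended with MeCab's pos1 field as suggested above (all names here are hypothetical, not part of cutlet's actual API):

PARTICLE_POS = ('助詞',)

def title_case_tokens(tokens):
    """Title-case (surface, romaji, pos1) tokens, lowercasing only particles."""
    return ' '.join(
        romaji.lower() if pos1 in PARTICLE_POS else romaji.capitalize()
        for surface, romaji, pos1 in tokens)

# この世界の片隅に, tagged by hand for the example:
tokens = [('この', 'kono', '連体詞'), ('世界', 'sekai', '名詞'),
          ('の', 'no', '助詞'), ('片隅', 'katasumi', '名詞'),
          ('に', 'ni', '助詞')]
print(title_case_tokens(tokens))  # Kono Sekai no Katasumi ni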
I made a quick and ugly proof-of-concept of this (based on an older version of Cutlet though, so you may need to rebase):
def map_romaji_word(self, text):
    """Return an intermediary array of (JP, romaji) tokens."""
    if not text:
        return []  # was '': return the same type as the non-empty case
    # convert all full-width alphanum to half-width, since it can go out as-is
    text = mojimoji.zen_to_han(text, kana=False)
    # replace half-width katakana with full-width
    text = mojimoji.han_to_zen(text, digit=False, ascii=False)
    words = self.tagger(text)
    tempeng = ''
    tempjp = ''
    out = []
    for wi, word in enumerate(words):
        pw = words[wi - 1] if wi > 0 else None  # unused in this PoC
        nw = words[wi + 1] if wi < len(words) - 1 else None
        # resolve split verbs / adjectives
        roma = self.romaji_word(word)
        # a leftover っ from the previous token becomes a doubled consonant
        if roma and tempeng and tempeng[-1] == 'っ':
            tempeng = tempeng[:-1] + roma[0]
        if word.feature.pos2 == '固有名詞':
            roma = roma.title()
        # handle punctuation with atypical spacing
        if word.surface in '「『':
            tempeng += ' ' + roma
            tempjp += word.surface
            continue
        if roma and roma in '([':  # guard: '' in '([' is True in Python
            tempeng += ' ' + roma
            tempjp += roma
            continue
        if roma == '/':
            tempeng += '/'
            tempjp += '/'
            continue
        tempeng += roma
        tempjp += word.surface
        # no space sometimes
        # お酒 -> osake
        if word.feature.pos1 == '接頭辞': continue
        # 今日、 -> kyou, ; 図書館 -> toshokan
        if nw and nw.feature.pos1 in ('補助記号', '接尾辞'): continue
        # special case for half-width commas
        if nw and nw.surface == ',': continue
        # 思えば -> omoeba
        # was `in ('接続助詞')`, which is a substring test on a plain string
        if nw and nw.feature.pos2 == '接続助詞': continue
        # 333 -> 333 ; this should probably be handled in mecab
        if (word.surface.isdigit() and
                nw and nw.surface.isdigit()):
            continue
        # そうでした -> sou deshita
        if (nw and word.feature.pos1 in ('動詞', '助動詞', '形容詞')
                and nw.feature.pos1 == '助動詞'
                and nw.surface != 'です'):
            continue
        out.append((tempjp, tempeng))
        tempjp = ''
        tempeng = ''
    return out
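For completeness, assuming the proof of concept above and mirroring the example earlier in the thread, the current behavior falls out as a trivial transformation of the pairs:

pairs = katsu.map_romaji_word("カツカレーは美味しい")
# [('カツ', 'cutlet'), ('カレー', 'curry'), ('は', 'wa'), ('美味しい', 'oishii')]
print(' '.join(romaji for _, romaji in pairs).capitalize())
# 'Cutlet curry wa oishii'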
@krackers Thanks for working on this. Returning tuples rather than a fully formed string is a good idea and would make integration with other tools easier, but it requires thorough refactoring and testing.
For now I've implemented and pushed a small change to get this working, though it needs more examples to test on.
This is supported in the latest release so closing for now.
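For anyone finding this thread later, a minimal usage sketch, assuming the new option is exposed as a title keyword on romaji() (the parameter name is an assumption; check cutlet's README for the released API):

import cutlet

katsu = cutlet.Cutlet()
# 'title=True' is assumed here, not confirmed against the released signature.
print(katsu.romaji("この世界の片隅に", title=True))
# expected: 'Kono Sekai no Katasumi ni'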