Closed: kyle-v6x closed this issue 8 months ago
I recently added a romaji_tokens function. Is this what you want?
import cutlet
katsu = cutlet.Cutlet()
text = 'また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。'
words = katsu.tagger(cutlet.normalize_text(text))
toks = katsu.romaji_tokens(words)
for tok, word in zip(toks, words):
    print(word.surface, tok.surface, sep="\t")
Ideally a token like ('東寺', 'Touzi') would be ('トウジ', 'Touzi') so that I can align between katakana and romaji, but I can use this to make my own aligner!
The reason I need both is that we use a forced-alignment model trained on Hiragana/Katakana only, so my word timings are in that format, but we want to use Cutlet for our Japanese text processing.
Glad to hear this works for you. Note you can get kana from token.feature.pron or token.feature.kana - see the unidic-py README for details.
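Putting that together with the snippet above, here is a minimal sketch that prints (kana, romaji) pairs per token. The fallback to the surface form for tokens without a reading (punctuation and the like) is my own addition, not something cutlet does for you:

import cutlet

katsu = cutlet.Cutlet()
text = 'また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。'
words = katsu.tagger(cutlet.normalize_text(text))
toks = katsu.romaji_tokens(words)
for tok, word in zip(toks, words):
    # pron/kana come from UniDic via fugashi; fall back to the surface
    # form when there is no reading (e.g. punctuation)
    kana = word.feature.pron or word.feature.kana
    if not kana or kana == '*':
        kana = word.surface
    print(kana, tok.surface, sep="\t")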
Hello, thanks for the great work on this.
I have a use case where I need both romaji (nihon-shiki) and kana. In another issue regarding furigana, you mention that you can use fugashi directly, like so:
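(The referenced snippet is not reproduced above; roughly, the fugashi-only approach looks like this, assuming a UniDic dictionary such as unidic-lite. This is my reconstruction, not the exact code from that issue:)

import fugashi

tagger = fugashi.Tagger()  # requires a UniDic dictionary, e.g. unidic-lite
for word in tagger('また、東寺のように'):
    # kana is the written reading, pron the pronunciation, both in katakana
    print(word.surface, word.feature.kana, word.feature.pron)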
However, it seems the space-handling in this library is slightly customized, so running fugashi separately wouldn't line up exactly with Cutlet's output.
Could we optionally get the raw kana returned alongside the romaji? If so, this would be the one Japanese processing library to rule them all.