Align Romaji and Kana - GitHub Issues

kyle-v6x commented 6 months ago

Hello, thanks for the great work on this.

I have a use case that needs both romaji (Nihon-shiki) and kana. In another issue about furigana, you mentioned that fugashi can be used to get kana like this:

import fugashi

tagger = fugashi.Tagger()
kana = [nn.feature.kana for nn in tagger("吾輩は猫である")]
# => ['ワガハイ', 'ハ', 'ネコ', 'デ', 'アル']

However, cutlet seems to apply its own rules for where spaces go, so its romaji words do not line up one-to-one with fugashi's tokens:

import fugashi
import cutlet

tagger = fugashi.Tagger()
nihon = cutlet.Cutlet(use_foreign_spelling=False, system="nihon")

raw_text = 'また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。'
romaji = nihon.romaji(raw_text)
kana = " ".join([nn.feature.kana for nn in tagger(raw_text)])
kana_romaji = nihon.romaji(kana)

print(f"Romaji text has {len(romaji.split(' '))} words, but kana text has {len(kana.split(' '))} words.")
print("This means that we can not align the romaji and kana text for any use case.")
print(f"Correct romaji: {romaji}\nKana: {kana}\nRomaji from kana: {kana_romaji}")

> Romaji text has 19 words, but kana text has 26 words.
> This means that we can not align the romaji and kana text for any use case.
> Correct romaji: Mata, Touzi no you ni, go daimyouou to yobareru, syuyou na myouou no tyuuou ni haisareru koto mo ooi.
> Kana: マタ  トウジ ノ ヨウ ニ  ゴ ダイ ミョウオウ ト ヨバ レル  シュヨウ ナ ミョウオウ ノ チュウオウ ニ ハイサ レル コト モ オオイ 
> Romaji from kana: Mata touzi no you ni godai myouou to yoba reru syuyou na myouou no tyuu Ou ni haisa reru koto mo ooi

Could cutlet optionally return the raw kana along with the romaji? If so, this would be the one Japanese processing library to rule them all.
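Part of the 26-versus-19 gap comes from how the words are counted rather than from tokenization: `split(' ')` keeps the empty strings left behind where punctuation had an empty reading. A quick check on the two strings printed above (copied verbatim from that output, nothing re-run through cutlet or fugashi) might look like this:

```python
# Strings copied from the output above. The kana line has double spaces and a
# trailing space where punctuation produced an empty reading.
kana = "マタ  トウジ ノ ヨウ ニ  ゴ ダイ ミョウオウ ト ヨバ レル  シュヨウ ナ ミョウオウ ノ チュウオウ ニ ハイサ レル コト モ オオイ "
romaji = "Mata, Touzi no you ni, go daimyouou to yobareru, syuyou na myouou no tyuuou ni haisareru koto mo ooi."

naive_count = len(kana.split(" "))  # also counts the empty strings
kana_words = kana.split()           # split() with no argument drops them
romaji_words = romaji.split()

print(naive_count, len(kana_words), len(romaji_words))
# prints: 26 22 19
```

So the mismatch that actually needs aligning is 22 kana tokens against 19 romaji words: ダイ + ミョウオウ, ヨバ + レル and ハイサ + レル each end up as a single romaji word.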

polm commented 6 months ago

I recently added a romaji_tokens function. Is this what you want?

import cutlet

katsu = cutlet.Cutlet()
text = 'また、東寺のように、五大明王と呼ばれる、主要な明王の中央に配されることも多い。'
words = katsu.tagger(cutlet.normalize_text(text))
toks = katsu.romaji_tokens(words)

for tok, word in zip(toks, words):
    print(word.surface, tok.surface, sep="\t")

kyle-v6x commented 6 months ago

Ideally a token like ('東寺', 'Touzi') would be ('トウジ', 'Touzi') so that I can align between katakana and romaji, but I can use this to make my own aligner!

The reason I need both is that we use a forced-alignment model trained only on hiragana/katakana, so my word timings are in kana, but we want to use cutlet for the rest of our Japanese text processing.

polm commented 6 months ago

Glad to hear this works for you. Note you can get kana from token.feature.pron or token.feature.kana - see the unidic-py README for details.
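For anyone building the aligner mentioned above, here is a minimal sketch of one possible approach: pair each word's kana reading with its romaji token, then merge consecutive pairs until a token is followed by a space. The `Piece` class and `group_words` function are hypothetical helpers, not part of cutlet, and the sample values are hand-written stand-ins based on the readings earlier in this thread, not real `romaji_tokens` output. The sketch also assumes each romaji token carries a boolean flag saying whether a space follows it.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """One tagger word paired with its romaji token (hypothetical helper)."""
    kana: str    # reading of the word, e.g. from word.feature.kana
    romaji: str  # surface of the matching romaji token
    space: bool  # assumed flag: True if a space follows this token


def group_words(pieces):
    """Merge consecutive pieces into (kana, romaji) pairs, one per romaji word."""
    groups = []
    kana_buf, roma_buf = "", ""
    for piece in pieces:
        kana_buf += piece.kana
        roma_buf += piece.romaji
        if piece.space:
            # A space ends the current romaji word, so flush the buffers.
            groups.append((kana_buf, roma_buf))
            kana_buf, roma_buf = "", ""
    if kana_buf or roma_buf:
        # Flush whatever is left when the last token has no trailing space.
        groups.append((kana_buf, roma_buf))
    return groups


# Hand-written stand-in data; real token boundaries may differ.
pieces = [
    Piece("ト", "to", True),
    Piece("ヨバ", "yoba", False),
    Piece("レル", "reru", False),
    Piece("", ",", True),  # punctuation with an empty reading
    Piece("シュヨウ", "syuyou", True),
]
aligned = group_words(pieces)
print(aligned)
```

With real data, the pieces could presumably be built from the `zip(toks, words)` loop above, along the lines of `Piece(word.feature.kana or "", tok.surface, tok.space)`; the `tok.space` attribute is an assumption about cutlet's token objects and should be checked against the installed version.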

polm / cutlet

Align Romaji and Kana #46