Closed TaakoMagnusen closed 9 months ago
It sounds like the romaji_tokens function from the latest release should help you. It can be used like this:
from cutlet import Cutlet, normalize_text
text = "... whatever ..."
cut = Cutlet()
toks = cut.tagger(normalize_text(text))
romaji_tokens = cut.romaji_tokens(toks)
You will then have one romaji token for each token in the original sentence.
This isn't exactly a mapping of characters to romaji - that's not possible in the general case, since you have 熟字訓 words like 再従兄弟 (hatoko) where the mapping is only at the level of the whole term.
I think there are some tools that map individual characters from Japanese input to kana - maybe part of Rikai-chan? - but I haven't looked at it in a while.
Closing because I believe there's nothing to do here beyond the solution outlined above, but if you (or someone else) have some feedback feel free to follow up.
I am writing a script that uses Whisper to transcribe Japanese speech, and I'd like to use cutlet to produce a romaji transcription. Right now I'm a little stuck because the output of Whisper when using word_timestamps=True can produce word segments that break up multi-character words. So when I use cutlet to transcribe entire sentence segments output by Whisper, it works fine, but I'd like a map of the individual word timings so that I can create a text animation that highlights the romaji as the words are said in the audio.
Here's an example of the issue:
Full segment output from whisper and cutlet
but the way this is broken up by whisper is the following:
As you can see, using cutlet.romaji() on each "word" as defined by the Whisper transcription doesn't work. I tried using cutlet.romaji_word() but got this error:
I've attached the full output of the Whisper transcription for the example above (it includes the entire transcription of the content I'm transcribing): whisper_transcription.json
(By the way, the speech I'm transcribing is the lyrics to the following song: https://www.youtube.com/watch?v=ZAJ3nfQTw4A)
Use the following code with the attached json to show the output.
If I had a list mapping how the characters from the full sentence are used to create the romaji, I would be able to cycle through the characters in the mapping and find the start and end positions of those characters in the Whisper word list, and from that build a map of the start/end times of each piece of romaji.
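A rough sketch of that idea, working at the token level rather than per character. It deliberately does not call cutlet or Whisper; it takes plain lists so the alignment logic can be checked on its own. The function names (char_spans, romaji_timings) and the sample data are made up for illustration. It assumes that the Whisper word strings and the token surfaces concatenate to exactly the same text (same normalization, no stray whitespace), and that each Whisper word is a dict with "word", "start" and "end" keys.

```python
from typing import Dict, List, Tuple


def char_spans(pieces: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of each piece in the joined text."""
    spans = []
    pos = 0
    for piece in pieces:
        spans.append((pos, pos + len(piece)))
        pos += len(piece)
    return spans


def romaji_timings(
    token_surfaces: List[str],
    token_romaji: List[str],
    whisper_words: List[Dict],
) -> List[Dict]:
    """Give each romaji token the time range of the Whisper words it overlaps.

    Assumption: "".join(token_surfaces) equals the concatenation of the
    Whisper "word" strings, so character offsets are comparable.
    """
    word_spans = char_spans([w["word"] for w in whisper_words])
    token_spans = char_spans(token_surfaces)
    timings = []
    for (t_start, t_end), romaji in zip(token_spans, token_romaji):
        # Keep every Whisper word whose character span overlaps this token.
        hits = [
            w
            for (w_start, w_end), w in zip(word_spans, whisper_words)
            if w_start < t_end and w_end > t_start
        ]
        if hits:
            timings.append(
                {"romaji": romaji, "start": hits[0]["start"], "end": hits[-1]["end"]}
            )
    return timings


# Hypothetical data: Whisper split the sentence in the middle of a word.
words = [
    {"word": "東", "start": 0.0, "end": 0.4},
    {"word": "京に", "start": 0.4, "end": 0.9},
    {"word": "行く", "start": 0.9, "end": 1.5},
]
surfaces = ["東京", "に", "行く"]      # token surfaces from the tokenizer
romaji = ["Toukyou", "ni", "iku"]     # one romaji string per token
result = romaji_timings(surfaces, romaji, words)
```

When a Whisper word straddles two tokens, both tokens get that word's full time range, so neighbouring highlights can overlap. Splitting the time within a word (for example, proportionally to character count) would be a further guess on top of this.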
Thanks for your time and for this library! It's incredibly useful and easy to use!