Add api to get character to romaji map as list of dicts

TaakoMagnusen commented 10 months ago

I am writing a script that uses Whisper to transcribe Japanese speech, and I'd like to use cutlet to produce a romaji transcription. Right now I'm a little stuck because Whisper's output with word_timestamps=True can contain word segments that split multi-character words. When I use cutlet to romanize the entire sentence segments output by Whisper, it works fine, but I'd like a map of the individual word timings so that I can create a text animation that highlights the romaji as the words are said in the audio.

Here's an example of the issue:

Full segment output from Whisper and cutlet:

raw whisper segment:     大体私ら知らなくて 特にもいけない今日だって
cutlet full segment:     daitai watakushira shiranakute tokuni mo ikenai kyou da tte

But Whisper breaks the same segment up as follows:

whisper per word:    大-体-私-ら-知-ら-なく-て- 特-に-も-い-け-ない-今日-だ-って
cutlet per word:     oo-karada-watakushi-ra-chi-ra-naku-te-toku-ni-mo-i-ke-nai-kyou-da-tte

As you can see, calling cutlet.romaji() on each "word" as defined by the Whisper transcription doesn't work. I tried using cutlet.romaji_word() instead but got this error:

AttributeError                            Traceback (most recent call last)
Cell In[38], line 14
     11 print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')
     13 single_word_whisper_line = '-'.join([word for word in word_list])
---> 14 single_word_romaji_line = '-'.join([katsu.romaji_word(word) for word in word_list])
     15 print(f'whisper per word:\t {single_word_whisper_line}')
     16 print(f'cutlet per word:\t {single_word_romaji_line}')

Cell In[38], line 14, in <listcomp>(.0)
     11 print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')
     13 single_word_whisper_line = '-'.join([word for word in word_list])
---> 14 single_word_romaji_line = '-'.join([katsu.romaji_word(word) for word in word_list])
     15 print(f'whisper per word:\t {single_word_whisper_line}')
     16 print(f'cutlet per word:\t {single_word_romaji_line}')

File ~/.pyenv/versions/karagen/lib/python3.11/site-packages/cutlet/cutlet.py:319, in Cutlet.romaji_word(self, word)
    316 def romaji_word(self, word):
    317     """Return the romaji for a single word (node)."""
--> 319     if word.surface in self.exceptions:
    320         return self.exceptions[word.surface]
    322     if word.surface.isdigit():

AttributeError: 'str' object has no attribute 'surface'

I've attached the full Whisper transcription output for the example above (it includes the entire transcription of the content I'm transcribing): whisper_transcription.json

(By the way, the speech I'm transcribing is the lyrics to the following song: https://www.youtube.com/watch?v=ZAJ3nfQTw4A)

Use the following code with the attached JSON to reproduce the output.

import json
import cutlet

katsu = cutlet.Cutlet()
katsu.use_foreign_spelling = False

# Load the Whisper output produced with word_timestamps=True.
with open('/Users/silman/Desktop/whisper_transcription.json', 'r') as f:
    data = json.load(f)

for sentence_segment in data['segments']:
    # Collect the text of each "word" Whisper emitted for this segment.
    word_list = [word_segment['word'] for word_segment in sentence_segment['words']]

    # Romanize the whole segment at once (this gives the expected reading).
    full_segment = ''.join(word_list)
    print(f'raw whisper segment:\t {full_segment}')
    print(f'cutlet full segment:\t {katsu.romaji(full_segment, capitalize=False)}')

    # Romanize each Whisper "word" in isolation (this is where readings go wrong).
    single_word_whisper_line = '-'.join(word_list)
    single_word_romaji_line = '-'.join([katsu.romaji(word) for word in word_list])
    print(f'whisper per word:\t {single_word_whisper_line}')
    print(f'cutlet per word:\t {single_word_romaji_line}')

If I had a list mapping how the characters from the full sentence are used to create the romaji, I would be able to cycle through the characters in the mapping and find their start and end positions, and from that build a map of the start/end positions of the romaji.

Thanks for your time and for this library! It's incredibly useful and easy to use!

polm commented 10 months ago

It sounds like the romaji_tokens function from the latest release should help you. It can be used like this:

from cutlet import Cutlet, normalize_text

text = "... whatever ..."
cut = Cutlet()
toks = cut.tagger(normalize_text(text))

romaji_tokens = cut.romaji_tokens(toks)

You will then have one romaji token for each token in the original sentence.

This isn't exactly a mapping of characters to romaji - that's not possible in the general case, since you have 熟字訓 words like 再従兄弟 (hatoko) where the mapping is only at the level of the whole term.

I think there are some tools that map individual characters from Japanese input to kana - maybe part of Rikai-chan? - but I haven't looked at it in a while.
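
To connect the token-level output back to Whisper's timings, one option is to align by character offset rather than by romaji. The following is only a rough sketch, not part of cutlet: `char_times` and `token_timings` are hypothetical helper names, the timing values are invented for illustration, and it assumes each Whisper word is a dict with `word`, `start` and `end` keys. It also assumes the token surfaces (for example `tok.surface` for each token returned by the tagger) spell out the same characters as the Whisper text once whitespace is ignored, which may not hold if normalization changes the text.

```python
def char_times(words):
    """Expand Whisper word segments into one (start, end) pair per character.

    Whitespace is skipped, since tokenizer surfaces do not include it.
    """
    times = []
    for w in words:
        for ch in w["word"]:
            if not ch.isspace():
                times.append((w["start"], w["end"]))
    return times


def token_timings(words, surfaces):
    """Assign each token surface the time span of the characters it covers.

    Returns a list of (surface, start, end) tuples, in token order.
    """
    times = char_times(words)
    result = []
    pos = 0
    for surface in surfaces:
        n = sum(1 for ch in surface if not ch.isspace())
        if n == 0:
            continue  # nothing to align for an empty or whitespace-only token
        span = times[pos:pos + n]
        if len(span) < n:
            raise ValueError(f"token {surface!r} runs past the end of the Whisper text")
        # Start of the first covered character, end of the last one.
        result.append((surface, span[0][0], span[-1][1]))
        pos += n
    return result


if __name__ == "__main__":
    # Invented timings, for illustration only.
    words = [
        {"word": "大", "start": 0.0, "end": 0.2},
        {"word": "体", "start": 0.2, "end": 0.4},
        {"word": "私", "start": 0.4, "end": 0.7},
        {"word": "ら", "start": 0.7, "end": 0.9},
    ]
    # In practice these would be the surfaces of the tagger's tokens.
    surfaces = ["大体", "私", "ら"]
    for surface, start, end in token_timings(words, surfaces):
        print(surface, start, end)
```

Since there is one romaji token per input token, the timing at index i in the result would apply to the romaji token at index i, provided no empty tokens were skipped. A token that Whisper split across several "words" simply takes the start of its first character and the end of its last.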

polm commented 9 months ago

Closing because I believe there's nothing to do here beyond the solution outlined above, but if you (or someone else) have any feedback, feel free to follow up.

polm / cutlet

Add api to get character to romaji map as list of dicts #40