Converting to romanji when the text is tokenized

polm / cutlet

Japanese to romaji converter in Python

https://polm.github.io/cutlet/

MIT License

Converting to romanji when the text is tokenized #45

Closed: echan00 closed this issue 6 months ago

echan00 commented 6 months ago

Hi Paul

Thanks for the awesome library. I have a problem where I'm trying to convert a tokenized Japanese text to Romanji.

The full sentence "何人ですか?" is correctly converted to "Nanijindesu ka?".

But if I tokenize the text to [何, 人, ですか, ?] and convert each token to romanji separately, the result is incorrect because the surrounding context is missing.

How would I convert Japanese text to Romanji so I can get two matching tokenized arrays?

polm commented 6 months ago

Glad you're enjoying the library. Please note the word "romaji" has no "n".

何人ですか would normally be read "Nannin desu ka" ("How many people?"). "Nanijin" is also valid but drastically less common.

If you need aligned tokens and romaji you can use the romaji_tokens functionality that was added recently.

import cutlet

katsu = cutlet.Cutlet()
words = cutlet.tagger(cutlet.normalize_text("何人ですか"))
toks = katsu.romaji_tokens(words)

for tok, word in zip(toks, words):
    print(word.surface, tok.surface, sep="\t")

echan00 commented 6 months ago

Oh you're a life saver. Am I missing something?

AttributeError: module 'cutlet' has no attribute 'tagger'

EDIT: I am using cutlet 0.3.0

To clarify:

[何, 名, 様, ですか] -> [Nan, na, you, desu ka] (currently incorrect when each token is converted separately)
[何, 名, 様, ですか] -> [Nan, mei, sama, desu ka] (expected correct version)

polm commented 6 months ago

Sorry, there was a mistake in the example code. It should be katsu.tagger: the tagger object is on the Cutlet instance, not on the cutlet module.
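
With that fix applied, the earlier example might look like the sketch below. It takes the tagger from the instance (katsu.tagger) and otherwise uses only the calls from the example above (normalize_text, romaji_tokens, and the surface attribute). The helper name aligned_romaji is hypothetical, introduced here for illustration. The sketch assumes, as the original loop does, that romaji_tokens returns one token per input word.

```python
def aligned_romaji(katsu, text):
    """Return (surface, romaji) pairs for `text`, one pair per token.

    `katsu` is expected to be a cutlet.Cutlet instance. The tagger is read
    from the instance (katsu.tagger), not from the cutlet module.
    NOTE: `aligned_romaji` is a hypothetical helper, not part of cutlet.
    """
    # Tokenize the whole text once so every word keeps its context.
    words = katsu.tagger(text)
    # Convert all words together; assumed to yield one token per word.
    toks = katsu.romaji_tokens(words)
    return [(word.surface, tok.surface) for word, tok in zip(words, toks)]


if __name__ == "__main__":
    try:
        import cutlet  # third-party; also needs a UniDic dictionary installed
    except ImportError:
        print("cutlet is not installed; skipping the demo")
    else:
        katsu = cutlet.Cutlet()
        text = cutlet.normalize_text("何名様ですか")
        for surface, romaji in aligned_romaji(katsu, text):
            print(surface, romaji, sep="\t")
```

Because the whole sentence is tagged before conversion, each token's reading is chosen in context rather than token by token, which is what the original question needed.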

echan00 commented 6 months ago

Amaze balls!