Closed by echan00 8 months ago
Glad you're enjoying the library. Please note the word "romaji" has no "n".
何人ですか would normally be read "Nannin desu ka" ("How many people?"). "Nanijin" is also valid but drastically less common.
If you need aligned tokens and romaji, you can use the romaji_tokens functionality that was added recently:
import cutlet
katsu = cutlet.Cutlet()
words = cutlet.tagger(cutlet.normalize_text("何人ですか"))
toks = katsu.romaji_tokens(words)
for tok, word in zip(toks, words):
    print(word.surface, tok.surface, sep="\t")
Oh, you're a lifesaver. Am I missing something?
AttributeError: module 'cutlet' has no attribute 'tagger'
EDIT: I am using cutlet 0.3.0
To clarify:
[何, 名, 様, ですか] -> [Nan, na, you, desu ka] (currently incorrect when each token is converted separately)
[何, 名, 様, ですか] -> [Nan, mei, sama, desu ka] (expected correct version)
Sorry, there was a mistake in the example code. It should be katsu.tagger, not cutlet.tagger: the tagger object is on the Cutlet instance, not the module.
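For reference, here is a sketch of the example with that fix applied, wrapped in a small helper. The names aligned_pairs and main are just illustrative and not part of cutlet. It assumes, as in the snippet above, that romaji_tokens returns one token per input word, so the two lists can be zipped.

```python
def aligned_pairs(katsu, text, normalize=None):
    """Return (original surface, romaji) pairs, one per tokenizer word.

    `katsu` is expected to be a cutlet.Cutlet instance. `normalize` is an
    optional callable such as cutlet.normalize_text.
    """
    if normalize is not None:
        text = normalize(text)
    # The tagger is an attribute of the Cutlet instance, not of the module.
    words = katsu.tagger(text)
    # Convert all words in one call so each word is romanized in context.
    toks = katsu.romaji_tokens(words)
    return [(word.surface, tok.surface) for word, tok in zip(words, toks)]


def main():
    # cutlet is a third-party package; importing it here keeps the helper
    # above usable without it.
    import cutlet

    katsu = cutlet.Cutlet()
    for surface, romaji in aligned_pairs(katsu, "何人ですか", cutlet.normalize_text):
        print(surface, romaji, sep="\t")
```

Calling main() needs cutlet and a dictionary installed. The point is that the whole word list is passed to romaji_tokens at once instead of converting each token on its own.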
Amaze balls!
Hi Paul
Thanks for the awesome library. I have a problem where I'm trying to convert tokenized Japanese text to Romanji.
"何人ですか?" is correctly converted to "Nanijindesu ka?".
But if I tokenize the text to [何, 人, ですか, ?] and convert each token to Romanji separately, the result is incorrect because the context is missing.
How would I convert Japanese text to Romanji so that I get two matching tokenized arrays?