polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License

Is there a good way to detect non-Japanese text? #30

Closed (Infinoid closed this issue 2 years ago)

Infinoid commented 2 years ago

I'm calling Cutlet.romaji() to convert Japanese text to romaji, and it's working great. Thanks for the awesome library.

But due to the nature of the data I'm working with, I get the occasional Korean or English string in the mix, and the output for Korean text looks like '???????'.

Rather than writing code to detect whether the output string contains mostly question marks, is there a clean way to detect non-Japanese text?

polm commented 2 years ago

There is not any specific cutlet feature for detecting non-Japanese text.

Besides checking the input string yourself with regexes or something similar, MeCab has a feature called char_type, which is available on the nodes you get from fugashi. It doesn't recognize hangul specifically, but it has categories like ALPHA and SYMBOL (separate from the categories for kanji and kana) that should let you detect non-Japanese text.

I also haven't tried this before, but you could maybe add hangul to the mapping tables cutlet uses internally.

You might also be able to make use of unihandecode, which handles Korean and Japanese.
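As an illustration of the "check the input string yourself using regexes" suggestion, here is a minimal sketch that uses only the standard library. The function name classify_text and the exact Unicode ranges are assumptions made for this example, not part of cutlet, and the ranges are deliberately rough: a kanji-only title would be labeled Japanese even if it is actually Chinese.

```python
import re

# Hiragana, katakana (including the prolonged sound mark), and CJK unified ideographs.
JAPANESE_RE = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")
# Hangul syllables, Hangul Jamo, and Hangul compatibility Jamo.
HANGUL_RE = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")


def classify_text(text: str) -> str:
    """Return 'korean', 'japanese', or 'other' (hypothetical helper, not a cutlet API)."""
    # Check hangul first: Korean titles may also contain a few CJK ideographs.
    if HANGUL_RE.search(text):
        return "korean"
    if JAPANESE_RE.search(text):
        return "japanese"
    return "other"
```

With something like this in place, only strings classified as "japanese" would need to be passed to Cutlet.romaji(); the rest could be kept as-is or handled separately.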

Infinoid commented 2 years ago

I appreciate the advice. This is definitely a case of bad input and not cutlet's fault... and short of raising a NotJapaneseError or something when presented with unrecognized characters, there isn't much cutlet could do about it.

I'm currently thinking I should run unknown text through chardet or something first, before deciding what to do with it.

Thanks!

polm commented 2 years ago

Having an option to throw an error on text that would be rendered as ? actually sounds like it might be a good idea. I'll think about it!
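Until such an option exists, the same behavior could be approximated outside the library. The sketch below is hypothetical: NotJapaneseError and romaji_strict are names invented for this example, and it assumes that unconvertible characters show up as ASCII question marks in the output, as described earlier in this thread. The converter is passed in as a callable so the wrapper does not depend on any particular cutlet API beyond Cutlet.romaji().

```python
from typing import Callable


class NotJapaneseError(ValueError):
    """Raised when conversion appears to add '?' placeholders (not part of cutlet)."""


def romaji_strict(text: str, convert: Callable[[str], str]) -> str:
    """Convert text, raising if the output has more question marks than the input."""
    result = convert(text)
    # Count ASCII and fullwidth question marks already present in the input,
    # on the assumption that a converter may normalize the fullwidth form.
    expected = text.count("?") + text.count("\uff1f")
    if result.count("?") > expected:
        raise NotJapaneseError(f"unconvertible characters in {text!r}")
    return result


# Intended usage (requires cutlet to be installed):
#   import cutlet
#   katsu = cutlet.Cutlet()
#   romaji_strict(title, katsu.romaji)
```

This is the "count the question marks" approach wrapped in one place, so it has the same weakness: it only notices a problem after the conversion has already run.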

To give me a little more idea of your use case, are you using this on:

Infinoid commented 2 years ago

I'm using it for book titles. Mostly Japanese light novels, but there are occasional non-Japanese ones thrown in too.

polm commented 2 years ago

Got it, thanks for the clarification!