trigrams for japanese, chinese, korean?

tmalsburg / guess-language.el

Emacs minor mode that detects the language you're typing in. Automatically switches spell checker. Supports multiple languages per document.

trigrams for japanese, chinese, korean? #42

Open mooseyboots opened 1 month ago

mooseyboots commented 1 month ago

hi, i'm interested in using this just for the guess-language part only (i.e. not the typo-mode setting or spellchecking) but using all possible languages.

is it possible that there's no japanese (ja), chinese (zh), and korean (ko) in the trigrams data? or am i confused about it somehow?

i did a few tests with chinese and japanese texts and guess-language-region returned zu, i.e. Zulu.

but i must be a little confused, as guess_language.py supports those languages, but it doesn't have ja, zh, or ko in its trigrams files.

perhaps the python package simply selects those languages (and greek) by their script, using the Blocks.txt file? would it be possible to support that also in guess-language.el?

i guess if that's the issue i'm encountering it would require a bit of work to support those languages in this package...

tmalsburg commented 1 month ago

This package doesn't support ja, zh, ko yet. The algorithm was designed for languages using alphabetic writing systems. Not sure it'll work for languages using logographic writing systems, since they likely have many more possible trigrams. The trigrams observed in a short text may not even show up among the top-ranked trigrams of a language, simply because there are so many possible trigrams in a language like Chinese. I guess that's also the reason why guess_language.py doesn't have trigrams for these languages. However, it might be easy to detect these languages based on other features. If you check what guess_language.py is using, we could perhaps use the same approach here. I imagine that unigrams might work if all these languages have logographs that are sufficiently frequent.

mooseyboots commented 1 month ago

thanks for your response.

my understanding is that guess_language.py uses https://github.com/kent37/guess-language/blob/master/guess_language/Blocks.txt to determine the writing system, but i don't understand how. (it looks like blocks.py contains a function that determines which block a single character belongs to?)
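A minimal sketch of how a block lookup of this kind could work, assuming the `start..end; name` line format of Unicode's Blocks.txt. The excerpt below is hand-copied for illustration, and `parse_blocks` and `block_of` are hypothetical names, not the functions in blocks.py:

```python
import bisect

# A tiny excerpt in the "start..end; name" format used by Unicode's Blocks.txt.
# The real file lists every block; these rows are only for illustration.
BLOCKS_EXCERPT = """\
0000..007F; Basic Latin
0370..03FF; Greek and Coptic
3040..309F; Hiragana
30A0..30FF; Katakana
4E00..9FFF; CJK Unified Ideographs
AC00..D7AF; Hangul Syllables
"""


def parse_blocks(text):
    """Return a sorted list of (start, end, name) tuples."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        span, name = line.split(";")
        start, end = span.split("..")
        blocks.append((int(start, 16), int(end, 16), name.strip()))
    return sorted(blocks)


BLOCKS = parse_blocks(BLOCKS_EXCERPT)
STARTS = [start for start, _, _ in BLOCKS]


def block_of(char):
    """Return the block name for one character, or None if no block covers it."""
    code = ord(char)
    # Find the last block whose start is <= code, then check its end.
    i = bisect.bisect_right(STARTS, code) - 1
    if i >= 0 and BLOCKS[i][0] <= code <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```

With the full Blocks.txt loaded instead of the excerpt, the same binary search would cover every assigned block.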

i'm also not sure if emacs itself could simply detect a unicode language system?

tmalsburg commented 1 month ago

If I understand correctly, guess_language.py checks for the presence of, e.g., Katakana, and if there is any, it decides that the text must be in Japanese. That's not ideal because if you have an English paragraph with just a single Katakana character, it will misclassify the paragraph as Japanese.

See here: https://github.com/kent37/guess-language/blob/master/guess_language/guess_language.py#L375

mooseyboots commented 1 month ago

i suspect that's not how it works.

the checks in that function run on the arg scripts, and the function is called (in guessLanguage()) with the result of find_runs() as the scripts arg.

and find_runs() explains itself thus:

    # return run types that used for 40% or more of the string
    # always return basic latin if found more than 15%
    # and extended additional latin if over 10% (for Vietnamese)

https://github.com/kent37/guess-language/blob/8983cc0f511ed81495684653e09b1643b8fd92e7/guess_language/guess_language.py#L359

so it sounds like it's adapted for the case you mention? i.e. the result of find_runs() should only contain Katakana if it makes up 40% or more of the text?

but i'm saying this without knowing the guess language code more than just a casual glance...
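To make the thresholds in those comments concrete, here is a simplified sketch of the idea, not the actual find_runs() code. `script_of` and `dominant_scripts` are hypothetical names, only a handful of blocks are covered, the 10% rule for extended additional Latin is left out, and counting only alphabetic characters is an assumption:

```python
from collections import Counter


def script_of(char):
    """Map a character to a coarse script label (illustrative subset only)."""
    code = ord(char)
    if code <= 0x7F:
        return "Basic Latin"
    if 0x3040 <= code <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "CJK Unified Ideographs"
    if 0xAC00 <= code <= 0xD7AF:
        return "Hangul Syllables"
    return "Other"


def dominant_scripts(text, threshold=0.40, latin_threshold=0.15):
    """Return scripts whose share of the letters reaches the threshold.

    Basic Latin uses the lower threshold, mirroring the quoted comments.
    """
    counts = Counter(script_of(c) for c in text if c.isalpha())
    total = sum(counts.values())
    if total == 0:
        return []
    result = []
    for script, n in counts.items():
        limit = latin_threshold if script == "Basic Latin" else threshold
        if n / total >= limit:
            result.append(script)
    return sorted(result)
```

Under these assumptions, an English paragraph with a single Katakana character yields only `["Basic Latin"]`, because the lone character falls far below the 40% share.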

tmalsburg commented 3 weeks ago

You may be right, I just had a quick glance at the code. Their approach may be reliable but it's also not terribly elegant. I wonder if we can come up with a unified approach: What if we replace logographic characters with placeholders that simply indicate their category? Then we could perhaps again apply the usual trigram approach, so that no separate code path is needed. This should also work nicely for languages that mix different types of characters like Japanese, which uses Chinese kanji, Japanese kanji, hiragana, and katakana, with some Latin characters sprinkled in here and there.
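A rough sketch of what the placeholder idea could look like, written in Python for brevity rather than Emacs Lisp. The category marks, the block ranges, and the names `to_placeholders` and `trigrams` are all assumptions for illustration. Latin text is lowercased so the uppercase marks cannot collide with ordinary letters:

```python
from collections import Counter

# Hypothetical category marks; the ranges are the Unicode blocks for each script.
PLACEHOLDERS = [
    (0x3040, 0x309F, "H"),  # Hiragana
    (0x30A0, 0x30FF, "K"),  # Katakana
    (0x4E00, 0x9FFF, "C"),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "G"),  # Hangul Syllables
]


def to_placeholders(text):
    """Replace each character from a listed block with its category mark."""
    out = []
    for char in text.lower():
        code = ord(char)
        for start, end, mark in PLACEHOLDERS:
            if start <= code <= end:
                out.append(mark)
                break
        else:
            out.append(char)  # keep everything else unchanged
    return "".join(out)


def trigrams(text):
    """Count character trigrams over the placeholder form of the text."""
    s = to_placeholders(text)
    return Counter(s[i:i + 3] for i in range(len(s) - 2))
```

For example, `to_placeholders("日本語のテキスト")` gives `"CCCHKKKK"`, so the trigram table stays small no matter how many distinct kanji appear. Whether such coarse trigrams separate the languages well enough is an open question that would need testing on real text.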