tmalsburg / guess-language.el

Emacs minor mode that detects the language you're typing in. Automatically switches spell checker. Supports multiple languages per document.

Comparing this package with similar ones #4

Closed by peterwvj 7 years ago

peterwvj commented 7 years ago

I was looking for a package that would enable me to automatically switch the dictionary of my spell-checker. In addition to this package, I also found auto-dictionary-mode.

I think it would be helpful to add a small section in the README to explain the most important differences between this package and similar ones. It seems to me like this package and auto-dictionary-mode are trying to achieve the same goal. I'm interested in knowing what I gain by using this package in particular (rather than auto-dictionary-mode). Although auto-dictionary-mode does not seem to be actively maintained, it is quite popular (based on its number of downloads). Also, can you say anything about the performance of these two packages?

Thanks in advance.

tmalsburg commented 7 years ago

In order to write such a comparison, I'd have to do some research on the alternatives and I don't have time for that. I tried a couple of solutions over the years and none of them worked well. Most are based on word lists, which is not a very clever approach because I can easily write a German text while avoiding most words on the list. To illustrate, auto-dictionary-mode uses a list of German words that includes words like "gemäß", which are very German but also quite rare. Checking for the presence of such a word is not going to help in 99.9% of cases. We could make the word lists really comprehensive, but that would make language detection rather expensive.

The algorithm I'm using is based on common letter trigrams, so it needs much less text to produce a good guess. In my experience, 30 characters are enough to correctly identify the language even when the languages are similar (e.g., Germanic or Romance languages). When the candidate languages are dissimilar, as few as three letters can be enough.
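To make the trigram idea concrete, here is a minimal Python sketch. It is not the code in guess-language.el: the trigram sets are a small hand-picked illustration, the names `TRIGRAMS`, `trigrams`, and `guess_language` are hypothetical, and extracting trigrams within each word (rather than across spaces) is an assumption.

```python
from collections import Counter

# Hypothetical, hand-picked trigram sets for illustration only;
# these are NOT the lists shipped with guess-language.el.
TRIGRAMS = {
    "en": {"the", "and", "ing", "ion", "ent", "tha", "her", "for"},
    "de": {"der", "ein", "sch", "ich", "die", "und", "che", "den"},
}

def trigrams(text):
    """Yield every three-letter window inside each word of the text."""
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - 2):
            yield letters[i:i + 3]

def guess_language(text, models=TRIGRAMS):
    """Return the language whose common trigrams occur most often, or None."""
    counts = Counter()
    for tri in trigrams(text):
        for lang, common in models.items():
            if tri in common:
                counts[lang] += 1
    if not counts:
        return None  # no known trigram seen, so no guess is made
    return counts.most_common(1)[0][0]
```

Calling `guess_language("Ich schreibe einen deutschen Text")` counts only German trigrams ("ich", "sch", "ein", "che") and none of the English ones, which is why a short sentence can already be enough to decide.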

Regarding performance, counting occurrences of a small set of trigrams in a paragraph is super cheap. No issues there. Plus, guess-language-mode only guesses the language when it’s necessary, i.e., when you type a word that your spell checker doesn’t recognize. In the best case, language detection does not run at all, namely when your spell checker is already set to the language you’re typing in. If it’s not, guess-language-mode will run once you hit the first word that is not recognized and then you are good until you type another word that isn't recognized. In contrast to that, auto-dictionary-mode seems to run every time you stop typing, whether it’s necessary or not.
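The "only guess when necessary" behaviour can be sketched in Python as follows. This is a toy model under stated assumptions, not how the Emacs mode is implemented: `LazyLanguageSwitcher`, `on_word_typed`, `DICTS`, and `toy_detect` are hypothetical names, and `toy_detect` is a deliberately simple stand-in for whatever detector is plugged in.

```python
class LazyLanguageSwitcher:
    """Toy model: re-guess the language only when a word fails spell check."""

    def __init__(self, dictionaries, detect, current):
        self.dictionaries = dictionaries  # language -> set of known words
        self.detect = detect              # callable: paragraph text -> language
        self.current = current            # active spell-check language
        self.detections = 0               # how often detection actually ran

    def on_word_typed(self, word, paragraph):
        if word.lower() in self.dictionaries[self.current]:
            return self.current           # word recognized: nothing to do
        self.detections += 1              # unknown word: guess once
        guess = self.detect(paragraph)
        if guess in self.dictionaries:
            self.current = guess
        return self.current


# Tiny hypothetical dictionaries, for illustration only.
DICTS = {
    "en": {"the", "house", "is", "big"},
    "de": {"das", "haus", "ist", "groß"},
}

def toy_detect(paragraph):
    """Stand-in detector: pick the language with the most dictionary hits."""
    words = paragraph.lower().split()
    return max(DICTS, key=lambda lang: sum(w in DICTS[lang] for w in words))
```

With `LazyLanguageSwitcher(DICTS, toy_detect, "en")`, typing recognized English words never triggers detection. The first unrecognized word triggers one guess, and if the dictionary switches, later words in the new language are recognized again without further detection runs.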

Another drawback of auto-dictionary-mode is that it’s not really automatic: I have to stop typing to trigger language detection, so I have to consciously do something. In contrast to that, guess-language-mode tries to be completely transparent. You set it up once, and then you can forget that it exists. That’s the idea anyway.

But as I said, I didn't have time to do research on the alternatives, plus I'm obviously biased. So, I suggest that you simply install guess-language-mode and see whether it works for you. Please send feedback if you encounter any issues. This is brand new code, and although it works really well for me, there are probably some corner cases that I didn't anticipate.

peterwvj commented 7 years ago

Thanks for the excellent explanation - this is exactly the kind of comparison I was hoping for. Apart from auto-dictionary-mode, I'm not sure there are any viable alternatives anyway. I'll leave it up to you to decide whether you want to add a similar description to the README. For what it's worth, I found your comparison very helpful.

For sure I'm going to try out this package. I'll let you know if I experience issues of any sort.

tmalsburg commented 7 years ago

Glad you found this useful, but I don't know auto-dictionary-mode nearly well enough to say with any certainty that my code is better.