robotdana / spellr

Spell check your source code
MIT License

Ignore tokens matching character ranges #75

Open voidless opened 3 years ago

voidless commented 3 years ago

Hi! Is it possible to add an option to ignore tokens by character range? If a whole token matches one ignored character set, it would be skipped. This would still catch mixed scripts within a single word, but would ignore languages written in a different character set.

We (unfortunately) write some comments and strings in Russian, and it triggers a Spellr warning almost every time. Simple dictionary checking doesn't work well with languages that have many grammatical cases (e.g. Russian, Hindi), because you have to add every case form of each word to validate properly, and I was unable to find such dictionaries.

voidless commented 3 years ago

I've found a Russian dictionary that includes case forms (35MB). It will work for our use case.

robotdana commented 3 years ago

hi, did the dictionary you found solve your problem? is it a public dictionary that i could link to in the documentation for others? how is the performance of spellr with a 35MB wordlist?

robotdana commented 3 years ago

the idea of ignoring character ranges is interesting though, i'll look into that, because it's already a problem for chinese and other scripts that don't really use word breaks. it should be doable in the regex with [[:alpha:]](?<!\p{Cyrillic}) or similar. i'll have a think about how to get that from the config to the regexes.
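A quick way to see what that pattern does, as a sketch rather than Spellr's actual tokenizer: the lookbehind runs right after `[[:alpha:]]` consumes a character, so it rejects the match when that character is Cyrillic. The constant name `NON_CYRILLIC_WORD` is made up for this example.

```ruby
# Sketch only; Spellr's real token regexes are not shown in this thread.
# [[:alpha:]] matches any letter, and the negative lookbehind then
# rejects the match if the letter just consumed is Cyrillic.
NON_CYRILLIC_WORD = /(?:[[:alpha:]](?<!\p{Cyrillic}))+/

line = "fix опечатка typo"
p line.scan(NON_CYRILLIC_WORD) # => ["fix", "typo"]
```

Note that this differs from the whole-token rule requested above: on a mixed-script word the pattern drops the Cyrillic letters and yields the remaining fragments, so mixed words would need separate handling.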

voidless commented 3 years ago

I used the dictionary from this repo: https://github.com/danakt/russian-words

The 35MB size is for the Unicode version; the original file, in cp1251 encoding, was half that size.
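The size difference follows from the encodings, assuming "Unicode" here means UTF-8 (the thread doesn't say which Unicode encoding was used): cp1251 stores each Cyrillic letter in one byte, while UTF-8 uses two. A small Ruby check, with `bytes_in` as a hypothetical helper:

```ruby
# Hypothetical helper: byte length of a word in a given encoding.
def bytes_in(word, encoding)
  word.encode(encoding).bytesize
end

word = "слово" # 5 Cyrillic letters

puts bytes_in(word, "Windows-1251") # 5 bytes, one per letter
puts bytes_in(word, "UTF-8")        # 10 bytes, two per letter
```

Line breaks take one byte in both encodings, so a whole wordlist grows by slightly less than a factor of two.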

Spellr completes in around 4 seconds for 650k lines of code on my 6-core MacBook.

We are very happy with the results: we now spend less time on trivial errors during code review. We even found a few errors in our localization files.