microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Offline spell checker for VSCode #20266

Open bartosz-antosik opened 7 years ago

bartosz-antosik commented 7 years ago

Hello (first time contributing here)

There are a few offline spell checkers among VSCode extensions, but they are based on seriously faulty JavaScript implementations of the Hunspell spell checker.

Hunspell is nowadays probably the most widespread standard for the spell checking layer. It is used on macOS, Linux and in some software (e.g. LibreOffice) on Windows. It is also used by both Atom and Sublime Text. There is an enormous collection of polished dictionaries for Hunspell.

There exist some JavaScript implementations that refer to Hunspell's name but in fact do not implement its critical functionality: the lexical parser. I have verified these three:

hunspell-spellchecker, Typo.js, nspell

All three work more or less by following the simple idea of loading the dictionary into memory (into an associative table, a.k.a. a dictionary; an object, to be precise). They use Hunspell's affixes (.aff file) to create ALL variants of the words found in the dictionary (.dic file) and then store them in memory. When checking spelling, the dictionary is simply asked whether the word exists or not. Simple, but it has these implications:

  1. Loading takes a lot of time;
  2. It takes a lot of memory too;
  3. Memory consumption causes them to crash with dictionaries that have a more extensive affix system (two out of the three mentioned; the third does not consume all of the affixes).

For example, when running hunspell-spellchecker (there is a SpellChecker extension based on it) with the English dictionary ("en_US", 62K+ words in the dictionary), memory consumption peaks at 500 MB and stays constantly above 250 MB. It crashes with the Polish dictionary ("pl_PL", 300K+ words in the dictionary) after reaching about 1.5 GB of memory consumed (there are reports of other dictionaries doing the same), with a "JavaScript heap out of memory" message hidden well under the hood. Hunspell has a lexical parser which allows it to use these two sets (dictionary and affixes) "on the fly", without the need to merge them and thus explode memory consumption and load time.
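
A minimal TypeScript sketch of the difference between the two strategies. The rule shape here (a flag, a string to strip, a string to add) is a simplified, hypothetical stand-in for Hunspell's suffix rules; it is not the real .aff format and not the code of any of the libraries named above.

```typescript
// Hypothetical, simplified suffix rule: remove `strip` from the end of the stem, then append `add`.
interface SuffixRule { flag: string; strip: string; add: string; }
// A dictionary entry: a stem plus the flags of the rules that may apply to it.
interface Entry { stem: string; flags: string; }

const rules: SuffixRule[] = [
  { flag: "S", strip: "", add: "s" },
  { flag: "D", strip: "e", add: "ed" },
  { flag: "G", strip: "e", add: "ing" },
];
const entries: Entry[] = [
  { stem: "parse", flags: "SDG" },
  { stem: "word", flags: "S" },
];

// Strategy 1: expand every stem with every applicable rule and keep all forms in memory.
function expandAll(entries: Entry[], rules: SuffixRule[]): Set<string> {
  const all = new Set<string>();
  for (const e of entries) {
    all.add(e.stem);
    for (const r of rules) {
      if (e.flags.indexOf(r.flag) >= 0 && e.stem.endsWith(r.strip)) {
        all.add(e.stem.slice(0, e.stem.length - r.strip.length) + r.add);
      }
    }
  }
  return all;
}

// Strategy 2: keep only the stems and undo each suffix rule on the queried word at lookup time.
function checkOnTheFly(word: string, stems: Map<string, string>, rules: SuffixRule[]): boolean {
  if (stems.has(word)) return true;
  for (const r of rules) {
    if (!word.endsWith(r.add)) continue;
    const stem = word.slice(0, word.length - r.add.length) + r.strip;
    const flags = stems.get(stem);
    if (flags !== undefined && flags.indexOf(r.flag) >= 0) return true;
  }
  return false;
}

const expanded = expandAll(entries, rules);
const stems = new Map<string, string>();
for (const e of entries) stems.set(e.stem, e.flags);
```

With strategy 1 the stored set grows with the number of stems times the number of applicable rules; with strategy 2 only the stems and the rules stay in memory, at the cost of a little work per lookup.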

There is a good spell checker component for node.js, which is actually a set of bindings for the native spell checkers on macOS (NSSpellChecker), Linux (Hunspell) and Windows (Spell Check API in Windows 8+, Hunspell in earlier versions):

https://github.com/atom/node-spellchecker

It is alas a native module.

I have built a spell checker using this module. I will rather not publish it because it is quite pointless:

So I would like you to consider doing something about it.

There are a few paths I can imagine; among them, two are most obvious:

  1. Build the node-spellchecker module along with the VSCode and make it available among "standard" modules that extension developers can count upon (this could result in more than one spell checker extension e.g. for spelling text or latex documents, comments in code etc.);
  2. Provide a way to use native modules among extensions' dependencies.

I am most probably no one to discuss pros or cons of these alternatives, there are maybe other alternatives that I cannot see, but I think that with the evidence provided it is clear that unless something changes the answer to the question in the title is MOST PROBABLY NOT!

Tyriar commented 7 years ago

I'd personally like to see spell checking using native platform API bundled eventually.

bartosz-antosik commented 7 years ago

Do I get it correctly that by native you mean implemented in JavaScript/TypeScript?

I understand your point of view, I would probably prefer it too.

But please have a look at the Hunspell GitHub page and consider how many years of active development it took to get it where it is now. I doubt that someone could just "rewrite" it.

Anyway, for the time being existing spellcheckers are problematic and it may push away e.g. people who use VSCode to write technical docs, latex papers etc.

Having node-spellchecker accessible with compatible binaries, which can only realistically be achieved by bundling it with VSCode, would cure not even most but all of the above mentioned issues. Later when native API with comparable quality appears you can switch and it will not generate a lot of trouble for extensions developers because it is only a handful of calls.

Tyriar commented 7 years ago

I mean native as in platform; talking to macOS, Windows, Linux services if available, and falling back to another implementation if not.

bartosz-antosik commented 7 years ago

That's EXACTLY what node-spellchecker module does!

bartosz-antosik commented 7 years ago

Somehow I got the impression that you tend to have everything in pure JavaScript and that native module support is not going to appear anytime soon. Sorry about the misinterpretation.

Tyriar commented 7 years ago

Badly worded on my part :smile:

rebornix commented 7 years ago

First of all, sincere thanks to @bartosz-antosik , your work is really thorough and an awesome guide to spell checking. I spent some time investigating this feature this iteration and here are my thoughts and todo items.

How we ship it

Firstly, the spell checking process should work in a separate process, without blocking the core or extension host pipeline.

Secondly, there are two ways to talk to native code: a node native module or a standalone script with an interactive console. The former is easy to do; the only catch is that you have to recompile every time the node/v8 version changes. To fix that we just need to put the extension into Code's folder, either in core or in our builtin extension folder. The benefit is obvious: we don't need to talk to C++ code painfully and we always stay inside NodeJS. But there are several issues that should be taken care of before we do that:

  1. Since we need to bundle Hunspell into Code, how does it affect our installer size/build time?
  2. Right now node-spellchecker's api is not even async.
  3. It should still be executed in a separate process. There are quite a few perf issues on Atom/node-spellchecker side.

The second way to solve this problem is running a standalone interactive script, which talks to system API, compiled in different architecture/platform. Then our NodeJS code, either Core or an extension can talk to it through standard IO or even better Socket. The script will be running in a new NodeJS process and we can easily make all the spell checking async.
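
A rough TypeScript sketch of what such a line-based exchange could look like. The message shape (one JSON object per line, matched by an `id`) and the names `handleLine` and `SpellClient` are assumptions made for illustration; nothing here is an existing VS Code or node-spellchecker API, and the process spawning and stream wiring are left out.

```typescript
// Hypothetical wire format: one JSON object per line in each direction.
interface CheckRequest { id: number; words: string[]; }
interface CheckResponse { id: number; misspelled: string[]; }

// Stand-in for the platform spell checker that the helper process would wrap.
type IsCorrect = (word: string) => boolean;

// Helper-process side: turn one request line into one response line.
function handleLine(line: string, isCorrect: IsCorrect): string {
  const req = JSON.parse(line) as CheckRequest;
  const res: CheckResponse = {
    id: req.id,
    misspelled: req.words.filter((w) => !isCorrect(w)),
  };
  return JSON.stringify(res);
}

// Client side: match responses to pending requests by id, so callers never block.
class SpellClient {
  private nextId = 1;
  private pending = new Map<number, (misspelled: string[]) => void>();
  private send: (line: string) => void;

  constructor(send: (line: string) => void) {
    this.send = send;
  }

  check(words: string[], done: (misspelled: string[]) => void): void {
    const id = this.nextId++;
    this.pending.set(id, done);
    this.send(JSON.stringify({ id, words }));
  }

  // Call this for every line read from the helper's stdout (or socket).
  receive(line: string): void {
    const res = JSON.parse(line) as CheckResponse;
    const done = this.pending.get(res.id);
    if (done) {
      this.pending.delete(res.id);
      done(res.misspelled);
    }
  }
}
```

Sending a batch of words per request, rather than one word per message, keeps the number of round trips between the two processes low.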

I started with the second solution. Even if this problem is fixed perfectly, we still get quite a few issues around the experience and maturity of spell checking on different platforms, including but not limited to:


Spell Check API

On macOS and Windows (8 and above), the system provides builtin spell check support. Their behaviors vary, but they both support the following common functionalities

In addition to above features, macOS, Hunspell and Windows disagree with each other on several APIs:

Ignore word

Conclusion: On Windows and Hunspell, ignore words temporarily and each time we initialize a spell check process, set the ignore list on the fly. As on macOS you can always remove words from the dictionary, let's trust it.


Builtin language support

What's the experience of setting up dictionaries for another language which has no builtin support?

How to spell check text which contains multiple languages, automatically?

System already has some native support, but they behave differently and the experience is not charming.


Dictionaries

Both Chrome and Firefox ship with en-US dictionary (for English users). Chrome will download any dictionary users require ( see https://cs.chromium.org/chromium/src/chrome/browser/spellchecker/spellcheck_hunspell_dictionary.cc?dr=C&q=chrome/dict&l=238 ), and Firefox fetches dictionaries from https://dxr.mozilla.org/mozilla-central/source/browser/app/profile/firefox.js#77.

Conclusion: Ship with en-US (because most of the time you are coding in English) and maybe ship with one user preferred language (for example, maybe one day users can get a Chinese version of VS Code directly and it has a Chinese dictionary builtin). For other requests, provide a stable, highly available dictionary download service. Atom now downloads dictionaries from Google's service (which is used by Chrome), however that service is not available in some countries and regions.

Exception list/known Words


Spell Checker

We can ship Hunspell in all platforms and users can choose to use Hunspell or not.


Settings

Open questions about how we define the settings for spell checking.

bartosz-antosik commented 7 years ago

Thanks @rebornix for kind words & analysis which I like a lot.

I would like to refer to few points of your analysis as it looks like maybe I do not understand one or more things.

Excuse me if I am very off at points but I know very little about node.js and the whole environment.

Synchronous/Asynchronous Interface

About this sync/async interface: are events (e.g. onDidOpenTextDocument, onDidChangeTextDocument, onDidChangeVisibleTextEditors) asynchronous or not?

If they are, then why bother whether node-spellchecker's interface is or is not?

If they are not then not only spell checking engine should be asynchronous but all the extension code that reacts to events to parse text & select parts to spell that calls the engine should be too, should it not?

What takes time in spelling is parsing a document, possibly large, and eliminating parts that should not be spelled (suppose latex commands or parts of code that should be skipped to spell comments & strings etc.). And I reckon it should be left up to the extension, not the speller, to decide what to do with a particular document type.

There is one more thing to consider here: Word lookup is quick. Suggestions are slow.

The spell checkers that I have used are quick to look a word up to test whether it is spelled correctly and slow (over 10 times slower on average) to produce suggestions. The current approach, e.g. in my spell checker extension, is to spell check and feed the diagnostic collection with suggestions; there is also an option to just signal misspelled words and look up suggestions on the provideCodeActions event.

So do I understand correctly that either all parts of the process should be async or it does not matter much whether spelling engine is?
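
The split between fast lookup and slow suggestions could be sketched roughly as follows in TypeScript. The `Engine` interface and the `LazySpeller` class are hypothetical names used only for illustration; a real extension would call something like `findMisspellings` from its document change handler and `suggestionsFor` from its code action provider.

```typescript
// Hypothetical engine interface; a real one would wrap Hunspell or a system API.
interface Engine {
  isCorrect(word: string): boolean; // fast
  suggest(word: string): string[];  // slow
}

interface Misspelling { word: string; start: number; end: number; }

class LazySpeller {
  private cache = new Map<string, string[]>();
  private engine: Engine;

  constructor(engine: Engine) {
    this.engine = engine;
  }

  // Cheap pass, run on every document change: lookups only, no suggestions.
  findMisspellings(text: string): Misspelling[] {
    const found: Misspelling[] = [];
    const wordPattern = /[A-Za-z']+/g;
    let m: RegExpExecArray | null;
    while ((m = wordPattern.exec(text)) !== null) {
      if (!this.engine.isCorrect(m[0])) {
        found.push({ word: m[0], start: m.index, end: m.index + m[0].length });
      }
    }
    return found;
  }

  // Expensive pass, run only when the user asks for fixes; results are cached per word.
  suggestionsFor(word: string): string[] {
    let s = this.cache.get(word);
    if (s === undefined) {
      s = this.engine.suggest(word);
      this.cache.set(word, s);
    }
    return s;
  }
}
```

The word pattern here is deliberately naive (ASCII letters and apostrophes); choosing what counts as a word is exactly the document-type-specific part that would stay in the extension.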

Ignoring Words

About custom/known/ignored words: I would consider offloading this to the extension! Don't know about the rest of the world but I would love them to be manageable like the rest of VSCode's configuration. All three (MS/iOS/hunspell) place them no one knows where, and it is an additional pain to transfer them to another location or manage them in the context of the document type.

Language Scope in a Document

I like the idea of multiple languages inside one document a lot. It seemed crazy to me at first, but the more I think about it, the more doable it seems. The only way I can think of, though, is content/comment driven language switching. Again, the extension should decide about this, as this information can be, for instance, extracted from a latex document in quite a different way than from another document type.

rebornix commented 7 years ago

@bartosz-antosik thanks for your reply. About the async/sync problem, I'm referring to function calls to native code; they are not async right now. But it's not a problem, as in nodejs we can always use setTimeout or similar to mitigate it. Not a big deal.

Word Lookup/Suggestions

I like your idea of separating word lookup from suggestion generation, and thanks again for your perf testing. Postponing suggestion lookup to the code action provider makes sure we only do minimal calculation. And you are right, this can be an option, as the only catch of this feature is that users can't get a general view of spelling suggestions in the Problems View.

Another thing about perf is where to do the calculation, doing all the math in native code can be faster but sending a large portion of data to native code can cost time as well. We need good testing to find the balance.

Ignoring words

System Spell Checker stores the ignored words on the fly and yes we'll hide them from users.

Multi language

macOS has in-house language detection which works reasonably well for me, but Windows doesn't. Comments, strings and technical documents are the most likely cases where users may need multi-language support. We can either switch languages automatically, or maybe even spawn multiple spell check processes for different languages.

Jason3S commented 7 years ago

Hello

I'm the author of Code Spell Checker extension and cspell linter (used by the extension).

Why

I did not intend to write a spell checker. I wrote it because I needed one that worked with source code and didn't find a built in checker. So the fact that you are considering having a spell checker built in is wonderful. It would have saved me a bunch of effort. :-)

To be honest, it was a fun exercise. It needed to load fast and execute fast. It needed to limit memory consumption and work with very large dictionaries. Spelling suggestions needed to be quick and applicable. Importantly, I wanted it to run on all platforms. I was able to achieve all of these things.

How it works

I did not choose any of the Hunspell solutions due to speed and memory concerns. The Hunspell format is designed for compact representation of words with common prefix and suffix patterns. The Hunspell .dic and .aff are deliberately easy for adding words by hand. The format is not designed for easy lookup or searching. Which is why the open source javascript solutions are very slow and use a lot of memory.

Instead I wrote a hunspell file reader that would output all the word combinations. This list of words is compiled into a compact format designed for lookup speed and calculating suggestions. At its core is a Trie which is optimized into a Deterministic Acyclic Finite State Automaton.

This process of compiling is rather expensive, which is why it is done offline and only the compiled dictionaries are shipped with the extension.

Word Lookup and Suggestions

Word lookup is O(m) where m is the length of the word. It is a very simple process of walking the Trie. Suggestions are done using a modified Levenshtein algorithm that minimizes recalculation and culls candidates by not walking down branches in the Trie whose minimum possible error is greater than the allowed error threshold.
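
A simplified TypeScript sketch of this kind of lookup and pruned suggestion search. It uses a plain trie (not minimized into a DAFSA) and the textbook Levenshtein recurrence, carrying one row of the distance table per trie node; it is an illustration of the general technique and not the code of cspell or of the extension.

```typescript
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

class Trie {
  root = new TrieNode();

  insert(word: string): void {
    let node = this.root;
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }

  // O(m) in the length of the word: one step down the trie per character.
  has(word: string): boolean {
    let node: TrieNode | undefined = this.root;
    for (let i = 0; i < word.length; i++) {
      node = node.children.get(word.charAt(i));
      if (!node) return false;
    }
    return node.isWord;
  }

  // All stored words within `maxDist` edits of `word`.
  suggest(word: string, maxDist: number): string[] {
    const results: string[] = [];
    const firstRow: number[] = [];
    for (let i = 0; i <= word.length; i++) firstRow.push(i);

    const walk = (node: TrieNode, prefix: string, prevRow: number[]): void => {
      node.children.forEach((child, ch) => {
        // Extend the distance table by one row for the prefix `prefix + ch`.
        const row = [prevRow[0] + 1];
        for (let i = 1; i <= word.length; i++) {
          const cost = word.charAt(i - 1) === ch ? 0 : 1;
          row.push(Math.min(row[i - 1] + 1, prevRow[i] + 1, prevRow[i - 1] + cost));
        }
        if (child.isWord && row[word.length] <= maxDist) results.push(prefix + ch);
        // Prune: if every cell already exceeds the threshold, no descendant can recover.
        if (Math.min(...row) <= maxDist) walk(child, prefix + ch, row);
      });
    };

    walk(this.root, "", firstRow);
    return results;
  }
}
```

Because words that share a prefix share the rows computed for that prefix, the table is never recalculated from scratch per candidate, and whole branches are skipped once their minimum possible error exceeds the threshold.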

Things to consider

Most of the work was not writing the spell checker. Checking words and making spelling suggestions is rather easy. Most of the work came from the configuration options. Where possible, the system is configuration driven.

Each programming language has its own combination of dictionaries and settings. In the linter fashion, the spell checker also allows for in code flags and settings.

Programming Language Dictionaries

I ended up creating dictionaries that included keywords and common symbols for several programming languages. These dictionaries can be combined based upon the context.

For example a .cpp file will use the following dictionaries: cpp, companies, softwareTerms, misc, filetypes, and wordsEn.

As you can see, I even needed a dictionary for common software terms, because standard Hunspell dictionaries do not include most software terms.
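
As a small illustration, the per-language combination could be expressed as a lookup table in TypeScript. The `cpp` entry uses the dictionary names listed above; the `typescript` entry, the fallback, and the function name are hypothetical and do not reflect the extension's actual configuration format.

```typescript
// Dictionaries shared by the programming-language contexts (names from the .cpp example above).
const sharedDictionaries = ["companies", "softwareTerms", "misc", "filetypes", "wordsEn"];

const dictionariesByLanguage: Record<string, string[]> = {
  cpp: ["cpp", ...sharedDictionaries],
  typescript: ["typescript", ...sharedDictionaries], // hypothetical entry
};

// Fall back to the plain English word list for unknown file types (an assumption).
function dictionariesFor(languageId: string): string[] {
  const configured = dictionariesByLanguage[languageId];
  return configured !== undefined ? configured : ["wordsEn"];
}
```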

Programming Language Grammar awareness

I did not make my spell checker aware of the programming language grammar or syntax. There are some really cool things that are possible. Like having strings be in French while the code is in English and the comments are in Spanish. Other things like not spell checking 3rd party imports. Yet, I found this more work than I had time to spend.

As an extension writer, I was wishing for access to the language grammar used by the colorizers.

Linter Style

I think it is worth noting that a spell checker is usable in a Continuous Integration environment. Think of it this way: anyplace you might want to use tslint, a spell checker might be useful.

Jason3S commented 7 years ago

Questions

  1. How do you plan on parsing the code to send it to the spell checker? Spell checkers do not like camelCase or snake_case.
  2. How do you plan on solving the multi language issue? Where the code and comments are in English while the strings are in Spanish?
  3. What is the plan for project and user level word lists?
  4. If a user adds their own words to the dictionary, will they be included in the suggestions?
  5. Reading the discussion, it looks like the plan is to call the spell checker one word at a time. Won't that be very slow?
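
To illustrate question 1, one possible way to break identifiers into checkable words before they reach a spell checker is sketched below in TypeScript. The function name and the exact splitting rules are assumptions for illustration, not what any of the extensions discussed here do.

```typescript
// Split camelCase, PascalCase and snake_case identifiers into lowercase words.
function splitIdentifier(identifier: string): string[] {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")     // fooBar -> foo Bar
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2")  // HTMLParser -> HTML Parser
    .split(/[^A-Za-z]+/)                         // break on underscores, digits and spaces
    .filter((part) => part.length > 0)
    .map((part) => part.toLowerCase());
}
```

Each resulting part can then be checked on its own, and the whole batch for a document can be sent to the checker at once rather than one word per call.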

JM-Mendez commented 7 years ago

@rebornix sorry to jump into this old convo, but node-spellchecker has the following undocumented features. I haven't used this module yet, so I don't know if it's stable. But I'm assuming their documentation just isn't updated because the remove api has been there since Mar '16 according to the commit history.

undocumented features:

https://github.com/atom/node-spellchecker/blob/master/lib/spellchecker.js

bartosz-antosik commented 7 years ago

@rebornix As we are reviving this long dead (and, as it seems, not very important) thread, I could also drop an update: my extension mentioned in the first post, Spell Right, based on carefully used node-spellchecker, went multiplatform some months ago and it seems to be working well for people, now on Linux and macOS too. Both on regular and insiders builds.

matklad commented 5 years ago

How do you plan on parsing the code to send it to the spell checker? Spell checkers do not like camelCase or snake_case.

Note that this is programming-language dependent, and, for this reason, it makes sense to make the spellchecker itself part of the platform, and expose language-dependent parts via LSP. Here's a list of things which could be handled by a language server but can't be reasonably handled by a spell checker extension alone:

For markup languages, dealing with sub-word markup. For example, in asciidoctor I can write **A**plicaton to make the first letter bold, and it'd be cool if the spellchecker saw this as an error.

For all languages, the language server needs to unescape string literals and strip // from comments.

For all languages, there should be a language-specific built-in dictionary.

For statically typed languages, spell checking should be done only for definitions, and not for references: catching misspellings in the references is the job of compiler and code completion.

FDiskas commented 5 years ago

I'm sorry - but why not use Chrome's internal spell checker? There is a good library to help implement that: https://www.npmjs.com/package/electron-spellchecker

elcste commented 4 years ago

Electron 8 includes support for the built-in Chromium spellchecker. Maybe now this feature would be easier?

borekb commented 4 years ago

This looks like a primary issue for built-in spell checking in VSCode so if it's going to happen with the new Electron 8.0 capabilities, I'd like to add a few notes:

oschulz commented 4 years ago

I guess the improved spell-checking capabilities of Electron v9.0 would be an ideal basis for VS-Code built-in spell-checking? I would love to have that - haven't found a reliable spell-checking extension yet that works under VS-code remote development.

alanlivio commented 3 years ago

Microsoft also has "Microsoft Editor Service", which works for both browser and desktop. Is there any way to use it in vscode?

Lemmingh commented 2 years ago

The discussion above about how to ship a spell checker appears not concluded. What about WASM? All major engines have been supporting WASM since 2017 according to the MDN compatibility data.

Someone has successfully compiled Hunspell as WASM: https://github.com/kwonoj/hunspell-asm . The Base64-encoded WASM binary of Hunspell is only about 780 kB, so there should be little difficulty in bundling.

Talia-K-Loos commented 2 years ago

+1

I came here to say this. Just a selectable spelling dictionary would do for me, even.

I'd use it for text files, markdown files, and most especially for files that are of the "git commit" language type.

Pindar777 commented 2 years ago

Interesting discussion! I'm fond of https://marketplace.visualstudio.com/items?itemName=valentjn.vscode-ltex It is very helpful but takes a huge amount of storage.

AshleyT3 commented 12 months ago

A mild +1 for at least rudimentary VSCode spellcheck out of the box if it seems reasonable given overall user asks. An office-like app has great spellcheck but won't start due to license check requirements if it has been offline for a long time. I prefer simple text files to avoid heavy client issues like that. VSCode supports this but without spellchecking out of the box. For certain note-taking cases, I look elsewhere... or perhaps copy/paste to office app w/spellcheck next chance. While +1 on this, it is not a push as though I'm waiting with anticipation for this... VSCode gets tons of usage in so many areas... I'd hardly complain about where it is at today... so a mild +1 if there happens to be tons of others who +1 and it makes overall sense. Hope this helps, thanks.