innovationOUtside / nb_spellchecker

Simple tool to try to support spell checking of Jupyter notebooks
MIT License

Explore other spellcheckers #7

Open psychemedia opened 3 years ago

psychemedia commented 3 years ago

The pyspelling package provides an architecture for parsing and spellchecking a variety of document types (markdown, Python, HTML etc) and for filtering out different objects (eg URLs). But it's a bit tricky to get your head round.

Could we instead create our own pipeline, eg using something like spacy and the contextualSpellCheck pipeline element?

A simple learning task in spacy pipeline step creation might be to try to develop a simple duplicate word detector.
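As a starting point for the duplicate word detector, here is a minimal sketch of the detection logic in plain Python rather than as a spacy pipeline step. The function name `find_duplicate_words` and the word regex are assumptions made for illustration; the same comparison of adjacent tokens could later be wrapped in a custom pipeline component.

```python
import re

# Assumption: a "word" is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def find_duplicate_words(text):
    """Return (word, start_offset) for each word that immediately repeats the previous one."""
    duplicates = []
    previous = None
    for match in WORD_RE.finditer(text):
        if previous is not None and match.group().lower() == previous.group().lower():
            # Only flag the repeat if nothing but whitespace separates the two words,
            # so "that. That" across a sentence boundary is not reported.
            between = text[previous.end():match.start()]
            if between.strip() == "":
                duplicates.append((match.group(), match.start()))
        previous = match
    return duplicates


print(find_duplicate_words("This is is a test of the the detector."))
```

The comparison is case-insensitive, so "The the" at the start of a sentence would also be reported.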

For other possible spell checkers, see:

psychemedia commented 3 years ago

Working with spacy, following tokenisation, tokens are labelled with flags that identify them as punctuation, URLs etc., which we can then use to filter them out.

I wonder if we could also identify and tag things like Python package names, eg those that are enclosed in single backticks?
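One way to prototype the backtick idea before touching spacy is a plain regular expression pass that pulls out inline code spans and masks them in the prose. This is only a sketch: `split_inline_code` is a hypothetical helper, and it assumes single-backtick spans that do not cross line breaks.

```python
import re

# Assumption: inline code is delimited by single backticks on one line.
INLINE_CODE_RE = re.compile(r"`([^`\n]+)`")


def split_inline_code(text):
    """Return (prose, code_spans).

    prose is the input with each inline code span replaced by spaces of the
    same length, so character offsets into the original text stay valid.
    code_spans lists the contents of the backticked spans in order.
    """
    code_spans = [m.group(1) for m in INLINE_CODE_RE.finditer(text)]
    prose = INLINE_CODE_RE.sub(lambda m: " " * len(m.group(0)), text)
    return prose, code_spans


prose, spans = split_inline_code("Install `pyspelling` and `spacy` first.")
print(spans)
print(prose)
```

The masked prose can then be passed to a spellchecker, while the extracted spans could be checked against a separate list of known package names.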

psychemedia commented 3 years ago

codespell looks interesting - it provides an "obvious" way of implementing your own rules and lookups and seems to run okay against md cells... [<- not sure what I meant by that? I can run it from the command line, but can we call it from py too?] It would be interesting to try to tweak this to run different rule sets for md cells, code cells and fenced code blocks inside markdown.
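On the question of calling codespell from Python: one low-tech option is to shell out with `subprocess` and parse what it prints. The sketch below assumes codespell is on the PATH and that each finding is reported as `path:line: typo ==> suggestion`; the helper names `run_codespell` and `parse_codespell_output` are made up for illustration.

```python
import re
import subprocess

# Assumed report format: "path:line: typo ==> fix1, fix2"
LINE_RE = re.compile(r"^(?P<path>.+?):(?P<line>\d+): (?P<typo>\S+) ==> (?P<fix>.+)$")


def parse_codespell_output(output):
    """Turn codespell's text report into a list of dicts."""
    findings = []
    for line in output.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            findings.append({
                "path": m["path"],
                "line": int(m["line"]),
                "typo": m["typo"],
                "suggestions": [s.strip() for s in m["fix"].split(",") if s.strip()],
            })
    return findings


def run_codespell(paths):
    """Run the codespell CLI over the given paths and return parsed findings."""
    # No check=True: a non-zero exit code is expected when typos are found.
    result = subprocess.run(["codespell", *paths], capture_output=True, text=True)
    return parse_codespell_output(result.stdout)


sample = "notes.md:3: teh ==> the\nnotes.md:9: abd ==> and, bad"
print(parse_codespell_output(sample))
```

Something like `run_codespell(["notes.md"])` would then give structured results that could be filtered or reported per cell.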

codespell maintains a big list of common typos and matches against those, making fix suggestions. This means it only finds typos it knows about - but if it's easy enough to add additional terms, we can start to build our own additional lookups and single-word style rule fixes.
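To make the lookup idea concrete, here is a small pure-Python sketch of matching text against a list of known typos. It is not codespell's implementation; it just assumes rules are written one per line as `typo->fix` (similar in spirit to codespell's dictionary files), and the names `load_rules`, `check_text` and `EXTRA_RULES` are hypothetical.

```python
import re

# Hypothetical extra rules, one "typo->fix[, fix...]" entry per line.
EXTRA_RULES = [
    "pyhton->python",
    "notebok->notebook",
    "teh->the",
]


def load_rules(lines):
    """Parse 'typo->fix1, fix2' lines into a {typo: [fixes]} lookup."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or "->" not in line:
            continue
        typo, fixes = line.split("->", 1)
        rules[typo.strip().lower()] = [f.strip() for f in fixes.split(",") if f.strip()]
    return rules


def check_text(text, rules):
    """Return (word, offset, suggestions) for every word found in the lookup."""
    findings = []
    for m in re.finditer(r"[A-Za-z']+", text):
        fixes = rules.get(m.group().lower())
        if fixes:
            findings.append((m.group(), m.start(), fixes))
    return findings


rules = load_rules(EXTRA_RULES)
print(check_text("A pyhton notebok", rules))
```

As with codespell itself, this only catches typos that are in the list, so its usefulness depends on how easily the list can be extended.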

Simple command line usage: codespell ./*/*.ipynb > codespell.txt

This seems to run really quickly.
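Running over the raw .ipynb files treats the notebook JSON as one text file. To apply different rule sets to markdown and code cells, as suggested above, the cells would first need separating by type. A minimal sketch using only the `json` module, assuming the usual notebook layout of a top-level `cells` list whose items have `cell_type` and `source` fields (`split_cells` is a hypothetical helper):

```python
import json


def split_cells(notebook_json):
    """Group a notebook's cell sources by cell type ('markdown' or 'code')."""
    nb = json.loads(notebook_json)
    texts = {"markdown": [], "code": []}
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # source may be stored as a list of lines or as a single string.
        if isinstance(source, list):
            source = "".join(source)
        cell_type = cell.get("cell_type")
        if cell_type in texts:
            texts[cell_type].append(source)
    return texts


example = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title\n", "Some teh text"]},
        {"cell_type": "code", "source": "import pandas as pd"},
    ]
})
print(split_cells(example))
```

Each group could then be written to a temporary file, or passed to a checker directly, with its own rules.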

psychemedia commented 3 years ago

This could be useful, particularly if we can find a way to add additional rules: languagetool (a Java application) and a py wrapper for it, jxmorris12/language_tool_python.

# jxmorris12/language_tool_python
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')
text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
matches = tool.check(text)

# Show each issue that was found
for match in matches:
    print(match)

Findus23/pyLanguagetool is a Python wrapper for the languagetool JSON API (docs) that could be used to check individual cells (or to spellcheck an .md version of a notebook created via jupytext etc). Note that it does not seem to bundle a languagetool server.
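For checking individual cells against a languagetool server directly, a sketch using only the standard library might look like the following. It assumes a server exposing the `/v2/check` endpoint that accepts `text` and `language` form fields and returns a JSON object with a `matches` list (each with `offset`, `length`, `message` and `replacements`); `check_with_server`, `summarise_matches` and the server URL are placeholders, not part of either wrapper.

```python
import json
import urllib.parse
import urllib.request


def check_with_server(text, server_url, language="en-US"):
    """POST text to a languagetool server and return the decoded JSON response."""
    data = urllib.parse.urlencode({"text": text, "language": language}).encode("utf-8")
    with urllib.request.urlopen(f"{server_url}/v2/check", data=data) as response:
        return json.load(response)


def summarise_matches(text, response):
    """Reduce a response to the flagged text, message and suggestions."""
    summary = []
    for match in response.get("matches", []):
        start = match["offset"]
        end = start + match["length"]
        summary.append({
            "text": text[start:end],
            "message": match.get("message", ""),
            "suggestions": [r["value"] for r in match.get("replacements", [])],
        })
    return summary


# Hand-written response in the assumed shape, so the example runs without a server.
cell_text = "A sentence with a error"
fake_response = {"matches": [{"offset": 16, "length": 1,
                              "message": "Use 'an' before a vowel sound.",
                              "replacements": [{"value": "an"}]}]}
print(summarise_matches(cell_text, fake_response))
```

With a local server running, `summarise_matches(cell_text, check_with_server(cell_text, "http://localhost:8081"))` would be the equivalent live call; the port here is only an example.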

See also Code Fragment: Highlighting Typos

psychemedia commented 3 years ago

pylint has a spelling checker that I think checks things like comments? docs

eg something like pylint --disable all --enable spelling --spelling-dict en_US test.py

This might also need enchant and pyenchant?