psychemedia opened 3 years ago
Working with spacy, following tokenisation, tokens are labelled with flags that identify them as punctuation, URLs etc., which we can then filter out.
I wonder if we could also identify and tag things like Python package names, e.g. ones that are enclosed in single backticks?
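A minimal sketch of that idea (my own, not from any of the packages below): a custom spacy pipeline component that flags tokens wrapped in single backticks, assuming the tokenizer splits the backticks off as separate tokens.

```python
# Sketch: flag tokens enclosed in single backticks as code identifiers,
# assuming we run spacy over the raw markdown text.
import spacy
from spacy.language import Language
from spacy.tokens import Token

Token.set_extension("is_backtick_code", default=False)

@Language.component("backtick_flagger")
def backtick_flagger(doc):
    for i in range(1, len(doc) - 1):
        # A token counts as "code" if its immediate neighbours are backticks
        if doc[i - 1].text == "`" and doc[i + 1].text == "`":
            doc[i]._.is_backtick_code = True
    return doc

nlp = spacy.blank("en")  # tokenizer only; no model download needed
nlp.add_pipe("backtick_flagger")

doc = nlp("Install the `codespell` package first.")
print([t.text for t in doc if t._.is_backtick_code])  # ['codespell']
```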
codespell looks interesting - it provides an "obvious" way of implementing your own rules and lookups and seems to run okay against md cells... [<- not sure what I meant by that? I can run it from the command line, but can we call it from Python too? See the sketch below.] It would be interesting to try to tweak this to run different rule sets for md cells, code cells and markdown code-fence blocks?
codespell maintains a big list of common typos and matches against those, making fix suggestions. This means it only finds typos it knows about - but if it's easy enough to add additional terms, we can start to build our own additional lookups and single-word style-rule fixes.
Simple command line usage: `codespell ./*/*.ipynb > codespell.txt`
This seems to run really quickly.
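On the "can we call it from Python too?" question: codespell is itself a Python package, but rather than guess at its internal API, the safe route is just to shell out to the CLI, e.g.:

```python
# Sketch: run codespell over a directory of notebooks from Python.
# Assumes codespell is installed and on the PATH; "notebooks/" is a
# hypothetical target directory.
import subprocess

result = subprocess.run(
    ["codespell", "notebooks/"],
    capture_output=True,
    text=True,
)
# codespell emits one "path:line: typo ==> suggestion" record per finding
for line in result.stdout.splitlines():
    print(line)
```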
This could be useful, particularly if we can find a way to add additional rules?

languagetool (a Java application) has a Python wrapper: jxmorris12/language_tool_python.
```python
# jxmorris12/language_tool_python
import language_tool_python

tool = language_tool_python.LanguageTool('en-US')
text = 'A sentence with a error in the Hitchhiker’s Guide tot he Galaxy'
matches = tool.check(text)  # a list of Match objects describing each issue
print(matches)
```
Findus23/pyLanguagetool is a Python wrapper for the JSON API (docs) that could be used to check individual cells (or to spellcheck an .md version of a notebook created via jupytext etc.). Note that this does not seem to bundle a languagetool server.
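If I'm reading the pyLanguagetool docs right, the basic call looks something like the following (a sketch; treat the exact signature as an assumption, and note the api_url could equally point at a locally run languagetool server):

```python
# Sketch based on the pyLanguagetool README; signature is an assumption.
from pylanguagetool import api

response = api.check(
    "A sentence with a error in it.",
    api_url="https://api.languagetool.org/v2/",  # public endpoint
    lang="en-US",
)
# The JSON response contains a "matches" list describing each issue found
for match in response["matches"]:
    print(match["message"], "->", match["replacements"][:3])
```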
See also Code Fragment: Highlighting Typos
pylint has a spelling checker that I think checks things like comments? docs

e.g. something like `pylint --disable all --enable spelling --spelling-dict en_US test.py`

This might also need enchant and pyenchant?
The pyspelling package provides an architecture for parsing and spellchecking a variety of document types (markdown, Python, HTML etc.) and filtering different objects (e.g. URLs). But it's a bit tricky to get your head round.

Could we instead create our own pipeline, e.g. using something like spacy and the contextualSpellCheck pipeline element? A simple learning task in spacy pipeline step creation might be to try to develop a simple duplicate word detector.
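As a starting point, a duplicate word detector needs nothing beyond the tokenizer; a minimal sketch (my own, not tested against notebook content) as a custom spacy component:

```python
# Sketch: flag a token when it repeats the previous word, e.g. "the the".
import spacy
from spacy.language import Language
from spacy.tokens import Token

Token.set_extension("is_duplicate", default=False)

@Language.component("duplicate_word_detector")
def duplicate_word_detector(doc):
    # Compare each token with its predecessor, case-insensitively
    for prev, token in zip(doc, doc[1:]):
        if token.is_alpha and token.lower_ == prev.lower_:
            token._.is_duplicate = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("duplicate_word_detector")

doc = nlp("We can check the the notebook for duplicated words.")
print([t.text for t in doc if t._.is_duplicate])  # ['the']
```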
For other possible spell checkers, see:

- spylls: a Python implementation of hunspell;
- pyenchant: Python bindings for the Enchant spellchecker (Enchant: "a library (and command-line program) that wraps a number of different spelling libraries and programs with a consistent interface"); example: manual_spellchecker.
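pyenchant's single-word API is simple enough that it could back a custom lookup step; a minimal sketch (assumes the underlying Enchant C library is installed alongside the Python bindings):

```python
# Sketch: single-word checking and suggestions with pyenchant.
import enchant

d = enchant.Dict("en_US")
print(d.check("tokenisation"))   # likely False against a US dictionary
print(d.suggest("notbook")[:3])  # e.g. ['notebook', ...]
```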