Typo checker integration

opensistemas-hub / osbrain

osBrain - A general-purpose multi-agent system module written in Python

https://osbrain.readthedocs.io/en/stable/

Apache License 2.0


Typo checker integration #268

Open Peque opened 6 years ago

Peque commented 6 years ago

Maybe we could integrate a typo-checker in the test suite. I know they exist, but I have not looked at them.

In non-Python projects I have previously used checkpatch.pl, which is used in the Linux kernel, but hopefully there is a nice Python package already out there... :smile:

Due to the fact that I frequently make typos when writing in English, this might be important... And I will learn English too! :joy:

ocaballeror commented 6 years ago

I've been looking into this for a while, but didn't find much.

The ideal thing would be a plugin for pytest that could do the job, but I didn't find any, which is kind of a bummer.

The closest thing I found to what I had in mind is scspell, which is run as an external tool, but it will take a bit of work, since we would need to create a personalized dictionary with all the specific words we use (osbrain, pytest, nameserver...).

I may revisit this in the future, but it's not really a priority right now.

ocaballeror commented 6 years ago

I ran a manual check with scspell and went through the list of results, picking out manually which ones were actual typos. The results are in #285 .

There was no configuration involved, which means most of the results it reported were simply unknown words and weird variable names. That is the main reason why this would take a while to implement. I could have created a custom dictionary with the words that it should recognize as valid, but that would be hard to maintain in the future, and would probably create lots of commits that are simply named "updated dictionary".

Unless we find a way to store the dictionary file outside of the repo, this doesn't look very promising.

Peque commented 6 years ago

Yeah, I was thinking more about the approach in Linux's checkpatch.pl. They do not have a dictionary with all the valid English words; instead, they have a dictionary of common typos, so they only report an error when it is very likely a real one.

To tokenize everything, we could split text on any non-letter character (spaces, digits, underscores...) and check each token against the common-typos dictionary. We could go further (thinking about class names) and also split camel-case words.
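A minimal sketch of that idea, just to make it concrete. The `COMMON_TYPOS` entries and the function names are made up for illustration (checkpatch.pl ships a much longer list), and the camel-case regex is only one possible way to do the split:

```python
import re

# Hypothetical common-typos mapping; a real list would be much longer.
COMMON_TYPOS = {
    "recieve": "receive",
    "adress": "address",
    "seperate": "separate",
}

# Runs of letters only, so digits, underscores and punctuation act as separators.
WORD_RE = re.compile(r"[A-Za-z]+")
# Inside a run of letters: acronyms ("HTTP"), capitalized words ("Server"),
# or plain lowercase words ("name").
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def tokenize(text):
    """Split text into lowercase words, breaking on non-letters and camel case."""
    tokens = []
    for run in WORD_RE.findall(text):
        tokens.extend(part.lower() for part in CAMEL_RE.findall(run))
    return tokens


def find_typos(text):
    """Return (typo, suggestion) pairs for tokens found in COMMON_TYPOS."""
    return [(t, COMMON_TYPOS[t]) for t in tokenize(text) if t in COMMON_TYPOS]


if __name__ == "__main__":
    print(find_typos("def RecieveAdress(seperate_items2): pass"))
```

Since only known typos are reported, unknown words such as `osbrain` or `nameserver` pass silently, so no project-specific dictionary would be needed.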

This would be awesome to have as a separate package, maybe integrated with flake8. I am currently busy with other projects in my spare time, but I might, at some point, spend some time on it. I do not think that would be before June though... :joy:

ocaballeror commented 6 years ago

Yeah, that would be awesome!

I'm still kind of surprised nobody has done anything like this before. I mean, somebody else must have run into the same issue at some point.

Peque commented 6 years ago

@ocaballeror Literally everybody else. I know no project without a "Fix typo" commit in their history. :joy:

ocaballeror commented 6 years ago

https://github.com/search?q=fix+typo&type=Commits

I just looked it up and GitHub reports 54,166,865 commit messages with the words "fix typo" :laughing: :laughing:

By the way, when doing that search I also stumbled upon this: https://github.com/intgr/topy

It looks better than scspell, since it works by recognizing common typos instead of using a standard dictionary.

I just tried it out and it looks very easy to integrate. Just run one command and it generates a patch for your project with the typo corrections. You can even tell it to apply those corrections automatically if you want. The best thing is that it only reported a couple of false positives, so it looks reliable enough in my opinion.
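To give an idea of what "generates a patch" means in practice, here is a rough sketch using only the standard library. This is not how topy is implemented, and it does not use topy's rules or command line. The `CORRECTIONS` mapping and the function names are made up for illustration:

```python
import difflib
import re

# Hypothetical corrections; not taken from topy's rule set.
CORRECTIONS = {"recieve": "receive", "seperate": "separate"}
# Match any known typo as a whole word (lowercase only, to keep the sketch short).
TYPO_RE = re.compile(r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b")


def fix_typos(text):
    """Replace each known typo with its correction."""
    return TYPO_RE.sub(lambda m: CORRECTIONS[m.group(1)], text)


def make_patch(text, filename):
    """Return a unified diff that turns `text` into its corrected version."""
    fixed = fix_typos(text)
    diff = difflib.unified_diff(
        text.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile="a/" + filename,
        tofile="b/" + filename,
    )
    return "".join(diff)


if __name__ == "__main__":
    print(make_patch("We recieve data.\nAll good.\n", "README.md"), end="")
```

Applying the corrections automatically would then amount to writing the output of `fix_typos` back to the file instead of printing the diff.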