smithy-lang / smithy

Smithy is a protocol-agnostic interface definition language and set of tools for generating clients, servers, and documentation for any programming language.
https://smithy.io
Apache License 2.0
1.76k stars 206 forks

Add spell check linter #668

Open mtdowling opened 3 years ago

mtdowling commented 3 years ago

It would be nice to have a spell-check model linter that looks for spelling issues in shape names, member names, documentation, and maybe any string value. Checking every string might be hard, but it would be interesting to see whether it's possible and worthwhile (it could lead to severe performance issues and too many false positives). This linter should have a default dictionary that can be extended with a custom newline-separated string of words. Custom words are a hard requirement, since most models use domain-specific terminology that isn't feasible to capture in the default list of words. The spell checker doesn't necessarily need to offer spelling suggestions, which likely makes it easier to implement. Sentences would need to be broken down into individual words by tokenizing strings on delimiters like " ", "-", ",", ".", ";", ":", "_", etc.
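To make the idea concrete, here is a minimal sketch in Python (not Smithy's actual validator API) of the tokenize-then-look-up flow described above. The function names are hypothetical, the delimiter set is just the one listed in this comment, and splitting PascalCase/camelCase shape names into words is an extra assumption that the description does not spell out.

```python
import re

# Delimiters listed in the issue; a real linter would likely need more.
_DELIMITERS = re.compile(r"[\s\-,.;:_]+")

# Assumption beyond the issue text: split names like "GetFooInput" or
# "HTTPHeader" into separate words. Characters outside A-Z/a-z (digits,
# apostrophes, etc.) are simply dropped in this sketch.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+")


def tokenize(text):
    """Break a string into lowercase words."""
    words = []
    for chunk in _DELIMITERS.split(text):
        words.extend(match.group(0).lower() for match in _WORD.finditer(chunk))
    return words


def parse_custom_words(newline_separated):
    """Turn the custom newline-separated string into a set of words."""
    return {
        line.strip().lower()
        for line in newline_separated.splitlines()
        if line.strip()
    }


def find_misspellings(text, default_words, custom_words):
    """Return unknown words in first-seen order, without duplicates."""
    unknown = []
    for word in tokenize(text):
        known = word in default_words or word in custom_words
        if not known and word not in unknown:
            unknown.append(word)
    return unknown
```

Since no spelling suggestions are produced, each check is a plain set lookup, which keeps the per-word cost small even if every string in a model is scanned.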

The best dictionary I know of is https://github.com/dwyl/english-words, though the license is unclear, and we'll need to filter out bad words. The dictionary is around 4 MB, so we'll need to make sure we don't have to load the file repeatedly or store it in memory multiple times.
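One way to avoid loading the file repeatedly or holding several copies in memory is to cache the parsed dictionary per path. The sketch below is in Python rather than Smithy's own codebase, and `load_dictionary` and the one-word-per-line file layout are assumptions for illustration.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_dictionary(path):
    """Read a one-word-per-line file once and cache the result by path.

    Later calls with the same path return the same frozenset object, so
    the word list is neither re-read from disk nor duplicated in memory.
    """
    text = Path(path).read_text(encoding="utf-8")
    return frozenset(
        line.strip().lower() for line in text.splitlines() if line.strip()
    )
```

Returning a `frozenset` means every caller can share the cached object safely, because none of them can mutate it.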

JordonPhillips commented 3 years ago

> we'll need to filter out bad words

This is remarkably harder than you'd think. I've yet to find a blocklist that catches every bad word in even the one listed repo, and applying stemming techniques only gets you so far.

PatMyron commented 3 years ago

Being able to ignore anything that matches a regular expression pattern, in addition to dictionaries of specific words, is critical.
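As a rough sketch of what that could look like (Python, with hypothetical function names and example patterns, not an existing Smithy feature), tokens that fully match any ignore pattern are skipped before the dictionary lookup:

```python
import re


def compile_ignore_patterns(patterns):
    """Compile user-supplied ignore patterns once, up front."""
    return [re.compile(pattern) for pattern in patterns]


def is_ignored(token, ignore_patterns):
    """A token is ignored when any pattern matches the whole token."""
    return any(pattern.fullmatch(token) for pattern in ignore_patterns)


def unknown_words(tokens, dictionary, ignore_patterns):
    """Return tokens that are neither in the dictionary nor ignored."""
    return [
        token
        for token in tokens
        if token.lower() not in dictionary
        and not is_ignored(token, ignore_patterns)
    ]
```

For example, a pattern such as `[A-Z]{2,}\d*` would let acronym-style tokens like `EC2` or `SQS` through without having to list each one in a custom dictionary.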