Feature request: support multiple locales in spelling-dict option

pylint-dev / pylint

It's not just a linter that annoys you!

https://pylint.readthedocs.io/en/latest/

GNU General Public License v2.0

5.32k stars 1.14k forks source link

Feature request: support multiple locales in spelling-dict option #8083

Open jayaddison opened 1 year ago

jayaddison commented 1 year ago

Current problem

Recently, spellchecking the word onwards using an en_US failed during a pylint.git continuous integration workflow; it would likely succeed with most en_GB dictionaries.

The problem is more general, though: codebases may have contributors who read and write using a variety of locales.

Desired solution

The spelling-dict option could be extended to accept multiple locale options to spellcheck against.

Additional context

pyenchant supports multi-language spellchecking.

Pierre-Sassoulas commented 1 year ago

I like this one, we already had the issue multiple time with behavior, color, etc.

jayaddison commented 1 year ago

Thanks @Pierre-Sassoulas - yep, it would be a nice (low priority) improvement. I've been experimenting with implementing it locally, and made some progress, but have found some odd argparse / optparse behaviour and am not certain whether I'll contribute an implementation soon.

Pierre-Sassoulas commented 1 year ago

have found some odd argparse / optparse behaviour

It would be easier if pylint's input needs to exactly match pyenchant input and is passed to it directly (so if pyenchant handle multidict, then giving a string with multi dict works). Maybe we need to remove our own argparsing, parse spelling args with pyenchant on a dummy text during argparsing and before launching our checks, to be able to display the errors raised by pyenchant.

jayaddison commented 1 year ago

It seems that although pyenchant has support for multiple language dictionaries as datasources, it can only spellcheck using one-at-a-time.

Also, the API that pyenchant provides includes tokenization -- which could be relevant because we would need to decide how to update the existing pylint spellcheck tokenization.

Some other Python spellcheck libraries like symspellpy support construction of multi-language spellchecker objects -- but the dictionary file(s) used would be different, so a change in library could result in spellcheck differences for pylint users.

It would be easier if pylint's input needs to exactly match pyenchant input and is passed to it directly (so if pyenchant handle multidict, then giving a string with multi dict works). Maybe we need to remove our own argparsing, parse spelling args with pyenchant on a dummy text during argparsing and before launching our checks, to be able to display the errors raised by pyenchant.

@Pierre-Sassoulas What would be the purpose of the dummy text?

Pierre-Sassoulas commented 1 year ago

Not sure what I meant by "dummy text" at the time but looking at the problem again I suppose one way to rephrase it would be "text without signification that we would not try to interpret during configuration parsing", i.e. pylint just being a middleware between user config and pyenchant.

jayaddison commented 1 year ago

If I understand you correctly, that could mean decoupling the choice validation from enchant known-dictionary list?

That could make sense. It seems unclear whether the existing spelling-dict option (and choice validation) are a good place for this.

A few things I've learned from prototyping an implementation:

The feature probably doesn't make sense across multiple languages. Multiple locales, perhaps - but only within a single language.
- It likely doesn't make sense to tokenize each input docstrings/comments more than once.
- Suggestions (corrections offered from the dictionary) from multiple languages could be confusing.

Also:

Compatibility of the configuration file formats is important for migration purposes, and it's not immediately obvious how to approach that for a scalar value that would become multi-valued.
- The ini-format (pylintrc) config files tend to use csv option type for multi-valued items.
- The csv option type doesn't easily interoperate with choices.

I'm becoming less convinced that this is a worthwhile feature to implement. A single locale to spellcheck during each pylint run is simple implementation-wise, configuration-wise and also probably easiest to understand.

It might make more sense for users who would like to accept alternate spellings to consider using the existing custom-word-list feature, or to merge dictionary source files locally. Alternatively it may be possible to use the multi-dictionary support in aspell (and perhaps other spellcheck providers) from enchant -- I haven't confirmed that, though.

Pierre-Sassoulas commented 1 year ago

Suggestions (corrections offered from the dictionary) from multiple languages could be confusing.

Unless the dict are en_us and en_gb, or fr, and fr_be or similar which I think are the only valid reasons to use multiple dicts

If I understand you correctly, that could mean decoupling the choice validation from enchant known-dictionary list?

If we want to be fancy about validations and if I understand correctly what pyenchant is doing, we'd need the ','.join(itertools product) of what we have now without the empty string, + what we have now. (i.e: if we had ["", "en_us", "en_gb", "fr"] we now need ','.join(itertools.product(v for f in ["", "en_us", "en_gb", "fr"] if v) + ["", "en_us", "en_gb", "fr"]... But it only work for combination of 2 dicts not more ! (en_us, en_gb, not en_us, en_gb, fr). If we want to provide all the possible combinations we need to redo this n time with n the number of dict available in pyenchant. This feel overly complicated, but it's possible. It's probably better to not provide a "choices" field in argparse and just validate that each element of a csv list of pyenchant are all valids installed pyenchant dict. (Or not provide choices, not validate either, and catch pyenchant errors if/when they happen, which is what I'm suggesting 😄). This would also fix the annoying update of the doc depending on installed pyenchant dict locally. (https://github.com/pylint-dev/pylint/pull/8638#discussion_r1181248636)

Compatibility of the configuration file formats is important for migration purposes

Sure but we did not allow multiple values before, so the current valid configurations all contains valid comma separated list of exactly one value, right ?

jayaddison commented 1 year ago

It's probably better to not provide a "choices" field in argparse and just validate that each element of a csv list of pyenchant are all valids installed pyenchant dict. (Or not provide choices, not validate either, and catch pyenchant errors if/when they happen, which is what I'm suggesting :smile:). This would also fix the annoying update of the doc depending on installed pyenchant dict locally. (https://github.com/pylint-dev/pylint/pull/8638#discussion_r1181248636)

Compatibility of the configuration file formats is important for migration purposes

Sure but we did not allow multiple values before, so the current valid configurations all contains valid comma separated list of exactly one value, right ?

Yep, OK, that all makes sense, and using unvalidated CSV options would be a straightforward, backwards-compatible approach.

The self-documenting behaviour of the dictionary choices seems potentially useful in some cases (except when causing unexpected source control changes!). Perhaps it's more valuable at the command-line than in the autogenerated documentation, though.

I have a work-in-progress branch that attempts to maintain the choices validation while providing multi-locale-within-one-language (en-US, en-CA ok, fr-FR, fr-CA ok; xx-AA, yy-AA not ok). I could open that as a draft pull request if that'd be worthwhile?

Pierre-Sassoulas commented 1 year ago

I have a work-in-progress branch that attempts to maintain the choices validation while providing multi-locale-within-one-language (en-US, en-CA ok, fr-FR, fr-CA ok; xx-AA, yy-AA not ok). I could open that as a draft pull request if that'd be worthwhile?

Sure, that would be worthwhile, thank you for working on this, much appreciated ! I think if someone want to mix fr/en or es/en we shouldn't prevent them from doing it. But we should prevent pyenchant crashes that would also make pylint crash if the input given to pyenchant is invalid. Do not hesitate to open a draft merge request so we can discuss the code directly :)