proposal: a simplified and generalized invalid-name config

pylint-dev / pylint

It's not just a linter that annoys you!

https://pylint.readthedocs.io/en/latest/

GNU General Public License v2.0

5.27k stars 1.13k forks source link

proposal: a simplified and generalized invalid-name config #7305

Open bukzor opened 2 years ago

bukzor commented 2 years ago

Current problem

The current configuration system for naming style in pylint can be seen as simultaneously overly rigid and overly flexible. There are two distinct ways to configure naming schemes. Primarily there is a named-scheme selection where users choose among lower_case UPPER_CASE PascalCase, etc., and there is a regex-based (-rgx) configuration where the user meant to write a regular expression that matches all (and only) permissible variable names.

As a concrete example:

[BASIC]
const-naming-style=UPPER_CASE
# from https://github.com/openqasm/openqasm/blob/main/.pylintrc
variable-rgx=[a-z_][a-z0-9_]{2,30}$

The named-style is convenient for users, but gives no clarity as to what exactly will match, and admits no ability to make small tweaks to its definition. The regex configuration is its near opposite, being completely unambiguous and easy to modify but painful-to-impossible to use for the average human.

Desired solution

This is my (tentative!) proposal: (keep in mind that all details will/would be modified to reflect maintainers' requirements)

Let's move this configuration out of "basic" to a new, separate "naming" namespace.

[NAMING]
variable = lower_case

[NAMING.lower_case]
allowed-characters = [a-zA-Z0-9_]
allowed-first-character = [a-z]
min-length = 3
max-length  = 0  # no limit

Or, alternatively: (I prefer the above)

[NAMING]
upper-case.allowed-characters = [A-Z0-9_]
upper-case.allowed-first-character = [A-Z]
upper-case.min-length = 0  # no limit
upper-case.max-length  = 30

Each naming style is be defined by just four characteristics: allowed characters, allowed first-letter character, minimum and maximum length. If we can expose those quadruplets to the configuration, then the configuration is easy-to-use (users can largely ignore these fine details), unambiguous, easy to modify, and there's no (obvious) need to expose users to the full footgun of regular expressions.

Note: The "character" configs are required to be a single character-class (i.e. a single-char regex beginning and ending with square brackets). This limitation allows excellent error messaging in the case of a invalid-name error, but also enables programmatic use.

This obviously enables the potential (but not requirement) to allow users to define custom named naming-schemes.

All naming schemes would then be defined as (approximately) f'_{{0,2}}{allowed-first-character}{allowed-characters}{{{min-length},{max-length}}}'

I believe the regular-expression-based configuration could then be phased out entirely, but could of course be retained if wanted as a fifth, optional regex attribute of the naming scheme with default None.

Additional context

This idea was split out of #3704, which considers revamping the invalid-name checker.

I (@bukzor) am volunteering to implement this change, if ratified.

DanielNoord commented 2 years ago

Thanks for the clear proposal @bukzor!

I'm not sure though whether I agree with the changes you propose though. I understand that regular expressions can be hard to grasp for (new) developers, but adding to the complexity by adding 4 new subsettings for each type of "nameable" doesn't seem like it solves the issue of being "overly rigid and overly flexible". Why would we give our users less possibilities (you can no longer write a regular expression yourself) only to have them avoid writing a regular expression once?

Pierre-Sassoulas commented 2 years ago

Thank you for the recap @bukzor.

I think building on what exists right now and fixing the current issues is safer in this case. Redesigning something entirely is a very risky endeavor. We have insight on what is wrong with what we currently have, but not what will be wrong on something we design from scratch. Seeing how much pylint is used there will be something wrong. Every-time we add a new checker we can expect issues to pop up. This is the case for checker affecting a particular statement like consider-using-with that trigger only on context manager... invalid-name is triggering on every name and we're going to miss a dozen edge cases whatever we do (especially with 4 new options added).

I think the major pain point of invalid-name is that handling short name is hard with a regex and the regex also hide the reason why the name was rejected (#2018, the second most upvoted issue in the repo and the message that personally irk me every time I create a new project that prevent me from using pylint without configuration and then prevent me from actually checking for proper PEP8 naming).

So in my opinion what we need is fix #2018 the simplest way we can : For that we need to separate short name and a regex that does not match logically in the code. This is already a lot of work to get it right, as we need to not break any existing disable/enable, refactor a lot of code and functional tests and upgrade the regex for all the name style. I started to work on it for maybe 10 hours before my disk crashed and loosing the code and it was half done at best.

Mostly copy pasting my comment from the previous discussion:

A name that is in the list of accepted name should be ok for both checks.
We need to decide if we raise both invalid-name and name-too-short for name that are violating both. I think we should raise both for consistency but the regex need to be updated as they're not matching valid name that are too short right now.
Also If we raise both we need to decide what happens for name that are short and ambiguous (A is both UPPERCASE, UPPERCASE_WITH_UNDERSCORE and PascalCase, a is both camelCase and snake_case). Right now I think we should allow both. Also ab should be valid snake case but too short, and aB should be invalid and too short, Ab should be valid PascalCase, but not ab.

snake_case name	A	a	Ab	ab	Abcd	abcd
invalid-name	yes	no	yes	no	yes	no
name-too-short	yes	yes	yes	yes	no	no

We need to decide what happens when the user is not using the default regex (maybe they handled name too short in it already?)
There's also name-too-longfor name > 30 characters that was included in the regex and I forgot about it because I never hit it myself

bukzor commented 2 years ago

Thanks for consideration. You can close this if you like. I may reopen this discussion when/if other changes have already landed.

bukzor commented 2 years ago

@DanielNoord

Perhaps what I failed to explain is that these four "subsettings" of each named style already exist, in the current main branch. It's only that they're not factored as such and not exposed to the configuration system. I intend to (with your approval, as a successor to #3704 ) refactor as such, so the only change proposed here is to expose those hard-coded parameters as default configuration.

regular expressions can be hard to grasp for (new) developers

I believe you underestimate this issue dramatically. I estimate more than 75% of self-described "experienced developers" would fail to write a bug free regex for one of these, given three attempts. While regular expressions are very compact, each token (very nearly each character) doubles the state space, which gets beyond human capability very quickly. Consider the number of responsibilities and edge cases handled by the default name regular expression: ([^\W\dA-Z][^\WA-Z]{2,}|_[^\WA-Z]*|__[^\WA-Z\d_][^\WA-Z]+__)$. I think if you asked ten people to count them you'd get ten answers.

I've probably overstated this point dramatically, but I'm only trying to be clear :)

... adding to the complexity by adding 4 new subsettings for each type of "nameable" ...

I believe the total complexity remains the same. It's only that complexity has been shifted away from the user and into the system (where it ideally belongs). Each of the four new subsettings is entirely simple (has a singular, obvious purpose, as opposed to complex).

... doesn't seem like it solves the issue of being "overly rigid and overly flexible".

By boiling down the naming schemes to their essential, atomic parameters and leaving all other considerations to the framework, the system is quite flexible (all current naming schemes and millions more can be represented) but not overly so: it's no longer possible to accidentally disallow __init__ as a module, and we gain the (optional) ability to verify that the user hasn't allowed numbers as the first character or dashes in the name, etc.

you can no longer write a regular expression yourself

That's only an optional part of this proposal. It's not necessary to remove the -rgx configuration. I only meant to say that I believe it would serve no purpose and so could be removed.

That said, the user-facing gains are relatively minor and the implementation cost is not.

bukzor commented 2 years ago

@Pierre-Sassoulas

we're going to miss a dozen edge cases whatever we do (especially with 4 new options added).

I think you've neglected to consider that the four proposed options have just four degrees of freedom, while the current regular expressions have (much more than) thousands of degrees of freedom. This means that in the proposed scheme it's much easier to consider (and test) the various possiblities, because there are so many orders of magnitude fewer.

I think the major pain point of invalid-name is that handling short name is hard with a regex and the regex also hide the reason why the name was rejected (#2018, the second most upvoted issue in the repo and the message that personally irk me every time I create a new project that prevent me from using pylint without configuration and then prevent me from actually checking for proper PEP8 naming).

I didn't add this because I thought it was incidental, and the proposal already got long, but this proposal fixes #2018 very neatly. Because allowed-letters and allowed-first-letter are parameters (e.g. allowed-letters=[a-z0-9_] allowed-first-letter=[a-z] for snake_case) we can provide extremely clear and specific errors now, such as the parameter name 'Baz' does not conform to snake_case style: the first letter 'B' must be in the range '[a-z]' or the constant name 'MyNumber' does not conform to UPPER_CASE style: character 2 'y' is not in range '[A-Z_]'. You could of course get extra fancy with coloring and ascii-art, but I believe this would be enough for you to become un-irked.

we need to separate short name and a regex that does not match logically in the code. This is already a lot of work to get it right

I could work on that before the rest, then, if you like.

DanielNoord commented 2 years ago

I think your reference to degrees of freedom is very helpful for the discussion. What I'm not particularly keen on in your current proposal is that we remove the option with the unlimited degrees of freedom. I think that customisability is a crucial part of pylint and should remain. That said, I can see the value in reducing the amounts of freedom for regular users.

Do you think there would be a way to have -rgx and one or two other options achieve the same? Currently we have 12 types of "nameables". With this proposal we would add 12*4=48 new options, which imo is a bit much. Note that we can't really use configuration namespaces due to the old config types we also need to support so they will need to be --typevar-naming-min-length or something like that.

Pierre-Sassoulas commented 2 years ago

I believe the total complexity remains the same. It's only that complexity has been shifted away from the user and into the system (where it ideally belongs).

Yeah, that and the degree of freedom (that we were dumping on the user so it's not "our" problem) convinced me. Also two small regexes still give a as much freedom, unless I'm mistaken. Nothing prevent someone from just dumping their arbitrarily complex regex in the allowed characters option, they'll just have to migrate the first letter handling in the first letter option and the length handling in the length options.

we can provide extremely clear and specific errors now

It would be great for sure.

Currently we have 12 types of "nameables". With this proposal we would add 12*4=48 new options, which imo is a bit much.

As long as there are default values for PEP8 and most users do not use those options, I don't think it's an issue.

Maybe at first we could start by:

Adding the too-[short|long]-name messages and options
Remove the length part of the default regexes
Make those options mutually exclusive with old options that are incompatible
(Warn the user that use quantifier {3-30} in their non default regex on how to migrate?)

Then later (next minor?):

Add the first letter and allowed letters options
Make them mutually exclusive with old options that are incompatible
Make the new options the default
Deprecate the old regex option in favor of the two new simpler regexes
Warn the user that do not use the default value with an explanation message on how to migrate

Then in 3.0:

remove the old regex as a breaking change.

I think we need to maintain the old regex for some time but maintaining 36 regexes forever is not something I look forward to 😄

Pierre-Sassoulas commented 2 years ago

I just though that the regex allowed to have different length criteria for test functions (i.e. allow super verbose function name if it starts by test_). It's possible to keep doing that with new options by checking that allowed letter start by est_ but it's hacky.

I'm probably missing other use cases. Maybe an argument to keep the old regex and not deprecate.

Wehzie commented 2 years ago

Could pylint adapt naming conventions of mainstream packages and add those as defaults to an exception list whenever these are imported?

For example, consider the purpose of data science and machine learning, as in this issue.

A typical data science project will use some of the following projects, pandas -> add "df" maplotlib -> add "ax" sklearn -> add "X" and "y" scipy -> add "f", ..., "fn", "t", "x", "y"

In case a user wants scipy defaults, except for example "f", there should be a possibility for that.