pylint-dev / pylint

It's not just a linter that annoys you!
https://pylint.readthedocs.io/en/latest/
GNU General Public License v2.0

Warn against unicode identifier normalization #4906

Open ickc opened 3 years ago

ickc commented 3 years ago

Current problem

Unicode normalization in Unicode identifiers, cf.

The issue is that while some characters can be used in an identifier, the identifier is normalized internally. So when it is looked up via __all__ or getattr, this results in an error. (See the 2nd & 3rd links for respective examples.)
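A minimal illustration of the behavior (using exec so the ligature survives inside a source string; the variable name is made up):

```python
import unicodedata

# Python applies NFKC normalization to identifiers at parse time (PEP 3131).
# The source below assigns to a name spelled with U+FB01 (the "fi" ligature),
# but the namespace only ever contains the normalized spelling "fi_x".
source = "\ufb01_x = 42"
namespace = {}
exec(source, namespace)

assert "\ufb01_x" not in namespace  # the raw spelling is never stored
assert namespace["fi_x"] == 42      # only the NFKC form exists
assert unicodedata.normalize("NFKC", "\ufb01_x") == "fi_x"
```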

Desired solution

pylint could warn about an identifier whose normalized form differs from how it is written, and suggest the normalized identifier instead, i.e. notify the user that the identifier will be silently converted internally.

Similarly, pylint could warn about such strings inside __all__ and getattr too, as they can never be looked up: no identifier will ever carry such an unnormalized name.
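A sketch of the getattr side of the problem (the module and attribute name are invented for illustration):

```python
import types
import unicodedata

# A throwaway module whose source defines a name with the U+FB01 ligature;
# the parser stores it under its NFKC-normalized spelling.
mod = types.ModuleType("demo")
exec("\ufb01_value = 1", mod.__dict__)

# getattr does no normalization at runtime, so the raw spelling never matches:
try:
    getattr(mod, "\ufb01_value")
except AttributeError:
    print("unnormalized lookup fails, as described above")

# Only the normalized spelling can be found:
assert getattr(mod, unicodedata.normalize("NFKC", "\ufb01_value")) == 1
```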

Additional context

Edit: the keys of globals() & locals() also share the same problem:

Keys to check

Pierre-Sassoulas commented 3 years ago

Thank you, I did not know about this behavior; it makes a lot of sense to add a check for that in pylint. Would implicit-identifier-normalization work as a name?

ickc commented 3 years ago

That sounds good. Thanks! I added a list of the possible places this happens at the end of the original post above. Feel free to add more if you find any missing.

JulienPalard commented 2 years ago

More context: https://lwn.net/ml/python-dev/85d2c121-86a9-e04a-6dfa-da36b073a95d@gmail.com/

In the following case I'd say it's OK to warn about using a non-normalized identifier; it's an easy fix for the dev:

ﬁ_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127

In the following case it's more complicated to fix, as the two look identical; it's probably the case where normalization just plays its role. The bad news is that human users with keyboards will input the 1st one, while the NFKD form is the 2nd one, so it's not (easily?) fixable from a keyboard:

étage = 42  # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT

But it can be detected as legitimate, as it "round-trips" via canonical composition (examples below).
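The round-trip described above can be checked directly with unicodedata:

```python
import unicodedata

composed = "\u00e9tage"     # é as a single code point (what a keyboard produces)
decomposed = "e\u0301tage"  # e followed by U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed                                 # distinct sequences
assert unicodedata.normalize("NFKD", composed) == decomposed  # decompose...
assert unicodedata.normalize("NFKC", decomposed) == composed  # ...and recompose
```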

In the following one, warning about normalisation is clearly a good idea, as it leads to a "visual bug":

>>> ϕ = 1
>>> φ = 2
>>> print(ϕ, φ)
2 2

About the "it's legit if it round-trips" idea, here's a demo of what I have in mind:

import unicodedata

def is_legit(name):
    """Return True if the name is already in the expected decomposed (NFKD) form,
    None if it's not but is still acceptable, and False otherwise.
    """
    if unicodedata.normalize("NFKD", name) == name:
        return True
    if unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", name)) == name:
        return None
    return False

Which gives:

>>> is_legit("abc123")  # A normal, plain, old, ASCII identifier
True
>>> print(is_legit("été"))  # That's how a french keyboard would input été, in a composed form
None # means it's not normalized but still acceptable.
>>> print(is_legit("étage")) # This is the denormalized form
True # But I don't know how to input this with a keyboard :D
>>> print(is_legit("étage")) # This is the composed form ("keyboard" / "user" input form)
None # The way user write it is acceptable, youhouuuuuuuu!
>>> print(is_legit("ϕ")) # This one is the one introducing a "visual" bug in my last example
False # Note this is the only one refused by my function.
>>> print(is_legit("φ")) # It's normalized form.
True
Pierre-Sassoulas commented 2 years ago

From @JulienPalard's example, and following the amazing work on the unicode checker by @CarliJoy, there's only one false negative left, as pylint now warns about non-ASCII names:

ﬁ_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127
étage = 42  # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage)  # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT
ϕ = 1
φ = 2
print(ϕ, φ)
a.py:5:0: C2401: Variable name "étage" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:7:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:8:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)

The first example with LATIN SMALL LIGATURE FI is still not detected by pylint. I'm pretty sure the unicode checker will be easy to extend to fix this, as it's pretty modular and well designed.

CarliJoy commented 2 years ago

If someone wants to fix this issue, feel free to contact me.

CarliJoy commented 2 years ago

I had a look into it. I don't think this is solvable with acceptable effort.

The issue is that Python already normalizes the name when the variable is defined (screenshot omitted).

Therefore it is impossible for the AST-based checker (i.e. non-ascii-name) to determine that there was an issue at all; otherwise it would already detect it.
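This is easy to verify with the ast module: the tree only ever contains the normalized spelling, so an AST-based checker has nothing to compare against.

```python
import ast

# Parse an assignment whose target uses the U+FB01 ligature; the AST
# already holds the NFKC-normalized identifier, so the original spelling is lost.
tree = ast.parse("\ufb01_fou = 42")
target = tree.body[0].targets[0]
assert target.id == "fi_fou"  # the ligature is gone by the time we see the AST
```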

The only solution would be to check the source code as plain text, like the unicode checker does. But we can't simply check for occurrences of these kinds of characters; we need to extract all variable/name definitions.

So we kind of need to write a simple Python parser, so that

valid = "ﬁ_invalid = '23'"
# or even worse
still_valid = """
ﬁ_invalid = "23"
"""

don't become false positives.

Also we need to consider function definitions, function argument definitions, class definitions and so on. Here are just some examples to give a glimpse of the complexity:

class Invalid_ﬁ:
    def invalid_method_ﬁ(self, invalid_argument_ﬁ):
        ...

with open("file", "w") as invalid_ﬁ:
    ...

try:
    a = 1 / 0
except ZeroDivisionError as invalid_ﬁ:
    ...

So really a kind of custom Python parser.

I would remove the "Good first issue", because I think it isn't ;-)

DanielNoord commented 2 years ago

@CarliJoy Thanks for this investigation. Did you check if these characters also get normalised in the tokens produced by tokenize? If not, this specific check is probably better handled by programs such as flake8 which use tokens instead of the ast representation.

CarliJoy commented 2 years ago

@DanielNoord I didn't. Wasn't on my mind.

DanielNoord commented 2 years ago

Btw, we also have token checkers so we could also implement these ourselves if the tokens do indeed retain these characters.

CarliJoy commented 2 years ago

Tokenize should work. Good idea!

python -m tokenize test.py 
0,0-0,0:            ENCODING       'utf-8'        
1,0-1,5:            NAME           'ﬁ_fou'        
1,6-1,7:            OP             '='            
1,8-1,33:           STRING         '"This will be normalized"'
1,33-1,34:          NEWLINE        '\n'           
2,0-2,5:            NAME           'print'        
2,5-2,6:            OP             '('            
2,6-2,7:            OP             '['            
2,7-2,11:           NAME           'name'         
2,12-2,15:          NAME           'for'          
2,16-2,20:          NAME           'name'         
2,21-2,23:          NAME           'in'           
2,24-2,30:          NAME           'locals'       
2,30-2,31:          OP             '('            
2,31-2,32:          OP             ')'            
2,32-2,33:          OP             '.'            
2,33-2,37:          NAME           'keys'         
2,37-2,38:          OP             '('            
2,38-2,39:          OP             ')'            
2,40-2,42:          NAME           'if'           
2,43-2,47:          NAME           'name'         
2,47-2,48:          OP             '.'            
2,48-2,56:          NAME           'endswith'     
2,56-2,57:          OP             '('            
2,57-2,62:          STRING         '"fou"'        
2,62-2,63:          OP             ')'            
2,63-2,64:          OP             ']'            
2,64-2,65:          OP             ')'            
2,65-2,66:          NEWLINE        '\n'           
3,0-3,5:            NAME           'print'        
3,5-3,6:            OP             '('            
3,6-3,12:           NAME           'fi_fou'       
3,12-3,13:          OP             ')'            
3,13-3,14:          NEWLINE        '\n'           
4,0-4,0:            ENDMARKER      ''
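Since the token stream keeps the original spelling, a check along these lines is conceivable (a rough sketch of the idea discussed above, not pylint's actual checker; the function name is made up):

```python
import io
import tokenize
import unicodedata

def find_unnormalized_names(source):
    """Yield (row, col, name) for NAME tokens whose spelling differs
    from their NFKC normalization.
    """
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            if unicodedata.normalize("NFKC", tok.string) != tok.string:
                yield (*tok.start, tok.string)

# The ligature spelling on line 1 is reported; the normalized one on line 2 is not.
hits = list(find_unnormalized_names("\ufb01_fou = 42\nprint(fi_fou)\n"))
assert hits == [(1, 0, "\ufb01_fou")]
```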