Open ickc opened 3 years ago
Thank you, I did not know about this behavior; it makes a lot of sense to add a check for that in pylint. Would implicit-identifier-normalization work as a name?
That sounds good. Thanks! I added a list of the possible places this happens at the end of the original post above. Feel free to add more if you found some missing.
More context: https://lwn.net/ml/python-dev/85d2c121-86a9-e04a-6dfa-da36b073a95d@gmail.com/
In the following case I'd say it's OK to warn about using a non-normalized identifier, as it's an easy fix for the dev:
ﬁ_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127
In the following case it's more complicated to fix, as the two look identical; it's probably the case where normalization just plays its role. The bad thing is that human users with keyboards will input the 1st one, while the NFKD form is the 2nd one, so it's not (easily?) fixable with a keyboard:
étage = 42 # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage) # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT
But it can be detected as legitimate, as it "round-trips" via canonical composition (examples below).
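The round-trip can be checked directly with the standard unicodedata module (the variable names here are illustrative):

```python
import unicodedata

# "étage" as a keyboard produces it: composed, with U+00E9 LATIN SMALL
# LETTER E WITH ACUTE as a single codepoint.
keyboard = "\u00e9tage"
# Its NFKD decomposition: U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE.
decomposed = unicodedata.normalize("NFKD", keyboard)
print(decomposed == keyboard)                                 # False
print(unicodedata.normalize("NFKC", decomposed) == keyboard)  # True: round-trips
```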
In the following one, warning about normalization is clearly a good idea, as it leads to a "visual bug":
>>> ϕ = 1
>>> φ = 2
>>> print(ϕ, φ)
2 2
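The collision happens because Python normalizes identifiers to NFKC, which maps U+03D5 GREEK PHI SYMBOL to U+03C6 GREEK SMALL LETTER PHI, so both names bind the same variable. A quick check:

```python
import unicodedata

# U+03D5 (ϕ) has a compatibility mapping to U+03C6 (φ), so NFKC folds
# the two visually distinct characters into one identifier.
print(unicodedata.normalize("NFKC", "\u03d5") == "\u03c6")  # True
```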
About the "it's legit if it round-trips" idea, here's a demo of what I have in mind:
import unicodedata

def is_legit(name):
    """Returns True if the name is already in the expected denormalized form,
    None if it's not but acceptable, and False if not.
    """
    if unicodedata.normalize("NFKD", name) == name:
        return True
    if unicodedata.normalize("NFKC", unicodedata.normalize("NFKD", name)) == name:
        return None
    return False
Which gives:
>>> is_legit("abc123") # A normal, plain, old, ASCII identifier
True
>>> print(is_legit("été")) # That's how a French keyboard would input été, in composed form
None # means it's not normalized but still acceptable.
>>> print(is_legit("étage")) # This is the denormalized form
True # But I don't know how to input this with a keyboard :D
>>> print(is_legit("étage")) # This is the composed form ("keyboard" / "user" input form)
None # The way users write it is acceptable, woohoo!
>>> print(is_legit("ϕ")) # This one is the one introducing a "visual" bug in my last example
False # Note this is the only one refused by my function.
>>> print(is_legit("φ")) # Its normalized form.
True
From @JulienPalard's example, following the amazing work on the unicode checker by @CarliJoy, there's only one false negative left, as pylint now warns about non-ASCII names:
ﬁ_fou = 42  # Using the ligature U+FB01 LATIN SMALL LIGATURE FI
print(fi_fou)  # Using only codepoints < 127
étage = 42 # Using U+00E9 LATIN SMALL LETTER E WITH ACUTE
print(étage) # Using U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT
ϕ = 1
φ = 2
print(ϕ, φ)
a.py:5:0: C2401: Variable name "étage" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:7:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)
a.py:8:0: C2401: Variable name "φ" contains a non-ASCII character, consider renaming it. (non-ascii-name)
The first example with LATIN SMALL LIGATURE FI is still not detected by pylint. I'm pretty sure the unicode checker will be easy to extend to fix this, as it's pretty modular and well designed.
If someone wants to fix this issue, feel free to contact me.
I had a look into it. I don't think this is solvable with acceptable effort.
The issue is that Python already normalizes the name when defining the variable.
Therefore it is impossible for the AST-based checker (i.e. non-ascii-name) to determine that there was an issue at all; otherwise it would already detect it.
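The normalization can be seen in the AST itself; a minimal sketch (the one-line source is illustrative):

```python
import ast

# The source spells the identifier with the U+FB01 ligature, but the parser
# applies NFKC normalization, so the AST already stores the plain-ASCII name
# and an AST-based checker has nothing left to detect.
tree = ast.parse("\ufb01_fou = 42")
print(tree.body[0].targets[0].id)  # fi_fou
```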
The only solution would be to check the source code as plain text, like the unicode checker does. But we can't simply check for the occurrence of these kinds of characters; we need to extract all variable/name definitions.
So we would kind of need to write a simple Python parser, so that
valid = "ﬁ_invalid = '23'"
# or even worse
still_valid = """
ﬁ_invalid = "23"
"""
don't become false positives.
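To illustrate why a plain character scan misfires here: the ligature only occurs inside a string literal, which the tokenizer classifies as STRING, never as NAME (a sketch, not pylint's actual code):

```python
import io
import tokenize

# The ligature appears only inside the string literal on the right-hand side.
src = 'valid = "\ufb01_invalid = \'23\'"\n'
names = [tok.string
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)
         if tok.type == tokenize.NAME]
print(names)  # ['valid'] -- the ligature never shows up as an identifier token
```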
Also we need to consider function definitions, function argument definitions, class definitions and so on. Here are just some examples to give a glimpse of the complexity:
class Invalid_ﬁ:
    def invalid_method_ﬁ(invalid_argument_ﬁ):
        ...

with open("file", "w") as invalid_ﬁ:
    ...

try:
    a = 1 / 0
except ZeroDivisionError as invalid_ﬁ:
    ...
So really, a kind of custom Python parser.
I would remove the "Good first issue" label, because I think it isn't ;-)
@CarliJoy Thanks for this investigation. Did you check if these characters also get normalised in the tokens produced by tokenize? If not, this specific check is probably better handled by programs such as flake8, which use tokens instead of the AST representation.
@DanielNoord I didn't. Wasn't on my mind.
Btw, we also have token checkers, so we could also implement this ourselves if the tokens do indeed retain these characters.
Tokenize should work. Good idea!
python -m tokenize test.py
0,0-0,0: ENCODING 'utf-8'
1,0-1,5: NAME 'ﬁ_fou'
1,6-1,7: OP '='
1,8-1,33: STRING '"This will be normalized"'
1,33-1,34: NEWLINE '\n'
2,0-2,5: NAME 'print'
2,5-2,6: OP '('
2,6-2,7: OP '['
2,7-2,11: NAME 'name'
2,12-2,15: NAME 'for'
2,16-2,20: NAME 'name'
2,21-2,23: NAME 'in'
2,24-2,30: NAME 'locals'
2,30-2,31: OP '('
2,31-2,32: OP ')'
2,32-2,33: OP '.'
2,33-2,37: NAME 'keys'
2,37-2,38: OP '('
2,38-2,39: OP ')'
2,40-2,42: NAME 'if'
2,43-2,47: NAME 'name'
2,47-2,48: OP '.'
2,48-2,56: NAME 'endswith'
2,56-2,57: OP '('
2,57-2,62: STRING '"fou"'
2,62-2,63: OP ')'
2,63-2,64: OP ']'
2,64-2,65: OP ')'
2,65-2,66: NEWLINE '\n'
3,0-3,5: NAME 'print'
3,5-3,6: OP '('
3,6-3,12: NAME 'fi_fou'
3,12-3,13: OP ')'
3,13-3,14: NEWLINE '\n'
4,0-4,0: ENDMARKER ''
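A self-contained version of the same experiment (the snippet is illustrative): tokenize hands back the raw, unnormalized spelling of the identifier, which is exactly what a token-based checker needs.

```python
import io
import tokenize

src = "\ufb01_fou = 42\n"  # identifier spelled with the U+FB01 ligature
name_tok = next(tok for tok in tokenize.generate_tokens(io.StringIO(src).readline)
                if tok.type == tokenize.NAME)
print(ascii(name_tok.string))  # '\ufb01_fou' -- the ligature survives tokenization
```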
Current problem
Unicode normalization in unicode identifiers, cf. "__all__" - Python tracker. The issue is that while some characters can be used in an identifier, internally it is normalized. So when looked up using __all__ or getattr, it results in an error. (See the 2nd & 3rd links for respective examples.)
Desired solution
pylint could warn when an identifier differs from its normalized form, and suggest the normalized identifier instead, i.e. notify that the identifier will be silently converted internally.
Similarly, pylint could warn about such strings inside __all__ and getattr too, as they can never be looked up, since no identifier can carry such an unnormalized string.
Additional context
Edit: keys of globals() & locals() also share the same problem.
Keys to check
__all__[key]
getattr(..., key)
__getattr__(..., key)
globals()[key]
locals()[key]
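A small demonstration of the lookup failure (the names are illustrative): because the definition is normalized at compile time, the raw ligature string can never be found as a key.

```python
ns = {}
exec("\ufb01_fou = 1", ns)  # the identifier is NFKC-normalized at compile time
print("\ufb01_fou" in ns)   # False: the raw spelling is not a key
print("fi_fou" in ns)       # True: only the normalized name exists
```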