Closed asottile closed 2 years ago
There was a discussion about the larger issue of code being ”misleadingly different from what it looks like", and the general consensus was that this should be solved in editors, linters and code review tools, rather than in Python. IMO it was this thread but might need more digging.
That doesn't mean handling of \0
can't be changed. Or of control characters in general: see the informational PEP 672 for some more details.
Perhaps all unusual control characters should be banned? Even in strings? With a warning for 2 releases, according to the backwards compatibility policy – waf would need to switch to e.g. base64
.
that was the rationale for making this public, but it's different than those as it's a mishandling by the parser rather than a quirk of how unicode displays
Ugh, this is freakin amazing.
C:\Users\user>more t.py
"""
How do i print 'Hello World' in Python?
Here is how:
print('Hello World')
"""
And run it:
C:\Users\user>py t.py
Hello World
open('t.py', 'w').write("\0'''\0How do i print \'Hello World\' in Python?\n\0Here is how:\nprint(\'Hello World\')\0'''")
ideally I would want the interpreter to reject files containing null bytes as a SyntaxError (and update the ast.parse error to a SyntaxError as well)
FWIW I'd be in favor of that behavior change as a Feature. I wouldn't backport it as a bug though.
there's kinda 3 things I think:
ValueError
=> SyntaxError
in ast.parse
(probably 3.12+)SyntaxError
(probably also 3.12+)(char*, size_t)
or whatever function (probably backportable?)does it make sense to pursue these three things and separately?
there's kinda 3 things I think:
1. convert `ValueError` => `SyntaxError` in `ast.parse` (probably 3.12+)
Agreed, it's changing behavior, so can't backport. I like SyntaxError for this situation. The root exception isn't in ast.parse(), it's in compile(), probably even deeper.
2. treat source files containing null bytes as `SyntaxError` (probably also 3.12+)
That's the biggest hole, should probably do this first.
3. fix the thing considering it as a c string by using `(char*, size_t)` or whatever function (probably backportable?)
Depends on where that fix is (can you tell I haven't looked at the source code yet? :-). If any of the affected functions are public it's going to be more difficult.
does it make sense to pursue these three things and separately?
Likely.
2. treat source files containing null bytes as `SyntaxError` (probably also 3.12+)
That's the biggest hole, should probably do this first.
Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work.
Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work.
Are you sure that works? Unlike Unix shells, Python parses the entire source file before executing any code. How would you get Python to ignore a blob of arbitrary binary data embedded in the source code, even if \0
is accepted? If the blob contains \n
characters you can't hide it behind a #
comment. I suppose you could prefix it with a """
quote, if you can arrange for the file to also end in """
, and you're lucky that the blob doesn't contain embedded """
sequences.
But if I had to do something like that I'd probably just embed the Python code in a bash script as a "here" document and end the bash script with exit
.
Oh you're right, I guess what I've seen do that is a bash+python+data hybrid monster.
Okay then we should be safe banning \0
in files starting with 3.12.
I suspect one could probably hack something together like this unfortunately
# coding: latin1
with open(__file__, 'rb') as f:
contents = f.read().split(b'### BINARY\n')[1]
GARB = '''\
### BINARY
(actual binary here)
### BINARY
'''
You'd still have to arrange for the actual binary not to contain the sequence '''
-- because that would end the string started at GARB =
. I don't see how setting the coding to Latin-1 changes matters. (Honestly it seems you're making the same mistake as Greg.)
latin1 is to prevent a decoding error while parsing the source -- I linked an in-the-wild example of such a file in the original post
Okay, that's impressive. It looks like the blob is lightly encoded -- \n
and \r
are encoded using #.
and #&
. (I'm curious what they'd do if the blob contains one of those sequences, there doesn't seem to be a way to quote them.)
The proposed change in 3.12 will break them, but they have version checks and they can just cope with it, I don't think we need to preserve this machine-dependent quirk forever. The code looks like it had to deal with various other versioning issues already (e.g. it tries to handle Python 2 and 3!).
fix the thing considering it as a c string by using (char*, size_t) or whatever function (probably backportable?)
The root issue of this bug is that the Python parser is implemented in C which treats the NUL byte/character as the string terminator? So it's more a limitation of the current CPython implementation, but Python might support NUL bytes/characters later? Or do you think that it's always a bad practice to have NUL bytes/characters in a source file?
IMO it's always misleading to have NUL in a source file and it should be banned. I don't see any legit use case for that.
By the way, in Python, it's trivial there are many ways to create a byte, character, or string containing NUL:
bytes((0,))
: bytechr(0)
: characterb'[\x00]'
: bytes string'[\x00]'
: Unicode stringI agree we should always ban NUL in source files, like we already do when parsing from a string.
Oh right, I didn't notice that!
>>> compile("x=1\x00", "string", "exec")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: source code string cannot contain null bytes
So yeah, I agree to always ban null characters/bytes in files as well.
I suspect one could probably hack something together like this unfortunately
# coding: latin1 with open(__file__, 'rb') as f: contents = f.read().split(b'### BINARY\n')[1] GARB = '''\ ### BINARY (actual binary here) ### BINARY '''
This is actually a real (albeit very odd) use case for me! It can be used to demonstrate hash collision attacks. There are tools which will generate two blobs of binary data with the same MD5 hash, which can be used to create two Python files with the same hash, but different behaviors. This is a fairly common lab for intro security classes (example, other example). This is just another reason to disable this functionality IMO, but there could be other unexpected reasons why somebody may need to include arbitrary binary in a source file.
It seems like the issue #97556 is a duplicate of this issue.
It seems like the issue #97556 is a duplicate of this issue.
They're very similar, although I don't believe they're exact duplicates- #97556 deals with a particular parsing error in Python 3.10, which fatally fails to parse a string literal containing a NULL character. This issue deals with the larger problem of all source code being ignored after a null character: it applies even outside of string literals, although doing so is a syntax error, and in pre-Python 3.10 versions as well.
Closing this. Starting with Python 3.12, NUL bytes won't be allowed in source code read from files. We're not backporting this since this could be considered a feature by some.
Starting with Python 3.12, NUL bytes won't be allowed in source code read from files
It was changed by the commit aab01e3524d966dca6e72c718a2c71403a14e47c: PR #97594.
Just for reference: The problem is already mentioned in https://peps.python.org/pep-0672/#control-characters
And Pylint checks for null characters already.
Bug report
the parser mishandles lines containing null bytes when parsing source -- this allows the code to be misleadingly different from what it looks like.
I've been told by security@ that it is ok to post this publicly.
in the below example,
<NUL>
is an actual null byte:and the execution and appearance in the terminal:
it appears that after splitting the source into lines, the individual lines are treated as c strings and so the null terminator is misinterpreted, jamming the string contents together and it executes similar to this:
note that if you want to write out a file like this here's a simple bit of code you can paste into an interactive prompt:
here is perhaps a shorter example:
I originally found this due to a bug report where the
ast
parser rejects code containing null bytes:ideally I would want the interpreter to reject files containing null bytes as a
SyntaxError
(and update theast.parse
error to aSyntaxError
as well) -- though it appears there are some of these files in the wild -- such as https://github.com/univention/univention-corporate-server/blob/5.0-2/services/univention-ldb-modules/buildtools/bin/waf-svnYour environment
Linked PRs