asottile commented 2 years ago

Bug report

the parser mishandles lines containing null bytes when parsing source -- this allows the code to be misleadingly different from what it looks like.

I've been told by security@ that it is ok to post this publicly.

in the below example, <NUL> is an actual null byte:

x = '<NUL>' nothing to see here
';import os;os.system('echo pwnd')

and the execution and appearance in the terminal:

$ cat t.py
x = '' nothing to see here
';import os;os.system('echo pwnd')
$ python3 t.py
pwnd

it appears that after splitting the source into lines, the individual lines are treated as c strings and so the null terminator is misinterpreted, jamming the string contents together and it executes similar to this:

x = '';import os;os.system('echo pwnd')

note that if you want to write out a file like this here's a simple bit of code you can paste into an interactive prompt:

open('t.py', 'w').write("x = '\0' nothing to see here\n';import os;os.system('echo pwnd')\n")

here is perhaps a shorter example:

open('t.py', 'w').write("x = 1\0 + 1\n+2\nprint(x)\n")

I originally found this due to a bug report where the ast parser rejects code containing null bytes:

>>> import ast
>>> ast.parse("x = '\0'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
ValueError: source code string cannot contain null bytes
>>> ast.parse(b"x = '\0'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
ValueError: source code string cannot contain null bytes

ideally I would want the interpreter to reject files containing null bytes as a SyntaxError (and update the ast.parse error to a SyntaxError as well) -- though it appears there are some of these files in the wild -- such as https://github.com/univention/univention-corporate-server/blob/5.0-2/services/univention-ldb-modules/buildtools/bin/waf-svn

Your environment

CPython versions tested on: 3.7 ... 3.11rc1 (though pretty sure this reproduces on all versions)
Operating system and architecture: ubuntu 22.04, linux, x86_64

Linked PRs

gh-104195

encukou commented 2 years ago

There was a discussion about the larger issue of code being ”misleadingly different from what it looks like", and the general consensus was that this should be solved in editors, linters and code review tools, rather than in Python. IMO it was this thread but might need more digging. That doesn't mean handling of \0 can't be changed. Or of control characters in general: see the informational PEP 672 for some more details. Perhaps all unusual control characters should be banned? Even in strings? With a warning for 2 releases, according to the backwards compatibility policy – waf would need to switch to e.g. base64.

asottile commented 2 years ago

that was the rationale for making this public, but it's different than those as it's a mishandling by the parser rather than a quirk of how unicode displays

ypankovych commented 2 years ago

Ugh, this is freakin amazing.

C:\Users\user>more t.py

"""
How do i print 'Hello World' in Python?

Here is how:
print('Hello World')
"""

And run it:

C:\Users\user>py t.py
Hello World

open('t.py', 'w').write("\0'''\0How do i print \'Hello World\' in Python?\n\0Here is how:\nprint(\'Hello World\')\0'''")

gpshead commented 2 years ago

ideally I would want the interpreter to reject files containing null bytes as a SyntaxError (and update the ast.parse error to a SyntaxError as well)

FWIW I'd be in favor of that behavior change as a Feature. I wouldn't backport it as a bug though.

asottile commented 2 years ago

there's kinda 3 things I think:

convert ValueError => SyntaxError in ast.parse (probably 3.12+)
treat source files containing null bytes as SyntaxError (probably also 3.12+)
fix the thing considering it as a c string by using (char*, size_t) or whatever function (probably backportable?)

does it make sense to pursue these three things and separately?

gvanrossum commented 2 years ago

there's kinda 3 things I think:
1. convert `ValueError` => `SyntaxError` in `ast.parse` (probably 3.12+)

Agreed, it's changing behavior, so can't backport. I like SyntaxError for this situation. The root exception isn't in ast.parse(), it's in compile(), probably even deeper.

2. treat source files containing null bytes as `SyntaxError` (probably also 3.12+)

That's the biggest hole, should probably do this first.

3. fix the thing considering it as a c string by using `(char*, size_t)` or whatever function (probably backportable?)

Depends on where that fix is (can you tell I haven't looked at the source code yet? :-). If any of the affected functions are public it's going to be more difficult.

does it make sense to pursue these three things and separately?

Likely.

gpshead commented 2 years ago

2. treat source files containing null bytes as `SyntaxError` (probably also 3.12+)
That's the biggest hole, should probably do this first.

Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work.

gvanrossum commented 2 years ago

Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work.

Are you sure that works? Unlike Unix shells, Python parses the entire source file before executing any code. How would you get Python to ignore a blob of arbitrary binary data embedded in the source code, even if \0 is accepted? If the blob contains \n characters you can't hide it behind a # comment. I suppose you could prefix it with a """ quote, if you can arrange for the file to also end in """, and you're lucky that the blob doesn't contain embedded """ sequences.

But if I had to do something like that I'd probably just embed the Python code in a bash script as a "here" document and end the bash script with exit.

gpshead commented 2 years ago

Oh you're right, I guess what I've seen do that is a bash+python+data hybrid monster.

gvanrossum commented 2 years ago

Okay then we should be safe banning \0 in files starting with 3.12.

asottile commented 2 years ago

I suspect one could probably hack something together like this unfortunately

# coding: latin1
with open(__file__, 'rb') as f:
    contents = f.read().split(b'### BINARY\n')[1]
GARB = '''\
### BINARY
(actual binary here)
### BINARY
'''

gvanrossum commented 2 years ago

You'd still have to arrange for the actual binary not to contain the sequence ''' -- because that would end the string started at GARB =. I don't see how setting the coding to Latin-1 changes matters. (Honestly it seems you're making the same mistake as Greg.)

asottile commented 2 years ago

latin1 is to prevent a decoding error while parsing the source -- I linked an in-the-wild example of such a file in the original post

gvanrossum commented 2 years ago

Okay, that's impressive. It looks like the blob is lightly encoded -- \n and \r are encoded using #. and #&. (I'm curious what they'd do if the blob contains one of those sequences, there doesn't seem to be a way to quote them.)

The proposed change in 3.12 will break them, but they have version checks and they can just cope with it, I don't think we need to preserve this machine-dependent quirk forever. The code looks like it had to deal with various other versioning issues already (e.g. it tries to handle Python 2 and 3!).

vstinner commented 2 years ago

fix the thing considering it as a c string by using (char*, size_t) or whatever function (probably backportable?)

The root issue of this bug is that the Python parser is implemented in C which treats the NUL byte/character as the string terminator? So it's more a limitation of the current CPython implementation, but Python might support NUL bytes/characters later? Or do you think that it's always a bad practice to have NUL bytes/characters in a source file?

IMO it's always misleading to have NUL in a source file and it should be banned. I don't see any legit use case for that.

By the way, in Python, it's trivial there are many ways to create a byte, character, or string containing NUL:

bytes((0,)): byte
chr(0): character
b'[\x00]': bytes string
'[\x00]': Unicode string

gvanrossum commented 2 years ago

I agree we should always ban NUL in source files, like we already do when parsing from a string.

vstinner commented 2 years ago

Oh right, I didn't notice that!

>>> compile("x=1\x00", "string", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: source code string cannot contain null bytes

So yeah, I agree to always ban null characters/bytes in files as well.

apccurtiss commented 2 years ago

I suspect one could probably hack something together like this unfortunately
# coding: latin1
with open(__file__, 'rb') as f:
    contents = f.read().split(b'### BINARY\n')[1]
GARB = '''\
### BINARY
(actual binary here)
### BINARY
'''

This is actually a real (albeit very odd) use case for me! It can be used to demonstrate hash collision attacks. There are tools which will generate two blobs of binary data with the same MD5 hash, which can be used to create two Python files with the same hash, but different behaviors. This is a fairly common lab for intro security classes (example, other example). This is just another reason to disable this functionality IMO, but there could be other unexpected reasons why somebody may need to include arbitrary binary in a source file.

vstinner commented 2 years ago

It seems like the issue #97556 is a duplicate of this issue.

apccurtiss commented 2 years ago

It seems like the issue #97556 is a duplicate of this issue.

They're very similar, although I don't believe they're exact duplicates- #97556 deals with a particular parsing error in Python 3.10, which fatally fails to parse a string literal containing a NULL character. This issue deals with the larger problem of all source code being ignored after a null character: it applies even outside of string literals, although doing so is a syntax error, and in pre-Python 3.10 versions as well.

gvanrossum commented 2 years ago

Closing this. Starting with Python 3.12, NUL bytes won't be allowed in source code read from files. We're not backporting this since this could be considered a feature by some.

vstinner commented 2 years ago

Starting with Python 3.12, NUL bytes won't be allowed in source code read from files

It was changed by the commit aab01e3524d966dca6e72c718a2c71403a14e47c: PR #97594.

CarliJoy commented 2 years ago

Just for reference: The problem is already mentioned in https://peps.python.org/pep-0672/#control-characters

And Pylint checks for null characters already.

python / cpython

mishandling of c-strings in parser #96670

Bug report

Your environment

Linked PRs