python / cpython

The Python programming language
https://www.python.org
Other
62.37k stars 29.95k forks source link

Should compression file-like objects provide .fileno(), misleading subprocess? #68546

Open 99ffcaa5-b43b-4e8e-a35e-9c890007b9cd opened 9 years ago

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago
BPO 24358
Nosy @pitrou, @benjaminp, @jonashaag, @vadmium, @MojoVampire

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['expert-IO'] title = 'Should compression file-like objects provide .fileno(), misleading subprocess?' updated_at = user = 'https://github.com/MojoVampire' ``` bugs.python.org fields: ```python activity = actor = 'martin.panter' assignee = 'none' closed = False closed_date = None closer = None components = ['IO'] creation = creator = 'josh.r' dependencies = [] files = [] hgrepos = [] issue_num = 24358 keywords = [] message_count = 7.0 messages = ['244626', '244632', '244652', '244663', '265455', '265464', '265523'] nosy_count = 6.0 nosy_names = ['pitrou', 'benjamin.peterson', 'stutzbach', 'jonash', 'martin.panter', 'josh.r'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue24358' versions = ['Python 2.7', 'Python 3.4', 'Python 3.5', 'Python 3.6'] ```

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

subprocess, when accepting objects for stdin, stdout, and stderr, assumes that possessing a .fileno() means it's a legitimate object for use with the forked process; that the file descriptor is interchangeable with the object itself. But gzip, bz2 and lzma file-like objects all violate this rule; they provide .fileno(), but it's unadorned. Providing .fileno() on these objects is misleading, since they produce the uncompressed data (likely useless) which causes subprocess to pass the "wrong" data to the subprocess, or write uncompressed data from the process (the exception being processes that expect compressed data from stdin or write compressed data to stdout, but that usually just means the compressor utilities themselves).

Is subprocess's assumption about fileno() (that you can read equivalent data from it, modulo issues with flushing/seeking) intended? If so, should .fileno() be removed from the compressed file interfaces? If not, should subprocess attempt to perform further checking, document this wart, or something else?

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

Apparently similar issue occurs when tarfile assumes a GzipFile can have its fileno() fstat-ed (see bpo-22468). An awful lot of libraries seem to assume that fileno() will provide useful information about the data you'd read from the file-like object itself, but all the compressed file-like objects violate that expectation.

vadmium commented 9 years ago

Also related: bpo-23740, where the HTTP client assumes it can use stat() on the fileno() to determine the Content-Length.

Providing fileno() on file wrapper objects like GzipFile is certainly not necessary, but it could be useful. For instance in the tarfile case, the modification time, file mode, owner user, etc may still be useful even if the file size isn’t.

On the other hand, fileno() is a low level operation, so maybe it should only have been made available on light-weight RawIOBase objects or something. Even for a BufferedReader/Writer or TextIOWrapper, the data read from or written to the Python-level file object does not match the corresponding file descriptor operations. You could get deadlocks, data loss, etc, due to buffering. For example the commented-out-code near the bottom of \https://bugs.python.org/review/1191964/patch/11993/43982#newcode1901\.

The subprocess module documentation only says that the streams can be “existing file objects”. I think it should at least be clarified to say the file objects are taken to represent OS-level file descriptors.

99ffcaa5-b43b-4e8e-a35e-9c890007b9cd commented 9 years ago

Blech, typo earlier "since they produce the *compressed* data (likely useless) when read as subprocess stdin". Context should make it obvious, but trying to be clear.

ef46bb13-8e88-4488-a20f-75e542f6f274 commented 8 years ago

I just hit this too. I'd say remove the fileno() method from wrapper objects like GzipFile. I'm happy to submit a patch.

vadmium commented 8 years ago

Where would you draw the line though? At one extreme, BufferedWriter is a wrapper, but if you removed BufferedWriter.fileno() it would break all the code that does

with open(os.devnull, "wb") as null:
    proc = subprocess.Popen(..., stdout=null)

Would you remove it from HTTPResponse? Some HTTP responses can be chunked, so the child process would see the chunk headers using fileno(). But other HTTP responses are more direct and would work smoothly with the subprocess module.

Considering the compatibility problems and other possible uses of fileno(), I suspect removing it would be a bad idea. “Throwing the baby out with the bathwater” comes to mind. A less drastic change would be to require an explicit fileno() call and only passing the file descriptor to subprocess.

There is already a bug open for improving the documentation: bpo-19992

vadmium commented 8 years ago

Also stumbled upon bpo-1705393: confusion with select() and buffered files