python / cpython

The Python programming language
https://www.python.org

End-of-file does not always mean that the file can no longer be read from #103436

Open micha030201 opened 1 year ago

micha030201 commented 1 year ago

Background

An "end-of-file condition" or EOF in POSIX is what happens when a process makes a read() call on a file descriptor and gets a return value of 0. Files can sometimes continue after an end-of-file condition has been encountered.

asyncio's connect_read_pipe on Unix can take files of three types -- FIFO (or unnamed pipe[^S_ISFIFO]), socket, and character device. EOF means different things in each of these file types, so let's go through them one by one.

Socket

In the docs for read():

If fildes refers to a socket, read() shall be equivalent to recv() with no flags set.[^socket1]

In the docs for recv():

Upon successful completion, recv() shall return the length of the message in bytes. If no messages are available to be received and the peer has performed an orderly shutdown, recv() shall return 0. Otherwise, −1 shall be returned and errno set to indicate the error.[^socket2]

What that means, however, varies based on the socket type. There are 4 of them: SOCK_DGRAM, SOCK_RAW, SOCK_SEQPACKET, and SOCK_STREAM[^socket_types]. Let's go through them.

SOCK_SEQPACKET and SOCK_STREAM

These are connection-mode, so when whoever we've been talking to shuts the socket down, the connection is closed and no more data will be sent.

Well, almost -- from the Linux man pages:

The value 0 may also be returned if the requested number of bytes to receive from a stream socket was 0.[^man_recv]

But provided that we don't call read() with length 0, EOF means that the connection is closed.

SOCK_DGRAM and SOCK_RAW

These, however, are connectionless-mode, so they can't be shut down from the other side, and thus we would never encounter EOF.

With datagram sockets there is a small caveat. In the Linux manual page for recv:

Datagram sockets in various domains (e.g., the UNIX and Internet domains) permit zero-length datagrams. When such a datagram is received, the return value is 0.[^man_recv]

However, on Linux that is not a problem, because Linux deviates from POSIX in this instance:

If a zero-length datagram is pending, read(2) and recv() with a flags argument of zero provide different behavior. In this circumstance, read(2) has no effect (the datagram remains pending), while recv() consumes the pending datagram.[^man_recv]

So on Linux, we're still fine. On a POSIX-compatible system, we would confuse receiving a zero-length datagram with a connection being closed.

Raw socket datagrams always include the IP header, so their length can never be 0[^man_raw].

FIFO/pipe

In the docs for read():

When attempting to read from an empty pipe or FIFO:

  • If no process has the pipe open for writing, read() shall return 0 to indicate end-of-file.
  • If some process has the pipe open for writing and O_NONBLOCK is set, read() shall return −1 and set errno to [EAGAIN].
  • If some process has the pipe open for writing and O_NONBLOCK is clear, read() shall block the calling thread until some data is written or the pipe is closed by all processes that had the pipe open for writing.[^fifo]

What this says is that when a reader encounters EOF on a pipe, all writers have closed their FIFO or pipe file descriptors. The reader will keep encountering EOF with every call to read() for as long as there are no writers.

If it's an unnamed pipe, it means that its write end is lost forever, because it can't be referred to in any other way than its file descriptor. If it's a FIFO, however, a new process can open it for writing and our reader will stop encountering EOF and will be able to proceed as it had before.

Character device

In the docs for read():

No data transfer shall occur past the current end-of-file. If the starting position is at or after the end-of-file, 0 shall be returned. If the file refers to a device special file, the result of subsequent read() requests is implementation-defined.[^char_device]

That means that Linux or BSD or any other system gets to decide for itself what happens when a reader encounters EOF on a character device. I have not tested any other systems, but Linux seems to generate an EOF exactly once, and then, if the device is still operational, all subsequent reads proceed as normal.

Application

One of the consequences of that is that programs running in a terminal can keep reading from stdin and generally doing anything they want after the user presses Ctrl+D.

Some programs take advantage of that; one example is GDB -- if you press Ctrl+D while debugging a remote inferior, it will ask for confirmation and warn you that the inferior will be detached. If you change your mind and choose no, it will continue working as it was before.

It can do that without any issue because sending an EOF on a terminal is just a way to communicate with the process reading from it, it doesn't close the file or prevent further communication.

asyncio behavior

When a _UnixReadPipeTransport in asyncio encounters EOF, the following happens:

https://github.com/python/cpython/blob/aa874326d86734606a1930e7f48e2ee204cae0b6/Lib/asyncio/unix_events.py#L527-L532

It stops reading from that file (unregisters it from the selector) and calls connection_lost on the protocol.

It also destroys itself and closes the file:

https://github.com/python/cpython/blob/aa874326d86734606a1930e7f48e2ee204cae0b6/Lib/asyncio/unix_events.py#L587-L594

This behavior is correct when we're dealing with stream sockets (or technically datagram sockets, but in that case this code path would simply never execute) or unnamed pipes, but it's incorrect if what we're reading from is anything else -- a FIFO or a character device.

Consequences

It means that the kind of thing GDB does is not possible with asyncio's connect_read_pipe. Here's an example of what doesn't work:

import sys
import asyncio

async def read_stdin():
    loop = asyncio.get_running_loop()

    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)

    async for line in reader:
        print('read:', line.decode())

    print('EOF')

    # Try to read again...
    async for line in reader:
        print('read:', line.decode())
    # ...we can't.

asyncio.run(read_stdin())

Additionally, stdin is now closed, so calling connect_read_pipe again will not work either.

The only ways to work around that in asyncio would be either to monkeypatch _UnixReadPipeTransport or to use os.dup. I suppose one could also wrap the pipe file object fed to connect_read_pipe to return a sentinel instead of an empty bytes object on EOF, and also modify the protocol to interpret that sentinel in the correct way. None of these options are particularly pleasant.

The same thing works without issue with the standard Python file API:

import sys

for line in sys.stdin:
    print('read:', line.strip())

print('EOF')

# Try to read again...
for line in sys.stdin:
    print('read:', line.strip())
# ...yay!

Possible solution

I don't know how Python developers feel about changing the asyncio API in ways that could subtly break programs, but I hope that the argument for correctness and the fact that some fairly important use cases are currently impossible are strong enough to warrant such a change.

Ideally, _UnixReadPipeTransport would be changed not to call connection_lost on EOF, instead calling it only on exception. The task of interpreting EOF would then be placed upon protocols. The list of protocols already includes specific protocols for datagrams, streams, and unnamed pipes, so it's not a huge stretch to also have specific protocols for FIFOs and character devices. Their behavior differs from all the protocols already there, and I would wager that reading from a terminal is a pretty common use case.

Alternatively, connect_read_pipe could return different transports for different file types, but that would require it to distinguish between unnamed pipes and FIFOs, which I'm not sure is possible.

Additionally, StreamReader does not work correctly when reading from a named pipe or a character device, as it assumes that EOF is a permanent state -- which is pretty reasonable: it is a stream reader, and that assumption holds for stream sockets; the issue is that a character device or pipe is not one. It could be modified to accommodate those file types, should the responsibility for indicating that the connection is closed when a stream socket receives an EOF be placed on the Protocol. That would make "stream" in this instance an abstraction specific to asyncio rather than one that refers to a POSIX stream socket. Otherwise, other types of readers (and probably writers, which I haven't even touched on) would have to be created. I admit that CharacterDeviceReader does not exactly roll off the tongue.


I can't really think of less drastic ways to fix this, but perhaps there are some that I'm missing. Anyway, thank you for reading this! This is the first issue I'm submitting to the Python repository, and I've tried to make it as detailed as possible.

[^S_ISFIFO]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- line 13390, page 393

[^socket1]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- line 57251, page 1772

[^socket2]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- lines 58076-58078, page 1793

[^socket_types]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- section 2.10.6, lines 18376-18377, page 524

[^man_recv]: recv(2), Linux man-pages 6.04

[^man_raw]: raw(7), Linux man-pages 6.04

[^fifo]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- lines 57221-57227, page 1771

[^char_device]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- lines 57217, page 1771

sunmy2019 commented 1 year ago

1

In the docs it says that once EOF is received, no more data_received will be called, and connection_lost will be called on EOF. The transport itself decides whether or not to close the connection.

These behaviors are clearly documented, widely used, and also implemented in other event-loop implementations. Thus, it is impractical to change them. I am also intuitively against changing that.

2

In addition, I think the correct solution here is to call connect_read_pipe again with another stream reader. For example, with uvloop installed, you can write something like this. https://github.com/MagicStack/uvloop

import sys
import asyncio
import uvloop

async def read_stdin():
    loop = asyncio.get_running_loop()

    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)

    async for line in reader:
        print("read:", line.decode())

    print("EOF")

    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    async for line in reader:
        print("read:", line.decode())

uvloop.install()
asyncio.run(read_stdin())

This is exactly what you expected, since uvloop does not close the underlying file on EOF.

3

I think the main problem is that _UnixReadPipeTransport holds ownership of the pipe. As mentioned in the comment here, it seems like a design choice from 10 years ago.

https://github.com/python/cpython/blob/aa874326d86734606a1930e7f48e2ee204cae0b6/Lib/asyncio/events.py#L487-L498

I do not know why it is designed this way. (But removing all the self._pipe.close() calls does make the above code work 🤣)

gvanrossum commented 1 year ago

I am overwhelmed by the amount of detail in this bug report. Is there a specific use case you actively want? I am not interested in changing the code, possibly breaking existing uses, for some theoretical correctness reason.

micha030201 commented 1 year ago

I am overwhelmed by the amount of detail in this bug report. Is there a specific use case you actively want?

Apologies, I may have overdone it a little with the detail.

Yes, the use case in question is:

In the docs it says that once EOF is received, no more data_received will be called, and connection_lost will be called on EOF. The transport itself decides whether or not to close the connection.

These behaviors are clearly documented, widely used, and also implemented in other event-loop implementations. Thus, it is impractical to change them. I am also intuitively against changing that.

Thank you for pointing me to this. I can better see now that this is an intentional design choice, however it is still quite limiting, as I hope I've illustrated.

Datagram sockets can't be closed from the other side and can't encounter EOF; FIFOs can't be closed from the other side but can encounter EOF. One can be accommodated cleanly with this design; the other cannot.

In addition, I think the correct solution here is to call connect_read_pipe again with another stream reader.

That's the first workaround I attempted, which is when I figured out that the transport actually closes the file.

I don't really find it intuitive -- the file I'm reading from is the same, and EOF is just in-band data for a terminal; why should I have to keep manually re-adding it to the selection every time it happens?

I suppose this is a matter of abstraction incompatibility -- asyncio protocols and transports look like they want to map neatly onto Unix concepts (streams, datagrams, pipes, etc.) but they don't.

I think the main problem is that, _UnixReadPipeTransport holds the ownership of the pipe. As mentioned in the comment here, it seems like a design choice from 10 years ago.

[...]

I do not know why it is designed this way. (But removing all self._pipe.close() does make the above code working 🤣)

This is indeed the main cause of pain in this issue and the core thing that should be addressed. The rest, I suppose, could be written off as conceptual differences between asyncio and Unix, which I still feel are unfortunate, but maybe that's just me.

It's interesting that uvloop doesn't have this issue but _UnixSelectorEventLoop does. I wonder if it leads to them having to manually close IPC unnamed pipes in other places.

gvanrossum commented 1 year ago

I would say that the terminal convention around ^D is the odd one out. Nothing else can receive data once EOF has been signaled -- with a socket or pipe the EOF is final. The StreamReader and pipe handling are designed around that (IMO more sensible) model. Maybe you can write your own transport for handling your use case?

micha030201 commented 1 year ago

I would say that the terminal convention around ^D is the odd one out. Nothing else can receive data once EOF has been signaled -- with a socket or pipe the EOF is final.

Not if it's a named pipe (aka FIFO), the behavior is different again for those. Much of my initial report is specifically about detailing those differences.

_UnixReadPipeTransport explicitly accepts character devices (which terminals are) as valid input, so I do consider it a bug that it doesn't work with them correctly.

I definitely could write my own transport for it, but I find it weird that I am supposed to go to those lengths in order to read from a terminal.

I encountered this issue while debugging the debug command (funnily enough) in platformio -- they use asyncio to build an asynchronous proxy between the user and GDB, which asyncio seems well suited for, and clearly the developers of platformio thought the same. It's odd that this is actually impossible to do with asyncio without delving into its innards and writing a custom transport.

gvanrossum commented 1 year ago

The Transports + Protocols architecture was copied from Twisted. I'd like to know what Twisted does in this case.

I'd also like to receive less of an attitude of entitlement. There are exactly two remaining asyncio maintainers left, with very limited time and many other responsibilities; we have to prioritize.

micha030201 commented 1 year ago

I'd also like to receive less of an attitude of entitlement. There are exactly two remaining asyncio maintainers left, with very limited time and many other responsibilities; we have to prioritize.

I am sorry that I came across as entitled; that is very much not what I want to be. Thank you for pointing this out to me, and I will try to check my tone better going forward.

I don't call any shots here, and I neither influence, nor really know, what the priorities of the asyncio maintainers are. I have found a feature in asyncio that is easy to misuse, is misused in some relatively big projects, and requires a workaround to use correctly, and I wanted to report it so that asyncio could handle it in a way that better accommodates this use case. I understand that I will have to write a workaround in my code for the current behavior either way, because even if asyncio were changed to accommodate this use case tomorrow, users can't be expected to immediately update to the latest version of Python.

What I ultimately want is to be helpful, and I apologize if I failed at that.

The Transports + Protocols architecture was copied from Twisted. I'd like to know what Twisted does in this case.

Similarly to asyncio, Twisted reports CONNECTION_DONE when it encounters EOF, calls connectionLost on the protocol, and doesn't call dataReceived on it anymore. It doesn't, however, close the file it was reading from.

I also checked how it's done in Trio, and they don't special-case EOF when reading from a pipe, and the docs make it clear that this API is intended to be used with TTYs.

gvanrossum commented 1 year ago

Sure, thanks for understanding.

Similarly to asyncio, Twisted reports CONNECTION_DONE when it encounters EOF, calls connectionLost on the protocol, and doesn't call dataReceived on it anymore. It doesn't, however, close the file it was reading from.

The function you link to just reads from the fd and calls the callback if any data was read. It looks like this convention is actually at the heart of the problem -- when no data is read, the data callback is not called, and instead some other protocol callback is called that indicates EOF. I guess there's a grammar or state diagram for which protocol methods can be called in which order, and it's something like `dataReceived* connectionLost`. What you are looking for would correspond to `(dataReceived | connectionLost)*`, but that leaves it unclear to the protocol when it is really done. Calling the data callback with an empty argument would break all the (Twisted) code in the world, so that also wouldn't be a good idea.

I also checked how it's done in Trio, and they don't special-case EOF when reading from a pipe, and the docs make it clear that this API is intended to be used with TTYs.

But Trio doesn't use the Transport/Protocol abstractions, does it? Its Stream abstraction seems closer to a Unix file descriptor than to a Twisted Protocol/Transport pair.

Anyway, I'm not sure how we could fix this in asyncio without breaking existing, working code. When reading from an actual pipe (the intended use case for connect_read_pipe), when an empty byte string is received, something must signal the protocol that the other end of the pipe was closed (which is the inevitable conclusion when a pipe returns empty data). The API we have for that is to call eof_received followed by connection_lost. The convention for asyncio protocols is that when both of these are called, it's a "normal" EOF, whereas if the connection was closed prematurely, only connection_lost is called.

As with Twisted, I don't see how we could make it do what you want -- we can't call data_received(b"") on EOF, since that would break existing protocols, and we can't call data_received once eof_received has been called. The transport unequivocally owns the file descriptor; there's no way to ask it to give it back or anything like that.

In your application, perhaps you could work around it by using os.dup() to create a fresh fd to pass to the transport constructor, and when you receive EOF, repeat this process. You could even write a protocol that preserves state and can be reused (its eof_received and connection_lost callbacks should not change any state). But that seems the best we can do.

Unless you have a brilliant fix in mind that you were waiting to pull out of your sleeve when all other options are exhausted? :-)

GalaxySnail commented 1 year ago

If it's an unnamed pipe, it means that its write end is lost forever, because it can't be referred to in any other way than its file descriptor.

FWIW there is an interesting feature on Linux: you can reopen a pipe from procfs, for example:

>>> import os
>>> r, w = os.pipe()
>>> os.close(w)
>>> os.read(r, 16)
b''
>>> w = os.open(f"/proc/self/fd/{r}", os.O_WRONLY)
>>> os.write(w, b"hello")
5
>>> os.read(r, 16)
b'hello'

BTW, you can open a pipe even with O_RDWR, and a single pipe fd will be returned for both read and write.

micha030201 commented 1 year ago

FWIW there is an interesting feature on Linux: you can reopen a pipe from procfs, for example:

Interesting, I didn't know that. Thank you!

The function you link to just reads from the fd and calls the callback if any data was read. It looks like this convention is actually at the heart of the problem -- when no data is read, the data callback is not called, and instead some other protocol callback is called that indicates EOF.

Yes, that is exactly what happens -- the reactor sees the truthy return value, removes the file descriptor from selection, and calls connectionLost on the protocol. Much the same as asyncio does it, except it's the reactor doing it rather than a transport.

Anyway, I'm not sure how we could fix this in asyncio without breaking existing, working code. When reading from an actual pipe (the intended use case for connect_read_pipe), when an empty byte string is received, something must signal the protocol that the other end of the pipe was closed (which is the inevitable conclusion when a pipe returns empty data). The API we have for that is to call eof_received followed by connection_lost. The convention for asyncio protocols is that when both of these are called, it's a "normal" EOF, whereas if the connection was closed prematurely, only connection_lost is called.

This is very true. While I originally thought eof_received could be decoupled from connection_lost, leaving the protocol to figure out what to do in which case, I can see now that this would break a guarantee that is currently provided by the transport API.

As with Twisted, I don't see how we could make it do what you want -- we can't call data_received(b"") on EOF, since that would break existing protocols, and we can't call data_received once eof_received has been called. The Transport unquivocally owns the file descriptor, there's no way to ask for it to give it back or anything like this.

I don't know how crucial it is to the design that the transport owns the file descriptor (uvloop doesn't seem to do that, as sunmy2019 mentioned earlier in the thread), but honestly even if it didn't the API is very much suited for reading specifically from an unnamed pipe and not a terminal.

In your application, perhaps you could work around it by using os.dup() to create a fresh fd to pass to the transport construction, and when you receive EOF, repeat this process. You could even write a protocol that preserves state and can be reused (it's eof_received and connection_lost callbacks should not change any state). But that seems the best we can do.

I'll have to think about how best to do that; it's one of the options I've been considering. I thought that writing my own Stream-like class that uses the API for watching file descriptors directly could be cleaner, though it can't exactly conform to the asyncio StreamReader and StreamWriter API, as that API also assumes that an EOF is final.

Unless you have a brilliant fix in mind that you were waiting to pull out of your sleeve when all other options are exhausted? :-)

I suppose the only fix would be to add a new API that is specifically designed for reading from and writing to terminals, though I understand that this is a lot of design and implementation work :sweat_smile:

If you are not inclined to do that, which I would very much understand, that kind of leaves asyncio with an API that can be used with terminals but is misleadingly incomplete for them. Perhaps a note could be added to the documentation for connect_read_pipe that reading from a terminal is not its intended usage, because currently people seem to be using it for that quite a lot.

gvanrossum commented 1 year ago

If you are not inclined to do that, which I would very much understand, that kind of leaves asyncio with an API that can be used for terminals, but is misleadingly not quite complete. Perhaps a note could be added to the documentation for connect_read_pipe that using it for reading from a terminal is not the intended usage, because currently people seem to be using it for that quite a lot.

But it's totally fine unless you want to keep reading after receiving EOF, right? At best we should warn in the docs about that.

And indeed I don't want asyncio to grow another API for this corner case. It really belongs in a 3rd party package. IMO we already provide too many APIs (fortunately, at least HTTP support is 3rd party, but unfortunately TLS support is baked in).

micha030201 commented 1 year ago

But it's totally fine unless you want to keep reading after receiving EOF, right? At best we should warn in the docs about that.

That is correct. Is that documentation addition something that you would like me to do as a PR?

gvanrossum commented 1 year ago

@micha030201 Sure, send a PR and I'll review it.

willingc commented 1 year ago

@micha030201 Please do send in a docs PR. That seems the best way to move forward with this issue. Thanks!