micha030201 opened 1 year ago
In the doc it says, once EOF is received, no more data_received
will be called, and connection_lost
will be called on EOF. Transport itself will decide whether to close the connection or not.
These are clearly documented, widely used, and also implemented in other event-loop implementations. Thus, it is impractical to change. I am also intuitively against changing that.
In addition, I think the correct solution here is to call connect_read_pipe
again with another stream reader. For example, with uvloop
installed, you can write something like this. https://github.com/MagicStack/uvloop
import sys
import asyncio
import uvloop

async def read_stdin():
    loop = asyncio.get_running_loop()

    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    async for line in reader:
        print("read:", line.decode())
    print("EOF")

    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    async for line in reader:
        print("read:", line.decode())

uvloop.install()
asyncio.run(read_stdin())
This is exactly what you expected, since uvloop will not close the underlying file when it encounters EOF.
I think the main problem is that _UnixReadPipeTransport holds ownership of the pipe. As mentioned in the comment here, it seems like a design choice from 10 years ago.
I do not know why it is designed this way. (But removing all self._pipe.close() calls does make the above code work 🤣)
I am overwhelmed by the amount of detail in this bug report. Is there a specific use case you actively want? I am not interested in changing the code, possibly breaking existing uses, for some theoretical correctness reason.
I am overwhelmed by the amount of detail in this bug report. Is there a specific use case you actively want?
Apologies, I may have overdone it a little with the detail.
Yes, the use case in question is: reading from stdin, which is connected to a terminal, and continuing to read from stdin after the user sends EOF (Ctrl+D).
In the doc it says, once EOF is received, no more data_received will be called, and connection_lost will be called on EOF. Transport itself will decide whether to close the connection or not.
These are clearly documented, widely used, and also implemented in other event-loop implementations. Thus, it is impractical to change. I am also intuitively against changing that.
Thank you for pointing me to this. I can better see now that this is an intentional design choice, however it is still quite limiting, as I hope I've illustrated.
Datagram sockets can't be closed from the other side and can't encounter EOF; FIFOs can't be closed from the other side but can encounter EOF. One can be accommodated cleanly with this design; the other cannot.
In addition, I think the correct solution here is to call
connect_read_pipe
again with another stream reader.
That's the first workaround I attempted, which is when I figured out that the transport actually closes the file.
I don't really find it intuitive -- the file I'm reading from is the same, EOF is just in-band data for a terminal, why should I keep manually re-adding it to the selection every time it happens?
I suppose this is a matter of abstraction incompatibility -- asyncio protocols and transports look like they want to map neatly onto Unix concepts (streams, datagrams, pipes, etc.) but they don't.
I think the main problem is that _UnixReadPipeTransport holds ownership of the pipe. As mentioned in the comment here, it seems like a design choice from 10 years ago. [...]
I do not know why it is designed this way. (But removing all self._pipe.close() calls does make the above code work 🤣)
This is indeed the main cause of pain in this issue and the core thing that should be addressed. The rest, I suppose, could be written off as conceptual differences between asyncio and Unix, which I still feel are unfortunate, but maybe that's just me.
It's interesting that uvloop doesn't have this issue but _UnixSelectorEventLoop
does. I wonder if it leads to them having to manually close IPC unnamed pipes in other places.
I would say that the terminal convention around ^D is the odd one out. Nothing else can receive data once EOF has been signaled -- with a socket or pipe the EOF is final. The StreamReader and pipe handling are designed around that (IMO more sensible) model. Maybe you can write your own transport for handling your use case?
I would say that the terminal convention around ^D is the odd one out. Nothing else can receive data once EOF has been signaled -- with a socket or pipe the EOF is final.
Not if it's a named pipe (aka FIFO), the behavior is different again for those. Much of my initial report is specifically about detailing those differences.
_UnixReadPipeTransport
explicitly accepts character devices (which terminals are) as valid input, so I do consider it a bug that it doesn't work with them correctly.
I definitely could write my own transport for it, but I find it weird that I am supposed to go to those lengths in order to read from a terminal.
I encountered this issue while debugging the debug
command (funnily enough) in platformio -- they use asyncio to build an asynchronous proxy between the user and GDB, which asyncio seems well suited for, and clearly the developers of platformio thought the same. It's odd that it's actually impossible to do with asyncio without delving into the innards of it and writing a custom transport.
The Transports + Protocols architecture was copied from Twisted. I'd like to know what Twisted does in this case.
I'd also like to receive less of an attitude of entitlement. There are exactly two remaining asyncio maintainers left, with very limited time and many other responsibilities; we have to prioritize.
I'd also like to receive less of an attitude of entitlement. There are exactly two remaining asyncio maintainers left, with very limited time and many other responsibilities; we have to prioritize.
I am sorry that I came across as entitled, this is very much not what I want to be. Thank you for pointing this out to me, I will try to check my tone better going forwards.
I don't call any shots here and I neither influence, nor really know, what the priorities of the asyncio maintainers are. I have found a feature in asyncio that is easy to misuse, is misused in some relatively big projects, and requires a workaround to use correctly, and I wanted to report it so that asyncio could handle it in a way that better accommodates this use case. I understand that I will have to write a workaround in my code for the current behavior either way, because even if asyncio was changed to accommodate this use case tomorrow, users can't be expected to immediately update to the latest version of Python.
What I ultimately want is to be helpful, and I apologize if I failed at that.
The Transports + Protocols architecture was copied from Twisted. I'd like to know what Twisted does in this case.
Similarly to asyncio, Twisted reports CONNECTION_DONE
when it encounters EOF, calls connectionLost
on the protocol, and doesn't call dataReceived
on it anymore. It doesn't, however, close the file it was reading from.
I also checked how it's done in Trio, and they don't special-case EOF when reading from a pipe, and the docs make it clear that this API is intended to be used with TTYs.
Sure, thanks for understanding.
Similarly to asyncio, Twisted reports
CONNECTION_DONE
when it encounters EOF, calls connectionLost on the protocol, and doesn't call dataReceived
on it anymore. It doesn't, however, close the file it was reading from.
The function you link to just reads from the fd and calls the callback if any data was read. It looks like this convention is actually at the heart of the problem -- when no data is read, the data callback is not called, and instead some other protocol callback is called that indicates EOF. I guess there's a grammar or state diagram for which protocol methods can be called in which order, and it's something like "dataReceived* connectionLost". What you are looking for would correspond to "(dataReceived | connectionLost)*", but that leaves it unclear to the protocol when it is really done. Calling the callback with an empty argument would break all the (Twisted) code in the world, so it also wouldn't be a good idea.
I also checked how it's done in Trio, and they don't special-case EOF when reading from a pipe, and the docs make it clear that this API is intended to be used with TTYs.
But Trio doesn't use the Transport/Protocol abstractions, does it? Its Stream
abstraction seems closer to a Unix file descriptor than to a Twisted Protocol/Transport pair.
Anyway, I'm not sure how we could fix this in asyncio without breaking existing, working code. When reading from an actual pipe (the intended use case for connect_read_pipe
), when an empty byte string is received, something must signal the protocol that the other end of the pipe was closed (which is the inevitable conclusion when a pipe returns empty data). The API we have for that is to call eof_received
followed by connection_lost
. The convention for asyncio protocols is that when both of these are called, it's a "normal" EOF, whereas if the connection was closed prematurely, only connection_lost
is called.
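For reference, a bare-bones protocol sketch (LoggingProtocol is a made-up name) that just prints that callback order when attached to a read transport:

import asyncio

class LoggingProtocol(asyncio.Protocol):
    # Normal lifetime on a clean EOF: connection_made, zero or more
    # data_received calls, then eof_received followed by connection_lost.
    def connection_made(self, transport):
        print("connection_made")

    def data_received(self, data):
        print("data_received:", data)

    def eof_received(self):
        print("eof_received")
        # Returning a false value lets the transport close itself, which is
        # what triggers the final connection_lost call.

    def connection_lost(self, exc):
        print("connection_lost:", exc)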
As with Twisted, I don't see how we could make it do what you want -- we can't call data_received(b"")
on EOF, since that would break existing protocols, and we can't call data_received
once eof_received
has been called. The Transport unequivocally owns the file descriptor; there's no way to ask it to give it back or anything like that.
In your application, perhaps you could work around it by using os.dup()
to create a fresh fd to pass to the transport construction, and when you receive EOF, repeat this process. You could even write a protocol that preserves state and can be reused (its eof_received
and connection_lost
callbacks should not change any state). But that seems the best we can do.
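A rough sketch of that workaround (read_stdin_forever is a made-up name; the idea is that the transport only ever gets a dup()ed copy of the stdin descriptor, so closing it on EOF leaves the real stdin usable for the next round):

import os
import sys
import asyncio

async def read_stdin_forever():
    loop = asyncio.get_running_loop()
    while True:
        # Hand the transport a duplicate fd; on EOF it closes only this copy.
        pipe = os.fdopen(os.dup(sys.stdin.fileno()), "rb", buffering=0)
        reader = asyncio.StreamReader()
        protocol = asyncio.StreamReaderProtocol(reader)
        await loop.connect_read_pipe(lambda: protocol, pipe)
        async for line in reader:
            print("read:", line.decode())
        print("EOF, re-arming")

asyncio.run(read_stdin_forever())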
Unless you have a brilliant fix in mind that you were waiting to pull out of your sleeve when all other options are exhausted? :-)
If it's an unnamed pipe, it means that its write end is lost forever, because it can't be referred to in any other way than its file descriptor.
FWIW there is an interesting feature on Linux: you can reopen a pipe from procfs, for example:
>>> import os
>>> r, w = os.pipe()
>>> os.close(w)
>>> os.read(r, 16)
b''
>>> w = os.open(f"/proc/self/fd/{r}", os.O_WRONLY)
>>> os.write(w, b"hello")
5
>>> os.read(r, 16)
b'hello'
BTW, you can open a pipe even with O_RDWR
, and a single pipe fd will be returned for both read and write.
FWIW there is an interesting feature on Linux: you can reopen a pipe from procfs, for example:
Interesting, I didn't know that. Thank you!
The function you link to just reads from the fd and calls the callback if any data was read. It looks like this convention is actually at the heart of the problem -- when no data is read, the data callback is not called, and instead some other protocol callback is called that indicates EOF.
Yes, that is exactly what happens -- the reactor sees the truthy return value and removes the file descriptor from selection, calling connectionLost
on it. Much the same as asyncio does it, except it's not a transport doing it but the reactor.
Anyway, I'm not sure how we could fix this in asyncio without breaking existing, working code. When reading from an actual pipe (the intended use case for
connect_read_pipe
), when an empty byte string is received, something must signal the protocol that the other end of the pipe was closed (which is the inevitable conclusion when a pipe returns empty data). The API we have for that is to call eof_received followed by connection_lost. The convention for asyncio protocols is that when both of these are called, it's a "normal" EOF, whereas if the connection was closed prematurely, only connection_lost
is called.
This is very true. While I originally thought eof_received
could be decoupled from connection_lost
, leaving the protocol to figure out what to do in which case, I can see now that this would break a guarantee that is currently provided by the transport API.
As with Twisted, I don't see how we could make it do what you want -- we can't call
data_received(b"")
on EOF, since that would break existing protocols, and we can't call data_received once eof_received has been called. The Transport unequivocally owns the file descriptor; there's no way to ask it to give it back or anything like that.
I don't know how crucial it is to the design that the transport owns the file descriptor (uvloop doesn't seem to do that, as sunmy2019 mentioned earlier in the thread), but honestly even if it didn't the API is very much suited for reading specifically from an unnamed pipe and not a terminal.
In your application, perhaps you could work around it by using
os.dup()
to create a fresh fd to pass to the transport construction, and when you receive EOF, repeat this process. You could even write a protocol that preserves state and can be reused (its eof_received and connection_lost
callbacks should not change any state). But that seems the best we can do.
I'll have to think how best to do that, it's one of the options I've been considering. I thought that writing my own Stream-like class that uses the API for watching file descriptors directly could be cleaner, though it can't exactly conform to the asyncio StreamReader and StreamWriter API, as it also assumes that an EOF is final.
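For instance, a bare-bones version of that idea (TerminalReader is a made-up name) using loop.add_reader directly, where EOF is just another item in a queue rather than a terminal state:

import os
import sys
import asyncio

class TerminalReader:
    # Minimal sketch: watch an fd with the event loop and keep it registered
    # across EOFs. This assumes a terminal; on a pipe whose writers are gone
    # it would spin, repeatedly reporting b"".
    def __init__(self, fd):
        self._fd = fd
        self._queue = asyncio.Queue()
        asyncio.get_running_loop().add_reader(fd, self._on_readable)

    def _on_readable(self):
        # b"" means EOF (e.g. Ctrl+D), but we do not unregister the fd.
        self._queue.put_nowait(os.read(self._fd, 65536))

    async def read(self):
        return await self._queue.get()

async def main():
    reader = TerminalReader(sys.stdin.fileno())
    while True:
        chunk = await reader.read()
        print("EOF" if chunk == b"" else f"read: {chunk!r}")

asyncio.run(main())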
Unless you have a brilliant fix in mind that you were waiting to pull out of your sleeve when all other options are exhausted? :-)
I suppose the only fix would be to add a new API that is specifically designed for reading from and writing to terminals, though I understand that this is a lot of design and implementation work :sweat_smile:
If you are not inclined to do that, which I would very much understand, that kind of leaves asyncio with an API that can be used for terminals, but is misleadingly not quite complete. Perhaps a note could be added to the documentation for connect_read_pipe
that using it for reading from a terminal is not the intended usage, because currently people seem to be using it for that quite a lot.
If you are not inclined to do that, which I would very much understand, that kind of leaves asyncio with an API that can be used for terminals, but is misleadingly not quite complete. Perhaps a note could be added to the documentation for
connect_read_pipe
that using it for reading from a terminal is not the intended usage, because currently people seem to be using it for that quite a lot.
But it's totally fine unless you want to keep reading after receiving EOF, right? At best we should warn in the docs about that.
And indeed I don't want asyncio to grow another API for this corner case. It really belongs in a 3rd party package. IMO we already provide too many APIs (fortunately, at least HTTP support is 3rd party, but unfortunately TLS support is baked in).
But it's totally fine unless you want to keep reading after receiving EOF, right? At best we should warn in the docs about that.
That is correct. Is that documentation addition something that you would like me to do as a PR?
@micha030201 Sure, send a PR and I'll review it.
@micha030201 Please do send in a docs PR. That seems the best way to move forward with this issue. Thanks!
Background
An "end-of-file condition" or EOF in POSIX is what happens when a process makes a read() call on a file descriptor and gets a return value of 0. Files can sometimes continue after an end-of-file condition has been encountered.
asyncio's connect_read_pipe on Unix can take files of three types -- FIFO (or unnamed pipe[^S_ISFIFO]), socket, and character device. EOF means different things in each of these file types, so let's go through them one by one.
Socket
In the docs for read():
In the docs for recv():
What that means, however, varies based on the socket type. There are 4 of them: SOCK_DGRAM, SOCK_RAW, SOCK_SEQPACKET, and SOCK_STREAM[^socket_types]. Let's go through them.
SOCK_SEQPACKET and SOCK_STREAM
These are connection-mode, so when whoever we've been talking to shuts the socket down, the connection is closed and no more data will be sent.
Well, almost -- from the Linux man pages:
But provided that we don't call read() with length 0, EOF means that the connection is closed.
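For example, a quick sketch with a Unix socket pair: once the peer closes, recv keeps returning an empty bytes object and no further data can ever arrive.

import socket

a, b = socket.socketpair()   # AF_UNIX, SOCK_STREAM by default
a.sendall(b"bye")
a.close()                    # orderly shutdown from the peer
print(b.recv(16))            # b'bye'
print(b.recv(16))            # b'' -- EOF: the connection is closed
print(b.recv(16))            # b'' -- and it stays that way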
SOCK_DGRAM and SOCK_RAW
These, however, are connectionless-mode, so they can't be shut down from the other side, and thus we would never encounter EOF.
With datagram sockets there is a small caveat. In the Linux manual page for recv:
However, on Linux that is not a problem, because it breaks POSIX-compatibility in this instance:
So on Linux, we're still fine. On a POSIX-compatible system, we would confuse receiving a zero-length datagram with a connection being closed.
Raw socket datagrams always include the IP header, so their length can never be 0[^man_raw].
FIFO/pipe
In the docs for read():
What this says is that when a reader encounters EOF on a pipe, it means that all writers have closed their FIFO or pipe file descriptors. It will keep encountering EOF with every call to read() while there are no writers.
If it's an unnamed pipe, it means that its write end is lost forever, because it can't be referred to in any other way than its file descriptor. If it's a FIFO, however, a new process can open it for writing and our reader will stop encountering EOF and will be able to proceed as it had before.
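A quick sketch (on Linux) of how EOF on a FIFO is not final: the reader sees EOF while there are no writers, then reads normally again once a new writer opens the FIFO.

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo_fifo")
os.mkfifo(path)

# O_NONBLOCK so the open doesn't wait for a writer to appear.
r = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
print(os.read(r, 16))           # b'' -- EOF: no writers are connected

w = os.open(path, os.O_WRONLY)  # a "new process" opening the write end
os.write(w, b"hello")
print(os.read(r, 16))           # b'hello' -- the same reader fd works again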
Character device
In the docs for read():
That means that Linux, the BSDs, or any other system can decide for itself what happens when a reader encounters EOF on a character device. I have not tested any other systems, but Linux seems to generate an EOF exactly once, and then, if the device is still operational, all subsequent reads will proceed as normal.
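This can be demonstrated without a real terminal by using a pseudo-terminal pair (a sketch; the pty slave is a character device in canonical mode, and 0x04 is the default EOF character):

import os
import pty

master, slave = pty.openpty()   # the slave side behaves like a terminal

os.write(master, b"hello\n")
print(os.read(slave, 16))       # b'hello\n'

os.write(master, b"\x04")       # ^D at the start of a line -> EOF
print(os.read(slave, 16))       # b'' -- EOF, generated exactly once

os.write(master, b"more\n")
print(os.read(slave, 16))       # b'more\n' -- reads proceed as normal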
Application
One of the consequences of that is that programs running in a terminal can keep reading from stdin and generally doing anything they want after the user presses Ctrl+D.
Some programs take advantage of that; one example is GDB -- if you press Ctrl+D while debugging a remote inferior, it will ask for confirmation and warn you that the inferior would be detached. If you change your mind and choose no, it will continue working as it was before.
It can do that without any issue because sending an EOF on a terminal is just a way to communicate with the process reading from it, it doesn't close the file or prevent further communication.
asyncio behavior
When a _UnixReadPipeTransport in asyncio encounters EOF, the following happens:
https://github.com/python/cpython/blob/aa874326d86734606a1930e7f48e2ee204cae0b6/Lib/asyncio/unix_events.py#L527-L532
It stops reading from that file (unregisters it from selection) and calls connection_lost on the protocol. It also destroys itself and closes the file:
https://github.com/python/cpython/blob/aa874326d86734606a1930e7f48e2ee204cae0b6/Lib/asyncio/unix_events.py#L587-L594
This behavior is correct when we're dealing with stream (or technically datagram, but in that case it would just never execute) sockets or unnamed pipes, but it's incorrect if what we're reading from is anything else -- a FIFO or a character device.
Consequences
It means that the kind of thing GDB does is not possible with asyncio's connect_read_pipe. Here's an example of what doesn't work:
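(A minimal sketch of this kind of program, assuming a StreamReader wired to sys.stdin via connect_read_pipe:)

import sys
import asyncio

async def read_stdin():
    loop = asyncio.get_running_loop()
    reader = asyncio.StreamReader()
    protocol = asyncio.StreamReaderProtocol(reader)
    await loop.connect_read_pipe(lambda: protocol, sys.stdin)
    async for line in reader:
        print("read:", line.decode())
    # After Ctrl+D we would like to keep going, but the default transport has
    # already closed sys.stdin, so nothing below can read from it again.
    print("EOF")

asyncio.run(read_stdin())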
Additionally, stdin is now closed, so calling connect_read_pipe again will not work either.
The only ways to work around that on asyncio would be to either monkeypatch _UnixReadPipeTransport or use os.dup. I suppose one could also wrap the pipe file object fed to connect_read_pipe to return a sentinel instead of an empty bytes object on EOF, and also modify the protocol to interpret that sentinel in the correct way. None of these options are particularly pleasant.
The same thing works without issue with the standard Python file API:
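(A sketch of the equivalent blocking version; on a terminal the EOF from Ctrl+D is not final, so the second loop simply picks up where the first left off:)

import sys

while (line := sys.stdin.readline()):
    print("read:", line, end="")
print("EOF")

# sys.stdin is still open, so we can just keep reading.
while (line := sys.stdin.readline()):
    print("read again:", line, end="")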
Possible solution
I don't know how Python developers feel about changing the asyncio API in ways that could subtly break programs, but I hope that the argument for correctness and the fact that some fairly important use cases are currently impossible are strong enough to warrant such a change.
Ideally, _UnixReadPipeTransport would be changed not to call connection_lost on EOF, instead only calling it on exception. The task of interpreting EOF would then be placed upon protocols. The list of protocols already includes specific protocols for datagrams, streams, and unnamed pipes, so it's not a huge stretch to also have specific protocols for FIFOs and character devices. Their behavior is different from all the protocols that are already there, and I would wager that reading from a terminal is a pretty common use case.
Alternatively, connect_read_pipe could return different transports for different file types, but that would require it to distinguish between unnamed pipes and FIFOs, which I'm not sure is possible.
Additionally, StreamReader does not work correctly when reading from a named pipe or a character device, as it assumes that EOF is a permanent state -- which is pretty reasonable: it is a stream reader and it works for stream sockets; the issue is that a character device or pipe is not one. It could be modified to accommodate those types of files, should the responsibility for indicating that the connection is closed when a stream socket receives an EOF be put on Protocol. That would make "stream" in this instance an abstraction specific to asyncio and not one that refers to a POSIX stream socket. Otherwise, other types of readers (and probably writers, which I didn't even touch on) should be created. I admit that CharacterDeviceReader does not exactly roll off the tongue.
I can't really think of less drastic ways to fix this, but perhaps there are some that I'm missing. Anyway, thank you for reading this! This is the first issue I'm submitting in the Python repository and I've tried to make it as detailed as possible.
[^S_ISFIFO]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- line 13390, page 393
[^socket1]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- line 57251, page 1772
[^socket2]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- lines 58076-58078, page 1793
[^socket_types]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- section 2.10.6, lines 18376-18377, page 524
[^man_recv]: recv(2), Linux man-pages 6.04
[^man_raw]: raw(7), Linux man-pages 6.04
[^fifo]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- lines 57221-57227, page 1771
[^char_device]: IEEE Std 1003.1-2017, Standard for Information Technology -- Portable Operating System Interface (POSIX), The Open Group Base Specifications Issue 7, 2018 Edition, ISBN 978-1-5044-4542-9 -- line 57217, page 1771