Open flumm opened 3 days ago
Please share the actual code you are using. Your example violates the API contract of `async_io` by returning `WouldBlock` without having received that error from the OS in a syscall on the socket.
The original code in the `async_io` callback is this:

```rust
let mut buf = [0; HANDSHAKE_BYTES_LEN];
let raw_fd = incoming_stream.as_raw_fd();
let std_stream =
    unsafe { ManuallyDrop::new(std::net::TcpStream::from_raw_fd(raw_fd)) };
let peek_res = std_stream.peek(&mut buf);
match peek_res {
    Ok(peek_len) if peek_len < HANDSHAKE_BYTES_LEN => {
        Err(io::ErrorKind::WouldBlock.into())
    }
    res => res.map(|_| contains_tls_handshake_fragment(&buf)),
}
```
(see the original code: https://git.proxmox.com/?p=proxmox.git;a=blob;f=proxmox-rest-server/src/connection.rs;h=3815a8f40c082410c8e0091f4a57541f616003d2;hb=refs/heads/master#l473 )
The reason we do this is that we want to look at the data sent by the client and decide whether it's a TLS handshake, so we can either continue with an encrypted connection or handle an unencrypted HTTP request.
While I understand that this violates the contract of `async_io`, it's still unexpected that it interferes with the timeout in a way that never finishes, since the contract only talks about the socket in this case, not about how the future is polled/executed (it says: "[...] which can cause the socket to behave incorrectly.").
Yeah, that definitely violates the API contract. Unfortunately the OS provides no way to do what you want, so you're out of luck. It's not a Tokio limitation.
As for the issue itself, I guess the request is to fail more gracefully when the API contract is violated?
One question is why the timeout future is not being scheduled anymore when running into this.
I get that we're holding this wrong, and FWIW we have a stop-gap for this specific case: we check whether the `peek_len` from the previous call differs from the current one. When they stay the same, we can assume the socket became ready without new data being available and abort the waiting. But yes, that probably cannot work in the generic case.

That we have to do this to avoid spinning entirely is not what surprises me, though. Rather, with the current code I'd have expected that if the client disconnects, the tokio timeout would still be hit after 10s, leading "only" to at most 10s of a spinning thread due to us holding it wrong. I.e., why does the timeout future make no progress anymore, and why isn't it taken over by another thread if this one is blocked?
The problem is that `async_io` ends up blocking the thread.
OK, while I get that, two things are still surprising here: a timeout future that is not scheduled even when there are multiple idle worker threads available (I also tried a "manual" timeout future with `sleep` + `select!`, but that didn't work either).

So the questions are: should `async_io` "yield" (or the Rust equivalent) between callback invocations? And should tokio schedule the timeout future more intelligently?
The timeout future can't be scheduled for execution because it's already busy running. It's stuck inside this call to `poll`, which never returns because the future is blocking the thread.
It is reasonable to make `async_io` behave less badly when you hold it wrong, though we shouldn't make correct usage worse by yielding unnecessarily.
Version
tokio v1.41.1, tokio-macros v2.4.0

Platform
Linux hostname 6.11.0-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.11.0-1 (2024-10-23T15:32Z) x86_64 GNU/Linux
Description
Is it expected behaviour that I can easily prevent a timeout from happening with `async_io` on closed connections? I tried this code:
(In our real code we call `peek` on the underlying socket to check whether enough data is available, and if not, we return `WouldBlock`.)
After that, I connect to the server with `nc localhost 65000`.
If I cancel the nc connection (with CTRL+C) before the 5 seconds, the timeout is never reached and it loops forever in the `async_io` callback. If I don't close the connection (regardless of whether I send data or not), the timeout is reached normally.
I would have expected the timeout to trigger in both cases, but it does not.