iroh::net::Endpoint::close should not be fallible

CGamesPlay commented 2 weeks ago

I was advised to open an issue in https://github.com/n0-computer/iroh/discussions/2860. The problem is that Endpoint::close is async and fallible.

Fallible. I investigated, the method cannot fail unless sending Shutdown to the Actor fails, but this isn't actually a failure: this send can only fail if all recievers are dropped, and the receiver is owned by Actor::run. Logically, if this method fails, the socket is already technically shut down.

Further, the Actor::run method cannot fail except via panicking or if the network monitor fails to initialize. It seems clear to me that having initialization code fail after the MagicSock is initialized is a different bug: either the error shouldn't stop the entire Actor, or the fallible code should have happened before the initializer returned Ok. The only other way this function can return is by receiving ActorMessage::Shutdown.

Also, if the method fails, self gets dropped anyways, so there's no way to handle the error.

Async. Quinn's API for closing endpoints is better, and I think would cleanly apply to Iroh: close can be called from anywhere and wait_idle can be called from my main server task to gracefully shut down. Discussion below provides rationale for keeping it async.

It's also unclear what happens if you cancel the close. Does the endpoint stay open? Does it close itself in the background?

Also, it's strange that Connection::close is sync, and that works just fine.

matheus23 commented 2 weeks ago

Also, it's strange that Connection::close is sync, and that works just fine.

IMO this is a pitfall: If you call close, then drop the Endpoint, you might not have actually closed the connection gracefully, because the close frame could be lost in transit. Endpoint::wait_idle (is badly named and) makes sure that there's no more open connections (including connections where we're just waiting for an ACK for a close frame).

I see a major footgun in having iroh_net::Endpoint::close be synchronous without a call to quinn::Endpoint::wait_idle, as then people will expect this to work and just drop the endpoint. And now connection closes are flaky.

IMO it makes sense that a call to Endpoint::close, which does things on the network, is async.

It's also unclear what happens if you cancel the close. Does the endpoint stay open? Does it close itself in the background?

The close call may or may not go through. It will close in the background iff you don't drop the Endpoint before it manages to finish ACKing.

CGamesPlay commented 2 weeks ago

The cancelation story is weird and is why I don't think having this method be async makes sense. If you call close on an Endpoint, and cancel the future midway through (during the wait_idle period), the Endpoint is going to be in an inconsistent state. The server will already be disabled but some connections will still be alive. The state transition from "operational" to "closing" happens in the first poll; after this point the server stops accepting new connections and existing connections are already in the process of being closed. All tasks that were working with this connection will stop themselves now (LocallyClosed error), because they see the graceful shutdown has started. But here msock.close will never get called, just dropped later on (this may lead to a task / memory leak depending on the implementation; I haven't checked).

As a user of the library, I have some task that is accepting connections on this endpoint, and it's conceptually the "manager" of the endpoint. When my application receives a command to shut down, I need to do so gracefully for a time, and then forcefully. Right now, calling close does accomplish this goal, for the reasons above, but the fact that it works feels like an accident rather than by design, and (I haven't tested, but) it might be holding on to the socket, preventing rebinding. The only safe way to do it is to implement an out-of-band stop signal that goes to my "manager" task.

I propose making two methods, one which synchronously transitions the endpoint to "closing", and a second that waits for the endpoint to actually be closed. If this were the case, my shutdown command would call the synchronous method, then the "manager" task, as part of its orderly shutdown, would wait for the connection to be fully closed before exiting.

It seems like we are in agreement that the method shouldn't be fallible, at least. I also think it's very strange for a cheaply-cloned object to move self in an async method. It should probably take &self instead: if the logic of the shutdown process doesn't work with &self, then it also doesn't work with clones of the object.

matheus23 commented 2 weeks ago

Right now, calling close does accomplish this goal, for the reasons above, but the fact that it works feels like an accident rather than by design

Gracefully shutting down is exactly what the Endpoint::close method is designed for.

matheus23 commented 2 weeks ago

I propose making two methods, one which synchronously transitions the endpoint to "closing", and a second that waits for the endpoint to actually be closed. If this were the case, my shutdown command would call the synchronous method, then the "manager" task, as part of its orderly shutdown, would wait for the connection to be fully closed before exiting.

We are intentionally avoiding exposing a synchronous Endpoint::close method, because most people will just not know that they need to have another "manager" task that waits for orderly shutdown before dropping the Endpoint.

CGamesPlay commented 2 weeks ago

Well, ultimately I think that this is a minor inconvenience (Iroh can easily do something, but doesn't expose the API, but I can semi-easily recreate the behavior), so no need to push on it. I've removed the async portion from the issue description.

I do think the fallibility issue, move-self semantics issue, and cancel safety issue still stand, but feel free to close this if the team disagrees.

[append] The cancel-safety one is hypothetical, actually. There won't be any problem as long as the magic socket shuts itself down when the endpoint is fully dropped.

flub commented 2 weeks ago

The move-self is odd because Endpoint is Clone anyway, so it doesn't really achieve anything. I think it was put in there as a guidance more than anything else.

I also think the fallibility analysis is interesting. I think normally this would never fail, unless there was a bug in the implementation already. So the fallibility is only there to catch implementation bugs. Question is whether you want the application to know there was an implementation bug or not? It can't do much about it, it may sometimes result in a bug report but I guess in the vast majority of cases that wouldn't be the case.

I guess cancel-safety needs to be documented clearly so users can understand the semantics.

CGamesPlay commented 2 weeks ago

Question is whether you want the application to know there was an implementation bug or not?

I would think that anyone interested in detecting implementation bugs would have a panic handler set up to catch them.

n0-computer / iroh

iroh::net::Endpoint::close should not be fallible #2867