Dropping a `Peer` does not drop the protocol state.

> Dropping a `Peer` does not drop the protocol state.

Not sure about the exact mechanism, but the daemon, if it is used correctly, does make an effort to properly shut down the protocol.

> Not sure about the exact mechanism, but the daemon, if it is used correctly, does make an effort to properly shut down the protocol.
The `Peer` holds on to `Caches` in order to share them across protocol restarts. Since the protocol itself also holds a reference to them, the file watchers won't be dropped until the protocol is dropped. From the linked code, you can see that the thread stops once the events iterator is exhausted (which it is when the `Watcher` is dropped, and there is no other way, because Rust developers like their RAII).
IOW, if the protocol outlives the `Peer`, it will leak resources (and not only caches). Unfortunately, I don't know of a way to rule this out statically.
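
To make the sharing concrete, here is a minimal std-only sketch of the ownership pattern described above; `Cache`, `spawn_cached_watcher`, and the `String` event type are stand-ins I made up, not librad's actual types:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

// Stand-in for librad's cache: it owns the watcher handle (here, the
// sending half of a channel plays the role of the fs watcher).
struct Cache {
    _watcher: mpsc::Sender<String>,
}

fn spawn_cached_watcher() -> Arc<Cache> {
    let (tx, rx) = mpsc::channel::<String>();
    thread::spawn(move || {
        // Blocks until every sender is dropped; this is the "events
        // iterator is exhausted" condition from the comment above.
        for event in rx {
            println!("fs event: {event}");
        }
        println!("watcher thread exiting");
    });
    Arc::new(Cache { _watcher: tx })
}

fn main() {
    let peer_side = spawn_cached_watcher();
    let protocol_side = Arc::clone(&peer_side); // shared across restarts
    drop(peer_side); // "dropping the Peer" releases only one reference,
                     // so the watcher thread is still alive here
    assert_eq!(Arc::strong_count(&protocol_side), 1);
    drop(protocol_side); // only now is the iterator exhausted
    thread::sleep(std::time::Duration::from_millis(50));
}
```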
The only ways I can think of to fix this would be to either not share the caches (which can incur a significant startup delay on large repositories), persist the caches to disk (uh-oh), or tie the lifetime of the `Peer` handle to that of the protocol it spawned (i.e. disallow protocol restarts). The latter would mean that callers wishing to outlive the `Peer` would need to hold a mutable reference to it, or guard it behind a mutex, which isn't great either. A sketch of that follows below.
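
For the third option, here is a rough sketch of what callers would have to do (hypothetical shapes, not an actual librad API): restarting requires exclusive access to the handle, so anything long-lived has to serialize access through a mutex.

```rust
use std::sync::Mutex;

// Hypothetical handle whose restart requires exclusive access.
struct Peer;

impl Peer {
    fn restart(&mut self) {
        // tear down the running protocol and spawn a fresh one
    }
}

fn main() {
    // A long-lived caller has to guard the handle in order to share it:
    let peer = Mutex::new(Peer);
    peer.lock().unwrap().restart();
}
```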
Thanks for clarifying this. I guess this means that the daemon doesn't shut down the protocol properly. Am I right in assuming that dropping the `Peer` and calling the function returned by `Bound::accept()` while completing the future shuts down everything? Or is there something missing from `radicle_daemon::peer::Peer::start()`?
> Am I right in assuming that dropping the `Peer` and calling the function returned by `Bound::accept()` while completing the future shuts down everything?
It should.
> Or is there something missing from `radicle_daemon::peer::Peer::start()`?
Hm, I see that this actually consumes `self`, which is not `Clone`. So I’m not sure how we could end up with a “dangling” `Peer`. Is this used somehow differently in upstream?
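
For illustration, a minimal sketch (hypothetical types, not the actual radicle_daemon code) of why a consuming, non-`Clone` `start` rules out a dangling handle:

```rust
struct Peer; // stand-in: the real Peer carries config, caches, etc.

impl Peer {
    // Taking `self` by value moves the Peer into the protocol run.
    fn start(self) {
        // run the protocol until shutdown; `self` is dropped at the end
    }
}

fn main() {
    let peer = Peer;
    peer.start();
    // peer.start();            // error[E0382]: use of moved value: `peer`
    // let copy = peer.clone(); // error[E0599]: `Peer` does not implement `Clone`
}
```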
> It should.
And it looks like it does on latest master. I’m able to reproduce the “Too many open files” error on a previous version of `radicle-proxy`, but not anymore. Using `lsof` also confirmed that inotify was the issue.
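
For what it’s worth, the same check can be done from inside a test on Linux without `lsof`, by counting the process’s open file descriptors; this helper is an assumption of mine, not something the test suite actually contains:

```rust
use std::fs;

// Linux-only: /proc/self/fd has one entry per open file descriptor.
fn open_fd_count() -> usize {
    fs::read_dir("/proc/self/fd").map(|d| d.count()).unwrap_or(0)
}

fn main() {
    let before = open_fd_count();
    // ... create and drop a Peer here ...
    let after = open_fd_count();
    assert!(after <= before, "leaked {} fds", after - before);
}
```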
librad leaks file system watchers. When a librad `Peer` is dropped, there remain threads which watch the file system. When a lot of `Peer`s are created, e.g. in tests, this leads to “Too many open files” errors. This issue may also cause the “Bad file descriptor” errors upstream is seeing on CI.

The leak is caused by the threads spawned here https://github.com/radicle-dev/radicle-link/blob/bfdc3ec0c7ba542f2c9fc4f814f76fbd6fe6b2bc/librad/src/net/protocol/cache.rs#L89-L92 and here https://github.com/radicle-dev/radicle-link/blob/bfdc3ec0c7ba542f2c9fc4f814f76fbd6fe6b2bc/librad/src/net/protocol/cache.rs#L139-L142
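
As a generic illustration of how the symptom surfaces (plain fd exhaustion, not librad’s watchers specifically), leaking descriptors in a loop eventually hits the per-process limit:

```rust
use std::fs::File;

fn main() {
    // Hold on to every handle so no descriptor is ever closed.
    let mut leaked = Vec::new();
    loop {
        match File::open("/etc/hostname") { // any file that exists will do
            Ok(f) => leaked.push(f),
            Err(e) => {
                // On Linux this is typically "Too many open files (os error 24)".
                eprintln!("failed after {} open fds: {e}", leaked.len());
                break;
            }
        }
    }
}
```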