radicle-dev / radicle-link

The second iteration of the Radicle code collaboration protocol.

Git storage file system watcher is leaked #750

Closed · geigerzaehler closed this 2 years ago

geigerzaehler commented 2 years ago

librad leaks file system watchers. When a librad Peer is dropped, threads that watch the file system remain running. When many Peers are created, e.g. in tests, this leads to “Too many open files” errors. This issue may also be the cause of the “Bad file descriptor” errors upstream is seeing on CI.

The leak is caused by the threads spawned here https://github.com/radicle-dev/radicle-link/blob/bfdc3ec0c7ba542f2c9fc4f814f76fbd6fe6b2bc/librad/src/net/protocol/cache.rs#L89-L92

and here https://github.com/radicle-dev/radicle-link/blob/bfdc3ec0c7ba542f2c9fc4f814f76fbd6fe6b2bc/librad/src/net/protocol/cache.rs#L139-L142
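
For illustration, here is a minimal sketch of the pattern in question, written against the notify 4.x debounced API rather than the exact librad code (`spawn_cache_watcher` is a hypothetical name): the spawned thread, and the inotify descriptor behind it, only go away once the Watcher is dropped and the event channel closes.

```rust
use std::{path::Path, sync::mpsc, thread, time::Duration};

use notify::{watcher, RecursiveMode, Watcher as _};

fn spawn_cache_watcher(path: &Path) -> Result<notify::RecommendedWatcher, notify::Error> {
    let (tx, rx) = mpsc::channel();
    let mut watcher = watcher(tx, Duration::from_secs(1))?;
    watcher.watch(path, RecursiveMode::Recursive)?;

    thread::spawn(move || {
        // Runs -- holding an inotify fd -- until the Watcher is dropped,
        // which closes the sending half of the channel and ends this loop.
        for _event in rx {
            // invalidate caches, etc.
        }
    });

    // If the returned Watcher ends up in state that outlives the Peer
    // (e.g. shared protocol caches), the thread and its fd leak with it.
    Ok(watcher)
}
```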

kim commented 2 years ago

Dropping a Peer does not drop the protocol state.

geigerzaehler commented 2 years ago

> Dropping a Peer does not drop the protocol state.

Not sure about the exact mechanism, but the daemon, if it is used correctly, does make an effort to properly shut down the protocol.

kim commented 2 years ago

> Not sure about the exact mechanism, but the daemon, if it is used correctly, does make an effort to properly shut down the protocol.

The Peer holds on to the Caches in order to share them across protocol restarts. Since the protocol itself also holds a reference to them, the file watchers won't be dropped until the protocol is dropped. From the linked code, you can see that the thread stops once the events iterator is exhausted, which happens when the Watcher is dropped (and there is no other way, because Rust developers like their RAII).

In other words, if the protocol outlives the Peer, resources will leak (and not only the caches). Unfortunately, I don't know of a way to rule this out statically.
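
A self-contained sketch of that sharing, with stand-in types rather than the real librad ones: both the Peer and the protocol hold an Arc to the caches, so dropping the Peer alone never runs the Watcher's Drop.

```rust
use std::sync::Arc;

struct Watcher; // stands in for notify::RecommendedWatcher

impl Drop for Watcher {
    fn drop(&mut self) {
        // In the real code, dropping the Watcher closes the event channel,
        // which lets the spawned watcher thread exit.
        println!("watcher dropped; thread may now exit");
    }
}

struct Caches {
    _watcher: Watcher,
}

struct Peer {
    caches: Arc<Caches>, // held to share across protocol restarts
}

struct Protocol {
    caches: Arc<Caches>, // second strong reference
}

fn main() {
    let caches = Arc::new(Caches { _watcher: Watcher });
    let peer = Peer { caches: Arc::clone(&caches) };
    let protocol = Protocol { caches };

    drop(peer);
    // The protocol's reference keeps the Watcher -- and its thread -- alive.
    assert_eq!(Arc::strong_count(&protocol.caches), 1);

    drop(protocol); // only now does Drop run and the thread stop
}
```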

The only ways I can think of to fix this are to either not share the caches (which can incur a significant startup delay on large repositories), persist the caches to disk (uh-oh), or tie the lifetime of the Peer handle to that of the protocol it spawned (i.e. disallow protocol restarts). The latter would mean that callers wishing to outlive the Peer would need to manage a mutable reference to it, or guard it behind a mutex, which isn't great either.

geigerzaehler commented 2 years ago

Thanks for clarifying this. I guess this means that the daemon doesn’t shut down the protocol properly. Am I right in assuming that dropping the Peer, calling the function returned by Bound::accept(), and letting the future run to completion shuts down everything? Or is there something missing from radicle_daemon::peer::Peer::start()?

kim commented 2 years ago

> Am I right in assuming that dropping the Peer, calling the function returned by Bound::accept(), and letting the future run to completion shuts down everything?

It should.

> Or is there something missing from radicle_daemon::peer::Peer::start()?

Hm, I see that this actually consumes self, which is not Clone. So I’m not sure how we could end up with a “dangling” Peer. Is this used somehow differently in upstream?
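
For illustration, a self-contained sketch of that shutdown ordering, using hypothetical stand-ins (an Arc'd Caches plus a oneshot stop signal) rather than the real librad types: the watcher state is only dropped once the protocol future has completed and the Peer's handle is gone.

```rust
use std::sync::Arc;

use tokio::sync::oneshot;

struct Caches; // stand-in for the shared state holding the file watcher

// Stand-in for the future driving the protocol: it holds its own
// reference to the caches and releases it when the run loop finishes.
async fn protocol_run(caches: Arc<Caches>, stop: oneshot::Receiver<()>) {
    let _ = stop.await;
    drop(caches);
}

#[tokio::main]
async fn main() {
    let caches = Arc::new(Caches);
    let peer_handle = Arc::clone(&caches); // the Peer's reference

    let (stop_tx, stop_rx) = oneshot::channel();
    let running = tokio::spawn(protocol_run(caches, stop_rx));

    // Shutdown: signal the protocol, await its future to completion, and
    // drop the Peer's handle. Only when both references are gone is the
    // watcher state (here: Caches) actually dropped.
    let _ = stop_tx.send(());
    running.await.expect("protocol task panicked");
    drop(peer_handle);
}
```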

geigerzaehler commented 2 years ago

> It should.

And it looks like it does on the latest master. I was able to reproduce the “Too many open files” error on a previous version of radicle-proxy, but not anymore. Using lsof also confirmed that inotify was the issue.
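
For reference, a small Linux-only helper of the kind that can back such a check in a test (an assumption on my part, not part of the daemon): if the count keeps growing as Peers are created and dropped, descriptors are leaking.

```rust
use std::io;

/// Count this process's open file descriptors (Linux only), by listing
/// /proc/self/fd. Comparable to what `lsof -p <pid>` shows.
fn open_fds() -> io::Result<usize> {
    Ok(std::fs::read_dir("/proc/self/fd")?.count())
}

fn main() -> io::Result<()> {
    let before = open_fds()?;
    // ... create and drop a Peer here ...
    let after = open_fds()?;
    assert!(after <= before, "leaked {} fd(s)", after - before);
    Ok(())
}
```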