n0-computer / iroh

peer-2-peer that just works
https://iroh.computer
Apache License 2.0
2.61k stars 165 forks source link

fix(iroh-relay): do not use spawn_blocking in stun handler #2924

Closed dignifiedquire closed 1 week ago

dignifiedquire commented 1 week ago

Description

This resulted in memory accumulation , due to blocking threads not getting shutdown, once started. According to the docs from tokio, there is no way to cleanup these tasks/threads, once they have been started.

From the docs: https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html

Be aware that tasks spawned using spawn_blocking cannot be aborted because they are not async. If you call abort on a spawn_blocking task, then this will not have any effect, and the task will continue running normally. The exception is if the task has not started running yet; in that case, calling abort may prevent the task from starting.

When you shut down the executor, it will wait indefinitely for all blocking operations to finish. You can use shutdown_timeout to stop waiting for them after a certain timeout. Be aware that this will still not cancel the tasks — they are simply allowed to keep running after the method returns. It is possible for a blocking task to be cancelled if it has not yet started running, but this is not guaranteed.

The referred to limit is 512 by default, meaning with the default thread stack size of 2MiB, we are looking at 1GiB of memory, simply for keeping around these threads.

Looking at https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_keep_alive these threads should get cleaned up after 10s, but this does not seem to happen. So there seems to be some unintended lingering to happen, in either case.

Breaking Changes

None

Notes & open questions

This is unfortuante, but the parsing of stun packets should be fast enough.

Change checklist

github-actions[bot] commented 1 week ago

Documentation for this PR has been generated and is available at: https://n0-computer.github.io/iroh/pr/2924/docs/iroh/

Last updated: 2024-11-13T22:50:56Z

github-actions[bot] commented 1 week ago

Netsim report & logs for this PR have been generated and is available at: LOGS This report will remain available for 3 days.

Last updated for commit: 9e1bb5b

divagant-martian commented 1 week ago

the spawn blocking was introduced in #1091, do you or @flub who approved it remember why you thought it was necessary?

flub commented 1 week ago

From my memory spawn_blocking was introduced as a knee-jerk reaction for some behaviour we didn't understand. I recall looking several times at this and wondering if the threads overhead really was better than this tiny bit of computation... but didn't dare to touch it as it was a response to an incident. I'm all for removing it, we can do better now (I mean both evaluating and responding to incidents as well as writing a scalable server).

Arqu commented 1 week ago

the spawn blocking was introduced in #1091, do you or @flub who approved it remember why you thought it was necessary?

Oh boy that old code had like task::spawn > spawn_blocking > spawn > spawn_blocking > spawn ...

Im guessing we shouldn't have perf issues with blocked/timed out STUN requests given things are async + the handle_stun_request is in it's own task. I assume the sock.send_to will eventually time out?

matheus23 commented 1 week ago

Quoting dig from the depths of the discord archive, #1091 was introduced because:

I also added some potential improvements to the stun server to make it scale better here not totally sure how much sense it makes though to spawn a task for each incoming req

Flub:

no way that scaling is the issue right now

Dig:

yes, but I wanted to make sure it wasn't

Sounds like this was some fairly optional change.

dignifiedquire commented 1 week ago

merging, if this turns out to be a problem, we have better ways looking at them now as @flub pointed out