ntop / n2n

Peer-to-peer VPN
GNU General Public License v3.0
6.07k stars 927 forks source link

Slow shutdown on Windows #1087

Open Legend-Master opened 1 year ago

Legend-Master commented 1 year ago

Shutting down the edge will take a while most of the times

After doing some debugging, it seemed that select function is blocking us from shutting down (up to 10s when not waiting for server response)

https://github.com/ntop/n2n/blob/709590d66e4aed880b6fc2338d516d20437c17ad/src/edge_utils.c#L2876

Legend-Master commented 1 year ago

@hamishcoleman May I ask why do we have such a long wait time that blocks exiting?

And I don't really think this is necessary, since we'll exit after this and stop reading anyway

https://github.com/ntop/n2n/blob/6ecc201017729e3065905fe7ef085d2d1f49f06c/src/edge_utils.c#L3026-L3028

Legend-Master commented 1 year ago

https://github.com/ntop/n2n/pull/1111#issuecomment-1601315352

@hamishcoleman I added some logging and this is what I got, note that the time waiting for tun read thread is going to be much longer if you keep edge open for a while

And it did not go through Sleep(INFINITE) at all

22/Jun/2023 10:12:03 [edge_utils.c:2597] [OK] edge <<< ================ >>> supernode
22/Jun/2023 10:12:03 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:03 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:03 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:13 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:13 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:13 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:13 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:23 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:23 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:23 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:23 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:23 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:23 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:33 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:33 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:43 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:43 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:43 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:43 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:43 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:43 [edge_utils.c:2939] start waiting for select
22/Jun/2023 10:12:46 [edge.c:977] shutting down...
22/Jun/2023 10:12:53 [edge_utils.c:2941] finish waiting for select
22/Jun/2023 10:12:53 [edge_utils.c:3061] start waiting for tun read thread
22/Jun/2023 10:13:06 [edge_utils.c:3063] finish waiting for tun read thread
22/Jun/2023 10:13:06 [edge_utils.c:2864] **********************************
22/Jun/2023 10:13:06 [edge_utils.c:2865] Packet stats:
22/Jun/2023 10:13:06 [edge_utils.c:2866]       TX P2P: 0 pkts
22/Jun/2023 10:13:06 [edge_utils.c:2867]       RX P2P: 0 pkts
22/Jun/2023 10:13:06 [edge_utils.c:2868]       TX Supernode: 18 pkts (18 broadcast)
22/Jun/2023 10:13:06 [edge_utils.c:2869]       RX Supernode: 0 pkts (0 broadcast)
22/Jun/2023 10:13:06 [edge_utils.c:2870] **********************************
Legend-Master commented 1 year ago

I think the problem with select is that, on Linux

https://man7.org/linux/man-pages/man7/signal.7.html

Interruption of system calls and library functions by signal handlers If a signal handler is invoked while a system call or library function call is blocked, then either:

  • the call is automatically restarted after the signal handler returns; or

  • the call fails with the error EINTR.

This is not the case on Windows

hamishcoleman commented 1 year ago

Yes, that is true, however the select timeout is set at 10 seconds, so it should not take 3-9 minutes.

Legend-Master commented 1 year ago

Yes, that is true, however the select timeout is set at 10 seconds, so it should not take 3-9 minutes.

From that log, select was taking exactly 10 seconds, but https://github.com/ntop/n2n/blob/6ecc201017729e3065905fe7ef085d2d1f49f06c/src/edge_utils.c#L3026-L3028 took 13

I think the extremely long time is caused by waiting tun read thread, it takes no time if you ctrl c immediately, but if you leave it running for a while, it will wait a long time

momentforever commented 5 months ago

has this bug been fixed?

Legend-Master commented 5 months ago

Not yet, see #1121

hamishcoleman commented 5 months ago

No work is currently being done on this repository.

However, instead of the large patch linked above, you could apply this one line patch https://github.com/hamishcoleman/n3n/commit/4e282a1c37704bcd05b465a00152cf4e3d1b11fe

Legend-Master commented 5 months ago

There're 2 different blockers

hamishcoleman commented 5 months ago

Select not returning properly on Windows is also solvable with a couple of different one-line patches. One solution for select not returning properly is to simply wait 10 seconds on shutdown - as that is quite short compared to the potentially infinite timeout for a new tap packet to arrive (and a lot less invasive than adding a whole lot of windows specific threading code). Either way, no changes are being made to this repository

Legend-Master commented 5 months ago

Do we wanna consider reduce the timeout on Windows as an alternative? Maybe like 1 second? Close window timeout is around 5 seconds, and if we can't clean up in that time, the whole program is terminated

hamishcoleman commented 5 months ago

Sorry, I'm not considering anything in this repository.

Legend-Master commented 5 months ago

Alright

hamishcoleman commented 5 months ago

@Legend-Master , I've been thinking about the various "one-line" options I mentioned above and believe that the simplest is just to close one of the sockets that is included in the select() fdset - see https://github.com/hamishcoleman/n3n/commit/7427a712fb66dcd3bdf5ed1e743fa90253820f92 for the implementation I ended up using

Legend-Master commented 5 months ago

That's a solution too, since we're shutting down anyway, it makes sense 😀

momentforever commented 5 months ago

@Legend-Master , will this be fixed in the next release?

Legend-Master commented 5 months ago

I think there're some CLA problems currently, and the next release won't be coming soon

You can use https://github.com/hamishcoleman/n3n for now though