ipkn / crow

Crow is very fast and easy to use C++ micro web framework (inspired by Python Flask)
BSD 3-Clause "New" or "Revised" License
7.43k stars 889 forks source link

Calling stop() does not properly close all active connections, namely websocket connections. #377

Open pfeatherstone opened 4 years ago

pfeatherstone commented 4 years ago

Calling stop() does not properly close all active connections, namely websocket connections.

pfeatherstone commented 4 years ago

So i now manually keep track of websocket connections, and before i call stop(), i iterate through all the connections, and call close(). Now it turns out the handler that is dispatched at line 6898 in crow_all.h never gets called. If you do a bit more tracking it goes into boost.asio, at which point I loose track of the stack trace. Can anybody help? I know this repo is dead, but just checking in case some users have encountered similar problems.

The-EDev commented 3 years ago

I'll try to look into it if i have time. if you can provide some minimal example where the issue appears it would be great.

pfeatherstone commented 3 years ago

Don't worry. I use a combination of https://github.com/yhirose/cpp-httplib and https://github.com/zaphoyd/websocketpp.git. I also sometimes use https://github.com/civetweb/civetweb though i get hanging issues with the last one.

The-EDev commented 3 years ago

So I looked through the code (I know you said not to worry), It seems that stop() and close() do completely different things.

close() sends a close signal and once a confirmation is received, it closes the socket adaptor.

stop() on the other hand shuts down the IO services that all sockets use (HTTP request sockets or WebSockets). And while that works does work, the client just never receives any notice.

jf99 commented 3 years ago

This is indeed a problem if you stop() your server and restart it afterwards without much waiting. It will then fail to start with an exception:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >'
  what():  bind: Address already in use

Calling lsof gives you the reason:

$ lsof -i :18080
COMMAND     PID       USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
my-server 23405 jf99   44u  IPv4 149122      0t0  TCP localhost:18080->localhost:40806 (CLOSE_WAIT)

This is with reuse_addr = true in the acceptor's constructor (line 9269 in crow_all.h).

The-EDev commented 3 years ago

From what I've seen in the code (and I might be totally wrong), the primary issue is that Crow doesn't keep track of connections, (with the exception of HTTP connections in debug mode).

And the only solution I can think of is what @pfeatherstone was doing but in crow itself, which is keep track of all active connections and send a close signal right before stopping the server as part of stop().

sbenner commented 3 years ago

This is indeed a problem if you stop() your server and restart it afterwards without much waiting. It will then fail to start with an exception:

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::system::system_error> >'
  what():  bind: Address already in use

Calling lsof gives you the reason:

$ lsof -i :18080
COMMAND     PID       USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
my-server 23405 jf99   44u  IPv4 149122      0t0  TCP localhost:18080->localhost:40806 (CLOSE_WAIT)

This is with reuse_addr = true in the acceptor's constructor (line 9269 in crow_all.h).

As far as I remember there're OS settings which affect the CLOSE_WAIT timeouts e.g. net.ipv4.tcp_fin_timeout and tcp_keepalive_time etc. so it might not always be the case if you tweak your OS for high concurrency and load - to keep these sockfds closed in a timeley manner. TYVM

pfeatherstone commented 3 years ago

Like everything else in boost, it seems boost.asio is more complicated than what it needs to be.