The current mechanism (setting SO_KEEPALIVE via sockopt) can take up to 2h to detect an invalid connection state.
I propose incorporating heartbeat messages into the protocol, with the following algorithm for discovering dead connections:
Set a timer to fire every x seconds with a heartbeat signal.
Handle this signal as follows:
  For every connection:
    If the connection has probably_alive set, clear probably_alive and send a heartbeat to that connection.
    Otherwise, remove the connection from the pool.
On receiving a heartbeat answer from a connection, set probably_alive on that connection.
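
The steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a proposed implementation: the names Connection, HeartbeatPool, send_heartbeat, on_timer and on_heartbeat_reply are hypothetical, the transport is reduced to a callback, and wiring on_timer to a real timer that fires every x seconds is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Connection:
    name: str
    # Assumption: a new connection starts as alive, so it gets one full
    # interval to answer its first heartbeat.
    probably_alive: bool = True


@dataclass
class HeartbeatPool:
    # Hypothetical transport hook: sends one heartbeat message to a connection.
    send_heartbeat: Callable[[Connection], None]
    connections: Dict[str, Connection] = field(default_factory=dict)

    def add(self, conn: Connection) -> None:
        self.connections[conn.name] = conn

    def on_timer(self) -> None:
        """Handle the heartbeat signal; meant to run once every x seconds."""
        # Iterate over a copy so connections can be removed during the sweep.
        for conn in list(self.connections.values()):
            if conn.probably_alive:
                # Clear the flag; only a heartbeat answer will set it again.
                conn.probably_alive = False
                self.send_heartbeat(conn)
            else:
                # No answer since the previous heartbeat: drop the connection.
                del self.connections[conn.name]

    def on_heartbeat_reply(self, name: str) -> None:
        """Call when a heartbeat answer arrives from the named connection."""
        conn = self.connections.get(name)
        if conn is not None:
            conn.probably_alive = True


# Usage: "a" answers its heartbeat, "b" stays silent.
sent = []
pool = HeartbeatPool(send_heartbeat=lambda c: sent.append(c.name))
pool.add(Connection("a"))
pool.add(Connection("b"))
pool.on_timer()               # first tick: heartbeats go to both
pool.on_heartbeat_reply("a")  # only "a" answers
pool.on_timer()               # second tick: "a" is probed again, "b" is removed
```

With this scheme, and assuming answers arrive within one interval, a connection that stops responding is removed roughly between x and 2x seconds later, so x directly controls how quickly dead connections are detected.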