nthallen / monarch

Monarch Data Acquisition System

TIME_WAIT #83

Open nthallen opened 4 years ago

nthallen commented 4 years ago

When TCP daemons terminate, their ports are not immediately available for a restart. The ports stay unavailable until TIME_WAIT elapses, which is fairly long (60 seconds on Linux). This could make rapid restarting difficult. At the very least, it requires some thought, since it currently interferes with startup.

The symptom is:

[FATAL] bfr: TCP: bind(0.0.0.0, 1350) failed with error 98: Address already in use

and

nort@ubuntu:~/Exp/Bootstrap$ netstat -na | grep :1350
tcp        0      0 127.0.0.1:1350          127.0.0.1:50746         TIME_WAIT 
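
A common remedy for this bind() failure on POSIX systems is to set SO_REUSEADDR on the listening socket before calling bind(). Whether bfr's socket code already does this is not stated in this thread, so the following is only a minimal sketch under that assumption. It is written in Python purely for illustration (the same option is set through setsockopt() in C or C++), and make_listener and restart_demo are hypothetical names. On Linux the earlier listener generally needs to have set the option as well, so the sketch sets it on every listener.

```python
import socket


def make_listener(port, host="127.0.0.1"):
    """Create a listening TCP socket that tolerates lingering TIME_WAIT entries."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind(); it has no effect on an already-bound socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock


def restart_demo():
    """Simulate a daemon that closes a connection first and then restarts."""
    first = make_listener(0)          # port 0: the OS picks a free port
    port = first.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _addr = first.accept()
    conn.close()                      # server closes first, so its side enters TIME_WAIT
    client.recv(1)                    # returns b"" once the server's FIN arrives
    client.close()
    first.close()

    second = make_listener(port)      # immediate restart on the same port
    rebound_port = second.getsockname()[1]
    second.close()
    return port, rebound_port
```

Calling restart_demo() returns the same port number twice if the immediate rebind worked. SO_REUSEADDR still refuses to bind while another socket is actively listening on the port, so it does not hide a daemon that is genuinely still running.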

nthallen commented 4 years ago

The problem may be associated with an unclean shutdown of some sort. bfr and Bootstrapsrvr are both listening and using their ports, but only bfr's socket gets stuck in TIME_WAIT. We need to take better care to ensure that both sides do an orderly shutdown of socket connections.

nthallen commented 4 years ago

No, a connection always leaves a TIME_WAIT behind when it ends. If both processes are on the same node, then essentially both ports are blocked for the duration, but if they are on different nodes, it appears that the process that closes first ends up with the TIME_WAIT condition. So if we can adjust our protocols so that the clients close first, we should avoid having servers blocked from listening on their established ports. Since the clients choose essentially random port numbers, they are less likely to be bothered by the TIME_WAIT problem.
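
The "clients close first" idea can be sketched as follows. In TCP, the side that sends the first FIN (the active close) is the one that enters TIME_WAIT, so a server that keeps reading until it sees end-of-file and only then closes leaves the TIME_WAIT entry on the client's ephemeral port. This is a minimal Python illustration, not Monarch's actual protocol; serve_one and run_demo are hypothetical names.

```python
import socket
import threading


def serve_one(listener, events):
    """Accept one client and close only after the client has closed."""
    conn, _addr = listener.accept()
    with conn:
        received = b""
        while True:
            chunk = conn.recv(4096)
            if not chunk:             # b"" means the client sent its FIN first
                events.append("server saw client close")
                break
            received += chunk
        events.append(("server received", received))
    events.append("server closed")    # passive close: no TIME_WAIT on this side


def run_demo():
    events = []
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: the OS picks a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(target=serve_one, args=(listener, events))
    server.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.sendall(b"hello")
    events.append("client closing")
    client.close()                    # active close happens on the client

    server.join(timeout=5)
    listener.close()
    return events
```

run_demo() returns the recorded events in order, with the client's close preceding the server's. A real protocol would also need a way for the server to ask its clients to disconnect (for example a quit message) before it shuts down, which this sketch leaves out.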