project8 / hornet

Hornet is a nearline data processor for the Project 8 experiment

Sometimes hornet doesn't fully quit when it receives a SIGINT #6

Closed nsoblath closed 9 years ago

nsoblath commented 9 years ago

All threads appear to have stopped:

^C2015/06/03 16:28:09 [watcher] inotify error on directory watch read: interrupted system call
2015/06/03 16:28:09 [hornet] thread error!  cannot continue running
2015/06/03 16:28:09 [hornet] stopping 11 threads
2015/06/03 16:28:09 [worker] stopping on interrupt.
2015/06/03 16:28:09 [worker 3.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:28:09 [worker] stopping on interrupt.
2015/06/03 16:28:09 [worker 4.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:28:09 [worker] stopping on interrupt.
2015/06/03 16:28:09 [worker 0.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:28:09 [worker] stopping on interrupt.
2015/06/03 16:28:09 [worker 1.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:28:09 [classifier] stopping on interrupt.
2015/06/03 16:28:09 [classifier] finished.
2015/06/03 16:28:09 [amqp sender] stopping on interrupt.
2015/06/03 16:28:09 [amqp sender] finished.
2015/06/03 16:28:09 [mover] stopping on interrupt.
2015/06/03 16:28:09 [mover] finished.
2015/06/03 16:28:09 [worker] stopping on interrupt.
2015/06/03 16:28:09 [worker 2.2] no work remaining.  total of 1 jobs processed.
2015/06/03 16:28:09 [shipper] stopping on interrupt.
2015/06/03 16:28:09 [shipper] finished.
2015/06/03 16:28:09 [scheduler] stopping on interrupt
2015/06/03 16:28:09 [amqp receiver] stopping on interrupt.
2015/06/03 16:28:09 [amqp receiver] finished.
Killed

nsoblath commented 9 years ago

Here's an example that works:

^C2015/06/03 16:33:08 [hornet] termination requested...
2015/06/03 16:33:08 [hornet] stopping 11 threads
2015/06/03 16:33:08 [worker] stopping on interrupt.
2015/06/03 16:33:08 [worker 1.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:33:08 [worker] stopping on interrupt.
2015/06/03 16:33:08 [worker 2.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:33:08 [worker] stopping on interrupt.
2015/06/03 16:33:08 [worker 3.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:33:08 [worker] stopping on interrupt.
2015/06/03 16:33:08 [worker 4.1] no work remaining.  total of 0 jobs processed.
2015/06/03 16:33:08 [classifier] stopping on interrupt.
2015/06/03 16:33:08 [classifier] finished.
2015/06/03 16:33:08 [amqp sender] stopping on interrupt.
2015/06/03 16:33:08 [amqp sender] finished.
2015/06/03 16:33:08 [watcher] stopping on interrupt.
2015/06/03 16:33:08 [mover] stopping on interrupt.
2015/06/03 16:33:08 [mover] finished.
2015/06/03 16:33:08 [worker] stopping on interrupt.
2015/06/03 16:33:08 [worker 0.2] no work remaining.  total of 1 jobs processed.
2015/06/03 16:33:08 [shipper] stopping on interrupt.
2015/06/03 16:33:08 [shipper] finished.
2015/06/03 16:33:08 [scheduler] stopping on interrupt
2015/06/03 16:33:08 [amqp receiver] stopping on interrupt.
2015/06/03 16:33:08 [amqp receiver] finished.
2015/06/03 16:33:08 [hornet] All goroutines finished.  terminating...

nsoblath commented 9 years ago

It seems to depend on whether the SIGINT interrupts the inotify system call or is caught by hornet. In the former case, hornet doesn't exit. In the latter case, it's fine.

It looks like the defer statements for the watcher are never called, and I don't know why. The behavior is the same whether the function returns from the case block or breaks out of the runLoop.

nsoblath commented 9 years ago

Fixed in commit b5d984f.

The problem turned out to be the sending of StopExecution to the controlQueue. If one of the threads had already quit, the send blocked, because that thread was no longer there to receive from the queue. To fix this, I used the select-with-default idiom so the send doesn't hang when one or more of the threads is missing.
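For anyone reading this later, here is a minimal sketch of the select-with-default idiom described above. It is not the code from commit b5d984f: the `ControlMessage` type, the `broadcastStop` helper, and its signature are made up for illustration, and only the names `StopExecution` and `controlQueue` come from this thread.

```go
package main

// ControlMessage is a hypothetical stand-in for hornet's control message type.
type ControlMessage int

// StopExecution asks a goroutine to shut down (name taken from the issue text).
const StopExecution ControlMessage = iota

// broadcastStop attempts one StopExecution send per expected goroutine.
// The default case makes each send non-blocking: if nothing can accept the
// message right now (for example, the goroutine has already exited), the
// send is skipped instead of hanging. It returns the number of sends that
// were actually delivered.
func broadcastStop(controlQueue chan<- ControlMessage, nThreads int) int {
	delivered := 0
	for i := 0; i < nThreads; i++ {
		select {
		case controlQueue <- StopExecution:
			delivered++
		default:
			// Nobody is ready to receive; move on rather than block.
		}
	}
	return delivered
}
```

One trade-off of this idiom on an unbuffered channel: a goroutine that is still alive but not waiting on the channel at that instant will also miss the message, so it probably works best alongside some other guard on shutdown.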

I also added a timeout to the pool.Wait call, in case a thread doesn't quit (or doesn't quit cleanly) and so is never subtracted from the WaitGroup.