Closed RobertElder closed 10 years ago
why is this needed? We were already doing a double-block to ensure all nodes are synchronized before killing them. Why do we need to make them block a third time?
Same kind of problem we had in https://github.com/Hudon/spike/pull/77
If we don't synchronize the process to a state where it's "totally done everything and just waiting for the exit signal", then a race condition can occur: the process exits, and any messages it has sent disappear, causing a deadlock.
Prior to this commit, the last line is a send_pyobj, which is non-blocking, so the process is free to exit as soon as it sends the message.
I still need to do an 8-hour deadlock test to be sure, but it seems to fix the deadlock on Travis, and I haven't been able to get it to deadlock on my desktop again in about half an hour of run time.
Gotcha. We can remove code instead of adding it, i.e. remove the last send_pyobj in those three nodes' code. This way the last thing they do is recv.
Then, in _communticate, do a recv only for commands that are not kill:

    if message['cmd'] != 'kill':
        response = socket.recv_pyobj()
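
A minimal sketch of that idea, not the actual spike code. The `communicate` helper and the `StubSocket` class are hypothetical; the stub stands in for the real ZeroMQ socket so the control flow can be run without any networking. Only `send_pyobj`, `recv_pyobj` and the `'cmd'`/`'kill'` keys come from this thread.

```python
class StubSocket:
    """Hypothetical stand-in for the real socket; records calls instead of doing I/O."""

    def __init__(self):
        self.sent = []        # every message passed to send_pyobj
        self.recv_calls = 0   # how many times recv_pyobj was called

    def send_pyobj(self, obj):
        self.sent.append(obj)

    def recv_pyobj(self):
        self.recv_calls += 1
        return {'result': 'ack'}


def communicate(socket, message):
    """Send a command and wait for a reply unless the command is 'kill'."""
    socket.send_pyobj(message)
    if message['cmd'] != 'kill':
        # Ordinary commands are still answered, so block for the reply.
        return socket.recv_pyobj()
    # A killed node's last action is its recv; it sends nothing back,
    # so waiting here would block forever.
    return None
```

With this shape, the node being killed never has an unsent message in flight when it exits, which is the state the extra synchronization was trying to reach.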
Sounds good. I think we would also need to remove line 212 of https://github.com/Hudon/spike/blob/master/src/distributiond.py
self.listener_socket.send_pyobj({'result': 'ack'})
hmm... we might want to use the daemon's response for error-handling later... but yeah, either remove that or add to the if:

    if message['cmd'] != 'kill' or addr != self.worker_addr:
        response = socket.recv_pyobj()
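
To spell out what that condition does, here is a small sketch. It assumes `addr` is the address the message was sent to and `worker_addr` is the worker's address; the function name and the two addresses are made up for illustration.

```python
def should_wait_for_reply(cmd, addr, worker_addr):
    """Mirror of the proposed condition: skip the recv only for a kill sent to the worker."""
    return cmd != 'kill' or addr != worker_addr


WORKER = 'tcp://worker:5000'  # hypothetical address
DAEMON = 'tcp://daemon:5001'  # hypothetical address

# Enumerate the four combinations of command and destination.
cases = [
    ('run', WORKER),   # 'run' is a placeholder for any non-kill command
    ('run', DAEMON),
    ('kill', WORKER),
    ('kill', DAEMON),
]
results = {case: should_wait_for_reply(case[0], case[1], WORKER) for case in cases}
```

Only the `('kill', WORKER)` case skips the recv, so the daemon's ack on kill is still read and stays available for error handling.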
other than that :+1:
merge away :+1:
Off by one error.