Off by one message error could cause deadlocks.

hudon / spike

Brain Simulator Parallelization

http://nengo.ca/

1 stars 1 forks source link

Off by one message error could cause deadlocks. #90

Closed RobertElder closed 10 years ago

RobertElder commented 10 years ago

Off by one error.

hudon commented 10 years ago

why is this needed? We were already doing a double-block to ensure all nodes are synchronized before killing them. Why do we need to make them block a third time?

RobertElder commented 10 years ago

Same kind of problem we had in

https://github.com/Hudon/spike/pull/77

if we don't synchronize the process to a state where it's "totally done everything and just waiting for the exit signal", then a race condition can occur where the process exits, and any messages it has sent will disappear causing a deadlock.

Previous to this commit, the last line is a send_pyobj which is non-blocking so the process is free to exit as soon as it sends the message.

I still need to do an 8-hour deadlock test to be sure, but it seems to fix the deadlock on travis, and I haven't been able to get it deadlock on my desktop again in about a half hour of run time.

hudon commented 10 years ago

gotchya. We can instead remove code rather than adding: ie. remove the last send_pyobj in those three nodes' code.. this way the last thing they do is recv.

Then, in _communticate, do a recv only for commands that are not kill:

if message['cmd'] != 'kill':
  response = socket.recv_pyobj()

RobertElder commented 10 years ago

Sounds good. I think we would also need to remove line 212 of https://github.com/Hudon/spike/blob/master/src/distributiond.py

self.listener_socket.send_pyobj({'result': 'ack'})

hudon commented 10 years ago

hmm... we might want to use the daemon's response for error-handling later... but yea, either remove that or add to the if:

if message['cmd'] != 'kill' or addr != self.worker_addr:
  response = socket.recv_pyobj()

other than that :+1:

hudon commented 10 years ago

merge away :+1: