hudon / spike

Brain Simulator Parallelization
http://nengo.ca/

URGENT: bug in ZMQ ticker #16

Closed hudon closed 11 years ago

hudon commented 11 years ago

branch: master

steps to repro:

  1. go to the run function in src/old-models/distribute-zeromq/nef/ensemble.py
  2. comment out self.make_tick() and self.tick() in run (just to get theano out of the equation)
  3. run python matrix_multiplication.py
  4. this succeeds because ensembles are just receiving ticks from ticker and sending tick back
  5. now simulate an ensemble dying by making it return early. the run function should look like this:
def run(self):
    self.bind_sockets()
    #self.make_tick()
    while True:
        self.ticker_conn.recv()
        #self.tick()
        if self.name == 'A': return  # simulate ensemble 'A' dying before it replies
        self.ticker_conn.send("")
  6. run python matrix_multiplication.py again

expected: should NOT terminate (or at least not get past 1 tick), since an ensemble is not responding to the ticker, so the ticker should block
actual: runs as normal

It seems like when the ticker does recv(), it doesn't clear the socket or something... it keeps receiving messages that have already been received... which means it never waits for anybody once 1 tick has been sent back to it
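
The ticker loop is doing roughly this (a sketch only; ticker_sock and ensembles are placeholder names, not the actual code):

# one recv() per ensemble per tick
for _ in ensembles:
    ticker_sock.recv()  # returns immediately if any frame is already queued,
                        # instead of blocking for a fresh reply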

hudon commented 11 years ago

I think I fixed it by receiving twice in the ticker. the first receive is supposedly the "id" of the sender, and the second receive will contain the message... (it fixes the bug reported)... now to find documentation that supports this.
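
For the record, a minimal sketch of the fix in the ticker loop (ensemble_conn is a placeholder name, not the actual socket in our code):

ensemble_conn.recv()          # part 1: the sender "id"/envelope frame
reply = ensemble_conn.recv()  # part 2: the actual tick reply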

RobertElder commented 11 years ago

Way to go James, +3000 points.

hudon commented 11 years ago

try this to test it out, notice how "world" is only printed for the second and fourth messages...

import zmq
import time
import sys
import threading

SOCKET_NAME = "tcp://127.0.0.1:8000"
#SOCKET_NAME = "inproc://mysocket"

def dealerRoutine(context):
    socket = context.socket(zmq.DEALER)
    socket.bind(SOCKET_NAME)
    time.sleep(1)  # give the workers time to connect
    # DEALER doesn't build envelopes for us: send the empty delimiter frame,
    # then the body, as one two-part message
    socket.send("", zmq.SNDMORE)
    socket.send("hello")
    socket.send("", zmq.SNDMORE)
    socket.send("hello")
    # each REP reply also arrives as two parts: the empty delimiter, then the body
    print "first msg", socket.recv()
    print "second msg", socket.recv()
    print "third msg", socket.recv()
    print "fourth msg", socket.recv()
    socket.close()

def workerRoutine(context):
    socket = context.socket(zmq.REP)
    socket.connect(SOCKET_NAME)
    s = socket.recv()  # REP strips the envelope: only the body shows up here
    print "worker received", s
    socket.send("world")  # REP re-attaches the envelope automatically

context = zmq.Context()

workers = []
for i in range(0, 2):
    worker = threading.Thread(target=workerRoutine, args=(context,))
    workers.append(worker)
    worker.start()

dealerRoutine(context)

for worker in workers:
    worker.join()  # each worker exits after answering one request

context.term()

RobertElder commented 11 years ago

Just ran that on my Ubuntu laptop and got

robert@robert-ubuntu:/tmp/testing$ python test.py
first msg
worker received hello
worker received hello
second msg world
third msg
fourth msg world

hudon commented 11 years ago

yep, same. so the worker sends 1 message back (the tick reply, in our case), but the dealer receives 2 message parts, and the body is only in the second recv ("second msg world", "fourth msg world")... I'll try to find where this is documented, then I'll close this
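
Aside: pyzmq can also pull all the parts of one message in a single recv_multipart() call, which makes the extra part visible. A sketch:

frames = socket.recv_multipart()  # from a REP peer: ['', 'world']
delimiter, body = frames          # the empty first frame is the envelope delimiter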

hudon commented 11 years ago

to read about it: http://stackoverflow.com/questions/16692807/why-do-i-need-to-receive-twice-for-every-message-sent-to-a-zmq-dealer/16694492#16694492

gist of it: we have DEALER (ticker) <-> REP (ensembles)
REP knows about envelopes and handles them implicitly
DEALER doesn't touch envelopes; the application must handle them explicitly
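
To make the asymmetry concrete, here's a single-file sketch (the port and message bodies are made up for illustration, not taken from our code):

import time
import zmq

context = zmq.Context()

dealer = context.socket(zmq.DEALER)
dealer.bind("tcp://127.0.0.1:8001")

rep = context.socket(zmq.REP)
rep.connect("tcp://127.0.0.1:8001")
time.sleep(0.5)  # crude, but lets the connection establish before DEALER sends

dealer.send_multipart(["", "tick"])  # DEALER prepends the empty delimiter by hand
print "rep sees:", rep.recv()        # REP strips the delimiter: prints "tick"
rep.send("tock")                     # REP re-attaches the delimiter automatically

delimiter, body = dealer.recv_multipart()  # DEALER must strip it by hand
print "dealer sees:", body                 # prints "tock"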

artemip commented 11 years ago

Why did this work earlier? What changed since the last successful execution of this code?

hudon commented 11 years ago

to recap on what I said in class for documentation's sake:

why did it run? It ran because the ticker never had to wait for anyone except on the first tick. On the first tick, it received messages from the ensembles, and since 2 message parts were being sent every time, the ticker always had "extra" messages to receive and so it never had to block... I wouldn't be surprised if, at the end when we do the cleanup(), some processes were still computing stuff but the ticker was so fast it just killed them

now, why was the output correct (on machines that don't have the theano_tick bug, at least)? I don't know. I'm guessing that when we looked at the output on greta's machine, it was correct by coincidence (i.e. it was winning a race condition while we were testing). So, the ensembles just happened to be able to compute everything before the ticker killed the procs, and the data queues were getting filled up here and there but were never full (which would have caused data loss/corruption). Maybe with different model parameters it would have failed, I don't know.

RobertElder commented 11 years ago

I spent most of today investigating the theano hanging bug and posted my findings on the github thread.
