hudon / spike

Brain Simulator Parallelization
http://nengo.ca/

Subnetworks with probes cause deadlock. #53

Closed RobertElder closed 11 years ago

RobertElder commented 11 years ago

On branch 'subnetworks', run make test to reproduce the error.

Most of the time the program seems to deadlock, and pressing Ctrl-C shows a stack trace in zmq.core.error.ZMQError.

Also, I noticed that when the example does run, it doesn't output anything for the probe data. I'm not sure if this is expected or not.

It is also possible that I am using the external interface incorrectly, since there may be multiple valid ways to do this now that subnetworks exist.

gretac commented 11 years ago

One thing I did to make it run was to move the get_data() calls for the probes so that they occur after the network has run. In theory, getting data right after construction should not cause problems, but that does not seem to be the case. In general, we should always access probe data after the network has run, since running the network is how the probes get populated. That is why you were not seeing any data in the probes when the example did manage to run.
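As a rough illustration of why the call order matters, here is a minimal sketch using toy classes. ToyProbe and ToyNetwork are hypothetical stand-ins written for this example and are not the nengo or spike API; the only point is that a probe accumulates samples while the network steps, so reading it before run() yields nothing.

```python
class ToyProbe:
    """Hypothetical stand-in for a probe; NOT the nengo/spike API."""

    def __init__(self):
        self._samples = []

    def record(self, value):
        # Called by the network once per simulation step.
        self._samples.append(value)

    def get_data(self):
        # Return a copy of whatever has been recorded so far.
        return list(self._samples)


class ToyNetwork:
    """Hypothetical stand-in for a network that owns probes."""

    def __init__(self):
        self.probes = []
        self.time = 0.0

    def make_probe(self):
        probe = ToyProbe()
        self.probes.append(probe)
        return probe

    def run(self, duration, dt=0.1):
        # Probes are only populated here, one sample per step.
        steps = int(round(duration / dt))
        for _ in range(steps):
            self.time += dt
            for probe in self.probes:
                probe.record(self.time)


net = ToyNetwork()
probe = net.make_probe()

before_run = probe.get_data()  # nothing recorded yet
net.run(0.5)
after_run = probe.get_data()   # one sample per simulated step
```

In this toy model, before_run is empty and after_run holds five samples, which matches the symptom of seeing no probe output when get_data() is called before the run.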

The next issue I noticed is that, even after moving the get_data() calls, the example still deadlocks most of the time. However, that is not due to the probes: when I remove all the probes from the network, the deadlocks still occur.

I will be examining this further.

RobertElder commented 11 years ago

Oh yeah, I'm not sure why I put the get_data call before the network runs; that clearly doesn't make any sense. If moving the get_data calls didn't fix it with our code, make sure you run it against their code too, because some of their test cases don't run perfectly against their code either.

Use a command like

$ /home/travis/virtualenv/python2.7/bin/python2 /home/travis/build/Hudon/spike/test/nengo_tests/test_subnetworks.py /home/travis/build/Hudon/spike/test/../examples/new-theano

and if that doesn't work, then their thing is busted too. I'm pretty sure it was passing with theirs before, though.

hudon commented 11 years ago

What do you mean by "some of their test cases don't run perfectly against their code"?

RobertElder commented 11 years ago

There are some test cases that I had to tweak in order to get them to run against their code, and if I remember correctly, there are a few that I couldn't get to work with their code yet. Anything that I have filed a bug or a pull request for is running and producing something that 'looks' correct with their code, but doesn't work with our code.

gretac commented 11 years ago

The deadlocking occurred because some active ensembles were trying to communicate with ensembles that had already finished (i.e. completed and exited). The connection to an exited ensemble is no longer valid at that point, so the active ensemble could not send to it. To avoid this, I forced all ensembles to synchronize before exiting, so no ensemble exits until all ensembles are finished.
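The synchronize-before-exit idea can be sketched with a barrier. This is only an illustration under stated assumptions: it uses threads and Python 3's threading.Barrier from the standard library rather than the ZMQ-connected ensemble processes in this project, and the names ensemble, exit_barrier, and finished_at_exit are made up for the example.

```python
import threading
import time

NUM_ENSEMBLES = 3  # assumed count, chosen for the example

# Every worker must reach the barrier before any worker may pass it.
exit_barrier = threading.Barrier(NUM_ENSEMBLES)

lock = threading.Lock()
finish_order = []       # order in which workers finished their work
finished_at_exit = []   # how many had finished when each worker exited


def ensemble(index):
    # Simulate ensembles that finish their work at different times.
    time.sleep(0.01 * index)
    with lock:
        finish_order.append(index)

    # Wait here until all ensembles are done, so that a slower peer can
    # still reach this one while it is finishing its own work.
    exit_barrier.wait(timeout=5)

    with lock:
        finished_at_exit.append(len(finish_order))


threads = [
    threading.Thread(target=ensemble, args=(i,))
    for i in range(NUM_ENSEMBLES)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every worker records that it has finished before waiting on the barrier, each one sees all NUM_ENSEMBLES peers as finished by the time it exits. Without the barrier, a fast worker could exit while a slower one still expected to send to it.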