hudon / spike

Brain Simulator Parallelization
http://nengo.ca/
1 stars 1 forks source link

Current build hangs on a simple version of matrix_multiply.py code. #12

Closed RobertElder closed 11 years ago

RobertElder commented 11 years ago

It looks like the program hangs (or goes extremely slowly) on some inputs.

I have created a reproducible error example that you can run at spike/tests/error-1

The observed result is the program stops making progress after Iteration 2.

The expected result is seen by changing this line:

command="python -m ${THIS_SCRIPT_DIRECTORY}/matrix_multiplication ${THIS_SCRIPT_DIRECTORY}/../../src/distribute-proto"

to this line

command="python -m ${THIS_SCRIPT_DIRECTORY}/matrix_multiplication ${THIS_SCRIPT_DIRECTORY}/../../examples/theano"

in show-error.sh (then run ./show-error.sh again)

RobertElder commented 11 years ago

Also, I remember that we expected problems (deadlocks?) with some cases that were recurrent networks? As far as I can tell this one is not a recurrent network. I also changed the example to make it simpler to run faster. If you think some of the simplification makes the example invalid, check out an earlier version of the error case and run that one too. They should all re-produce the error.

RobertElder commented 11 years ago

Upgraded from theano 0.6.0rc1 to 0.6.0rc3

Current versions (was unable to install a newer build of numpy)

theano.version.version '0.6.0rc3' scipy.version.version '0.9.0' numpy.version.version '1.6.1'

Theano started generating some warnings about using dictionary types, and to change them to OrderedDict type. Did this, and re-ran test case. Still hangs.

This was the theano warning:

UserWarning: The parameter 'updates' of theano.function() expects an OrderedDict, got <type 'dict'>. Using a standard dictionary here results in non-deterministic behavior. You should use an OrderedDict if you are using Python 2.7, or use a list of (shared, update) pairs. Do not just convert your dictionary to this type before the call as the conversion will still be non-deterministic.

RobertElder commented 11 years ago

robert@robert-ubuntu:~/spike/test/error-1$ ./show-error.sh /usr/bin/python -m /home/robert/spike/test/error-1/matrix_multiplication /home/robert/spike/test/error-1/../../src/distribute-proto neurons: 11 Iteration 0 1365376932.35 Iteration 1 1365376936.38 ^CProcess input A: Traceback (most recent call last): Process A: Traceback (most recent call last): File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap "main", fname, loader, pkg_name) File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/home/robert/spike/test/error-1/matrix_multiplication.py", line 46, in net.run(0.001) File "/home/robert/spike/test/error-1/../../src/distribute-proto/nef/network.py", line 118, in run timer_conn.recv() KeyboardInterrupt self.run() Traceback (most recent call last): File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run self._target(_self._args, _self._kwargs) File "/home/robert/spike/test/error-1/../../src/distribute-proto/nef/ensemble.py", line 179, in run ticker_conn.recv() File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run self._target(_self._args, _self._kwargs) File "/home/robert/spike/test/error-1/../../src/distribute-proto/nef/input.py", line 43, in run ticker_conn.recv() KeyboardInterrupt KeyboardInterrupt ^CError in atexit._run_exitfuncs: Traceback (most recent call last): File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs func(_targs, _kargs) File "/usr/lib/python2.7/multiprocessing/util.py", line 295, in _exit_function p.join() File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join res = self._popen.wait(timeout) File "/usr/lib/python2.7/multiprocessing/forking.py", line 148, in wait return self.poll(0) File "/usr/lib/python2.7/multiprocessing/forking.py", line 133, in poll pid, sts = os.waitpid(self.pid, flag) KeyboardInterrupt Error in sys.exitfunc: Traceback (most recent call last): File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs func(_targs, _kargs) File "/usr/lib/python2.7/multiprocessing/util.py", line 295, in _exit_function p.join() File "/usr/lib/python2.7/multiprocessing/process.py", line 145, in join res = self._popen.wait(timeout) File "/usr/lib/python2.7/multiprocessing/forking.py", line 148, in wait return self.poll(0) File "/usr/lib/python2.7/multiprocessing/forking.py", line 133, in poll pid, sts = os.waitpid(self.pid, flag)

RobertElder commented 11 years ago

I spent most of yesterday trying to track down the issue, and I added some debug print statements to see if it a concurrency issue. Note that from what I've read python print statement is not completely thread safe, but it should never mix lines and I think that it prints things in the correct order (It is really hard to find documentation). This is the best I've got until I replace print with a threadsafe method of output. Also, Ive pushed this on the error-1-investigation branch, so you can take a look:

robert@robert-ubuntu:~/spike/test/error-1$ ./show-error.sh /usr/bin/python -m /home/robert/spike/test/error-1/matrix_multiplication /home/robert/spike/test/error-1/../../src/distribute-proto Just maked pipe named Array 1 Just added pipe named Input 1 Just maked pipe named Array 2 neurons: 11 Iteration 0 UPDATING UPDATING proc.start proc.start proc.start Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn recv Started Executing run in Process 3938 Array 2 Process 3938 Array 2 ticker_conn recv Process 3938 Array 2 entering ensemble tick Process 3938 Array 2 pipe.poll() Process 3938 Array 2 tick returned false Process 3938 Array 2 ticker_conn send Process 3938 Array 2 ticker_conn recv Started Executing run in Process 3936 Array 1 Process 3936 Array 1 ticker_conn recv Process 3936 Array 1 entering ensemble tick Process 3936 Array 1 pipe.poll() Process 3936 Array 1 tick returned false Process 3936 Array 1 ticker_conn send Process 3930 Matrix Multiplication timer_conn recv Process 3936 Array 1 ticker_conn recv Started Executing run in Process 3937 Input 1 Process 3937 Input 1 ticker_conn recv Pipe.send from 3937 Input 1 Process 3937 Input 1 ticker_conn send Process 3937 Input 1 ticker_conn recv Process 3930 Matrix Multiplication timer_conn recv Iteration 1 Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn send Pipe.send from 3937 Input 1 Process 3930 Matrix Multiplication timer_conn recv Process 3938 Array 2 entering ensemble tick Process 3938 Array 2 pipe.poll() Process 3937 Input 1 ticker_conn send Process 3938 Array 2 tick returned false Process 3937 Input 1 ticker_conn recv Process 3938 Array 2 ticker_conn send Process 3938 Array 2 ticker_conn recv Process 3936 Array 1 entering ensemble tick Process 3936 Array 1 pipe.poll() Process 3936 Array 1 pipe.recv() Process 3936 Array 1 tick returned true Process 3936 Array 1 about to do theano_tick() Process 3936 Array 1 after theano_tick() Process 3936 Array 1 ticker_conn send Process 3930 Matrix Multiplication timer_conn recv Process 3930 Matrix Multiplication timer_conn recv Iteration 2 Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn send Process 3930 Matrix Multiplication timer_conn send Pipe.send from 3937 Input 1 Process 3930 Matrix Multiplication timer_conn recv Process 3938 Array 2 entering ensemble tick Process 3938 Array 2 pipe.poll() Process 3937 Input 1 ticker_conn send Process 3938 Array 2 pipe.recv() Process 3937 Input 1 ticker_conn recv Process 3938 Array 2 tick returned true Process 3938 Array 2 about to do theano_tick() Process 3936 Array 1 ticker_conn recv Process 3936 Array 1 entering ensemble tick Process 3936 Array 1 pipe.poll() Process 3936 Array 1 pipe.recv() Process 3936 Array 1 tick returned true Process 3936 Array 1 about to do theano_tick() Process 3936 Array 1 after theano_tick() Process 3936 Array 1 ticker_conn send Process 3930 Matrix Multiplication timer_conn recv Process 3930 Matrix Multiplication timer_conn recv Process 3936 Array 1 ticker_conn recv

What looks like the common element in all cases, is that Array 2, goes into theano tick and never comes out again. This causes a hang because all other processes are waiting on recv, so nothing can progress.

RobertElder commented 11 years ago

I looked into this again today, and since my version of numpy is earlier than yours, I tried to upgrade, but after spending 4 hours on this, I can't get it to work properly with 1.7.0. I can get the install to work, but there is some problem with a symbol not being found inside a atlas library, when it goes inside the theano code.

I am currently running 1.6.1, so if you have the motivation, you could downgrade your version and test it out for me to see if that gives you the same error.

Also of concern is that when I run some of the post installation tests here I get a couple failures in scipy

http://deeplearning.net/software/theano/install_ubuntu.html#install-ubuntu

RobertElder commented 11 years ago

After further investigation it looks like this bug might be caused by a deadlock caused by accessing shared data structures.

A root cause can be traced down to a theano virtual machine extension used to evaluate call graphs.

To replicate my observations you will need to modify one of the C extensions that theano loads.

This file is located at

/usr/local/lib/python2.7/dist-packages/theano/gof/lazylinker_c.c.txt

on my computer.

In this c file, include

include

On line 532, inside the function c_call, replace the line

530 else 531 { 532 err = fn(self->thunk_cptr_data[node_idx]); 533 }

with

530 else 531 { 532 printf("Before err = \n"); 533 err = fn(self->thunk_cptr_data[node_idx]); 534 printf("After err = \n"); 535 }

Delete the theano compiled extension cache or your changes will not be reflected:

sudo rm -rf ~/.theano

On branch error-1-investigation do

./show-error.sh

You will sometimes observe the program hang on the statement

Before err =

Leading me to conclude that the instruction

err = fn(self->thunk_cptr_data[node_idx]);

might be blocking the process into a deadlock.

This error does not happen every time, but I've replicated it on most tries.

Also, note that I've made the assumption that there is no context switching or non-standard branching going on here, which I think is a reasonable assumption based on what the code is doing.

RobertElder commented 11 years ago

Oh yeah, it might be an infinite loop in fn too.

hudon commented 11 years ago

Interesting... So maybe we're sharing data between procs we shouldn't be sharing? try stripping down matrix_multiplication.py to be just 2 nodes: 'input B' and 'B', and comment out all other nodes and connections. Then, we can further narrow down to exactly what data we might be sharing between the procs when we shouldn't

RobertElder commented 11 years ago

I think it is already as minimal as I can make it and still get the error. Whenever I try to make a number lower or remove a layer, the problem goes away. (But I don't really know what I'm doing that well since the topology of networks is still not clear to me.)

RobertElder commented 11 years ago

This problem is caused by a deadlock involving code inside the libopenblas.so.0 linear algebra library.

For me, it was solved by going to

http://packages.debian.org/sid/libatlas3-base

and selecting the link for amd64 (even though I'm on intel?)

http://packages.debian.org/sid/amd64/libatlas3-base/download

Then make a backup copy of the file located at

/usr/lib/openblas-base/libopenblas.so.0

After extracting the archive, copy the file

~/Downloads/libatlas3-base_3.8.4-9.1_amd64/usr/lib/atlas-base/atlas/libblas.so.3.0

to the location

/usr/lib/openblas-base/libopenblas.so.0

overwriting the file that is there.

The error-1 example should work properly, and only misbehave when the original library is reinstated.

Doing these operations will only make sense if your installation symlinks are the same as mine.

The file /usr/lib/openblas-base/libopenblas.so.0 is referenced from

/usr/lib/libblas.so.3gf

which symlinks to

/etc/alternatives/libblas.so.3gf

which symlinks to

/usr/lib/openblas-base/libopenblas.so.0

Further information from gdb trace:

1) switch to the error-1-investigation branch and cd to the directory error-1

2) run the command

gdb --args /usr/bin/python -m /home/robert/spike/test/error-1/matrix_multiplication /home/robert/spike/test/error-1/../../src/distribute-proto

3) Once deadlock happens hit ctrl c, then type the commands shown below:

(gdb) info threads
  Id   Target Id         Frame 
  2    Thread 0x7ffff38c0700 (LWP 22695) "python" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
* 1    Thread 0x7ffff7fd8700 (LWP 22691) "python" 0x00007ffff7bcbd2d in read () at ../sysdeps/unix/syscall-template.S:82
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff38c0700 (LWP 22695))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff3c98dbd in ?? () from /usr/lib/libblas.so.3gf
#2  0x00007ffff7bc4e9a in start_thread (arg=0x7ffff38c0700) at pthread_create.c:308
#3  0x00007ffff699eccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd8700 (LWP 22691))]
#0  0x00007ffff7bcbd2d in read () at ../sysdeps/unix/syscall-template.S:82
82  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff7bcbd2d in read () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007fffede5eac7 in ?? () from /usr/lib/python2.7/lib-dynload/_multiprocessing.so
#2  0x00007fffede5fc70 in ?? () from /usr/lib/python2.7/lib-dynload/_multiprocessing.so
#3  0x000000000049890a in PyEval_EvalFrameEx ()
#4  0x0000000000498602 in PyEval_EvalFrameEx ()
#5  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#6  0x00000000004a8b62 in PyEval_EvalCode ()
#7  0x000000000049af5f in PyEval_EvalFrameEx ()
#8  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#9  0x00000000004983b8 in PyEval_EvalFrameEx ()
#10 0x000000000049f1c0 in PyEval_EvalCodeEx ()
#11 0x00000000004a8960 in ?? ()
#12 0x00000000004e9f36 in PyObject_Call ()
#13 0x00000000004a53a2 in ?? ()
#14 0x00000000004aaad5 in Py_Main ()
#15 0x00007ffff68cc76d in __libc_start_main (main=0x41b980 <main>, argc=4, ubp_av=0x7fffffffe578, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe568) at libc-start.c:226
#16 0x000000000041b9b1 in _start ()
hudon commented 11 years ago

blink blink, and then the bug just goes away? so it's not our fault? would be good to see what version of the blas lib @gretac has on Arch... to see if it further supports this

hudon commented 11 years ago

also, if this is in fact the fault, we need to think of solutions so that our users can get it working on ubuntu... I know, for instance, that they recommend openBLAS instead of ATLAS, because it's faster and allows you to select threads, but the shared lib you mention fixes the bug is from ATLAS... we can further discuss Sunday, I'll try this out tomorrow :) good work man!

RobertElder commented 11 years ago

It is possible that some of the theano stuff was never designed with multiprocessing in mind, so it does some things it shouldn't, and the problem only manifests because of certain race conditions with that particular library.

hudon commented 11 years ago

good point... once I reproduce the bug tomorrow, I'll do some research too. We might want to escalate to nengo James, see if he's aware of any multiprocessing+theano+BLAS issues. That would be a pretty great find, I kept assuming the problem was on our side :P

hudon commented 11 years ago

So BLAS has a few implementations. What @RobertElder and I are running is openBLAS. On Arch, the standard BLAS package is the LAPACK BLAS, also known as NetLib BLAS (mentioned in wiki article). This difference is probably why the bug wasn't appearing on Arch.

RobertElder commented 11 years ago

So that means we've seen 2 BLAS implementations that work and one that does not. That would at least lend credibility to the theory that there is a bug in the openblas version. I sure hope it is there. I don't look forward to debugging more concurrency problems like this if the problem is actually in theano. I've spent probably 4+ days debugging this issue.

RobertElder commented 11 years ago

Just a note, if you're trying to reproduce my results you might need to occasionally do

rm -rf ~/.theano

RobertElder commented 11 years ago

After re-compiling the blas library with debug information enabled, the line that it seems to be hanging on is line 308 of

https://github.com/xianyi/OpenBLAS/blob/develop/driver/others/blas_server.c

It always ends up waiting on a signal for pthread 0, even though there is a signal sent for pthread 0 on line 677.

hudon commented 11 years ago

@RobertElder , Why do we need to clear ~/.theano? Is it some sort of theano_function() cache? it doesn't get recompiled everytime we run the simulator?

RobertElder commented 11 years ago

Yea theano caches a bunch of compiled virtual machines and stuff there.

RobertElder commented 11 years ago

But it does recompile the stuff that is actually computed in theano_function() every time you run the program, it just caches a bunch of the stuff that supports being able to evaulate these call graphs. Stuff like

/usr/local/lib/python2.7/dist-packages/theano/gof/lazylinker_c.c.txt

is cached there.

gretac commented 11 years ago

So I tried replacing my blas package by the openblass-0.2.6-1 package. After changing that, I cleared the ~/.theano and ran the show_error script. I get the same problem as you guys, it just halts in theano_tick() in the first iteration and nothing happens. After, I re-installed the netlib blas package and again cleared the ~/.theano, everything worked fine. So, I have blas-3.4.2-1 (its installed by lapack and both are netlib implementations).

Also, it was necessary for me to run the rm -rf ~/.theano after switching these packages. Otherwise, when switching from openblas to netlib blas theano was still trying to use an openblas library.

RobertElder commented 11 years ago

Based on the code in openblas, and messing around with print statements that I inserted, it might be possible that openblas is working correctly, but because it is doing some fancy stuff with blocking threads, when we do our own fancy stuff with blocking processes, these two might be incompatible somehow?

RobertElder commented 11 years ago

Also, note that uninstalling theano does not clear the theano cache.

Gotta do lots of

rm -rf ~/.theano

hudon commented 11 years ago

This is fixed now by using another implementation of BLAS (not openBLAS).

In Ubuntu, on both my machine and robert's, we found that both NetLib blas (packages libblas3 or libblas3gf) and openBlas (openblas-base) were installed... so just do apt-get purge openblas-base and now it should work (a system reboot was required for me).