mila-iqia / blocks

A Theano framework for building and training neural networks

Data Parallelism for Blocks #664

Open rizar opened 9 years ago

rizar commented 9 years ago

It would be nice to have multi-GPU parallel data processing in Blocks. Something like:

algorithm = GradientDescent(...)
algorithm = ParallelAlgorithm(algorithm, devices=['gpu0', 'gpu1'], batch_axis=0)

https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs should be helpful for that.
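For reference, the recipe on that wiki page boils down to roughly the following: one worker process per GPU, Theano imported only inside each worker, and the gradients averaged in the parent. Everything below (the worker function, the old 'gpu0'/'gpu1' device names, the toy model) is only an illustrative sketch, not existing Blocks code, and it assumes two GPUs are available.

import os
import multiprocessing as mp

import numpy


def worker(device, batch_queue, grad_queue):
    # Pick the GPU before Theano is imported in this process; the parent
    # never imports Theano, so the choice takes effect here.
    os.environ['THEANO_FLAGS'] = 'device={},floatX=float32'.format(device)
    import theano
    import theano.tensor as tt

    x = tt.matrix('x')
    w = theano.shared(numpy.zeros((784, 10), dtype='float32'), name='w')
    cost = tt.sqr(tt.dot(x, w)).mean()
    compute_grad = theano.function([x], theano.grad(cost, w))

    for batch in iter(batch_queue.get, None):
        grad_queue.put(compute_grad(batch))


if __name__ == '__main__':
    devices = ['gpu0', 'gpu1']
    batch_queues = [mp.Queue() for _ in devices]
    grad_queue = mp.Queue()
    workers = [mp.Process(target=worker, args=(d, q, grad_queue))
               for d, q in zip(devices, batch_queues)]
    for p in workers:
        p.start()

    batch = numpy.random.rand(128, 784).astype('float32')
    # Split the batch along axis 0 and give one slice to each GPU.
    for q, part in zip(batch_queues, numpy.array_split(batch, len(devices))):
        q.put(part)
    average_grad = sum(grad_queue.get() for _ in devices) / len(devices)

    for q in batch_queues:
        q.put(None)  # tell the workers to stop
    for p in workers:
        p.join()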

dwf commented 9 years ago

I don't think this is practical at all, since AFAIK you cannot have imported Theano in the parent before doing this, and most files in Blocks import Theano. It's an unfortunate side effect (no pun intended) of having import side effects.

rizar commented 9 years ago

Hmm... you probably know better, but if we create new processes, maybe Theano could be imported anew in each of them?

It could be the killer feature of Blocks. For recurrent networks with lots of weight sharing, where a single batch takes 8 seconds to process, the speedup would be great. And Torch supports such things almost out of the box, AFAIK.
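A minimal sketch of that idea, assuming Python 3's 'spawn' start method (so every child gets a fresh interpreter and imports Theano anew) and the old theano.sandbox.cuda.use call to bind a GPU at runtime; the device names and toy function are made up:

import multiprocessing as mp


def train_chunk(device, n_batches):
    # Fresh interpreter under 'spawn': no CUDA context is inherited from the
    # launching process, so this worker can claim its own GPU here before
    # building any graph.
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(device)

    import numpy
    import theano
    import theano.tensor as tt

    x = tt.vector('x')
    f = theano.function([x], (x ** 2).sum())
    return float(sum(f(numpy.random.rand(10).astype('float32'))
                     for _ in range(n_batches)))


if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    with ctx.Pool(2) as pool:
        print(pool.starmap(train_chunk, [('gpu0', 5), ('gpu1', 5)]))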

bartvm commented 9 years ago

Torch does support this pretty much out of the box, but implementing it with Theano is far from straightforward. All the examples I have ever seen involve Theano being imported and the graph constructed in the subprocess, while your example assumes everything is done in the parent process and then somehow transferred. No idea how/whether that will work.

Then there's the parameter sharing between two processes, which will probably need something like what we do for the Fuel server with ZMQ (pickling a few million parameters and sending them over a pipe will be too slow).
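For comparison, shipping a parameter array over ZMQ as a raw buffer (a small JSON header plus the bytes) instead of pickling it looks roughly like the recipe below, adapted from the pyzmq documentation; the helper names are just illustrative and the arrays are assumed to be C-contiguous:

import numpy
import zmq


def send_array(socket, array, flags=0, copy=True, track=False):
    # Small JSON header with dtype/shape, then the raw array bytes.
    header = dict(dtype=str(array.dtype), shape=array.shape)
    socket.send_json(header, flags | zmq.SNDMORE)
    return socket.send(array, flags, copy=copy, track=track)


def recv_array(socket, flags=0, copy=True, track=False):
    header = socket.recv_json(flags=flags)
    msg = socket.recv(flags=flags, copy=copy, track=track)
    array = numpy.frombuffer(memoryview(msg), dtype=header['dtype'])
    return array.reshape(header['shape'])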

An approach that is probably easier to implement, though less seamless for the user, is to start a kind of parameter server from a separate script, and then write an extension that connects to this server and sends/receives parameters. The user would then simply start the parameter server once and run their training script multiple times (potentially on multiple nodes!) to get multi-GPU/multi-node training.
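A bare-bones sketch of that idea, not anything that exists in Blocks today: a stand-alone server script holds the master copy of the parameters and applies every gradient it receives, while a worker-side helper (which could live in a Blocks extension) pushes the local gradient and pulls the updated values. The address, port, learning rate, and shapes are all made up.

import numpy
import zmq


def run_parameter_server(initial_params, learning_rate=0.01, port=5555):
    # Stand-alone script: keeps the master parameters, applies each incoming
    # gradient, and replies with the updated values.
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind('tcp://*:{}'.format(port))
    params = numpy.array(initial_params, dtype='float32')

    while True:
        gradient = numpy.frombuffer(socket.recv(), dtype='float32')
        params -= learning_rate * gradient.reshape(params.shape)
        socket.send(params.tobytes())


def sync_with_server(socket, gradient, shared_params):
    # Worker side, e.g. called from an extension after every batch: send the
    # local gradient, receive the new parameters, and write them into the
    # Theano shared variable holding this worker's copy.
    socket.send(numpy.ascontiguousarray(gradient, dtype='float32').tobytes())
    new_values = numpy.frombuffer(socket.recv(), dtype='float32')
    shared_params.set_value(new_values.reshape(gradient.shape))


if __name__ == '__main__':
    run_parameter_server(numpy.zeros(1000))

Each training process would connect a zmq.REQ socket to the server's address and call something like sync_with_server after every batch; since the server is started once and lives in its own script, several training scripts on different machines could share it.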