mila-iqia / platoon

Multi-GPU mini-framework for Theano
MIT License

LSTM example with all_reduce interface #83

Closed: olimastro closed this 7 years ago

olimastro commented 7 years ago

Rewrote the controller and the worker scripts so they work in synchronous mode with the all_reduce interface.

There are two requests that a worker can make:

So far it does not have a parameter saving or reloading scheme.
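
For anyone skimming, the synchronous sync point amounts to averaging every worker's parameters after their local updates. Here is a rough sketch of that step, using mpi4py in place of Platoon's own controller/worker channel purely for illustration (the parameter shapes and `local_sgd_step` are hypothetical):

```python
# Conceptual sketch only: mpi4py stands in for Platoon's channel here,
# just to show the synchronous all-reduce averaging between updates.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

# Each worker holds its own copy of the model parameters (shapes are made up).
params = [np.random.randn(128, 128).astype('float32'),
          np.random.randn(128).astype('float32')]

def local_sgd_step(params):
    # Placeholder for one local SGD update on this worker's mini-batch;
    # in the real example this is the Theano LSTM update.
    pass

for step in range(100):
    local_sgd_step(params)
    # Synchronous sync point: sum each parameter across workers in place,
    # then divide by the worker count so everyone ends up with the mean.
    for p in params:
        comm.Allreduce(MPI.IN_PLACE, p, op=MPI.SUM)
        p /= n_workers
```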

gh-78

olimastro commented 7 years ago

So now the baseline (one GPU) converges with:

- Valid: 0.133333333333, Test: 0.216, training took 1278.1s in 81 epochs

This platoon script:

- Best error valid: 0.144230769231, Best error test: 0.196, training took 1725.5s in 111 epochs

So in terms of convergence it seems fine, but the time gain is null or even negative.

It is also still a bit rough around the edges; it doesn't terminate properly.

It now uses AverageSGD instead of EASGD.
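
For reference, a rough numpy sketch of the two synchronization rules (this is not Platoon's actual implementation; the EASGD form follows Zhang et al., 2015, and `alpha` here is just an illustrative coupling constant):

```python
import numpy as np

def average_sgd_sync(worker_params):
    """AverageSGD: after local updates, every worker's parameters are
    replaced by the mean across workers (what the all_reduce path computes)."""
    mean = np.mean(worker_params, axis=0)
    return [mean.copy() for _ in worker_params]

def easgd_sync(worker_params, center, alpha=0.5):
    """EASGD: each worker is pulled elastically toward a central copy,
    and the central copy moves toward the workers (sketch, not Platoon's code)."""
    new_workers = []
    new_center = center.copy()
    for w in worker_params:
        diff = w - center              # distance from the old center
        new_workers.append(w - alpha * diff)
        new_center += alpha * diff     # center moves toward the workers
    return new_workers, new_center
```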

lamblin commented 7 years ago

Did you compare against a single node with the same batch size, or a batch size twice as big? The right comparison would be against the larger batch size, so the number of examples seen between updates is the same.
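
(For example: with 2 workers each using a batch size of 16, one synchronous all_reduce update consumes 32 examples, so the fair single-GPU baseline would use a batch size of 32, not 16. The numbers here are only illustrative, not the ones used in this PR.)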

olimastro commented 7 years ago

This code now achieves the following statistics on the DGX1:

The training is still unstable, and I believe it is because the validation frequency is set to 3, so it does validation every 3 epochs. The patience is 10, so it will basically stop about 30 epochs after it found the best validation. I should set the valid frequency to 1, but I think that would lower the time gain because of all these validations.
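
For context, the validFreq/patience interaction in this kind of tutorial-style training loop looks roughly like this (a sketch with hypothetical helpers, not the exact code in this PR):

```python
import random

max_epochs = 200
valid_freq = 3      # validate every 3 epochs (the setting mentioned above)
patience = 10       # tolerated validations without improvement

def train_one_epoch():
    pass  # hypothetical stand-in for one epoch of LSTM training

def compute_valid_error():
    return random.random()  # hypothetical stand-in for the validation error

best_valid_err = float('inf')
bad_counter = 0
for epoch in range(max_epochs):
    train_one_epoch()
    if (epoch + 1) % valid_freq == 0:
        valid_err = compute_valid_error()
        if valid_err < best_valid_err:
            best_valid_err, bad_counter = valid_err, 0
        else:
            bad_counter += 1
            if bad_counter > patience:
                # With valid_freq=3 and patience=10, training stops roughly
                # 30 epochs after the last improvement was seen.
                break
```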

nouiz commented 7 years ago

@lamblin, @ballasn told me that it is good to merge.

thanks!

nouiz commented 7 years ago

I just saw a few small doc fixes. Can you add the timing in the sync..._lstm/README file?

In the main README there are old timings; can you move them to the examples/lstm/README so it is clear which example gives those, and put links in the main README to the 2 examples' READMEs?

thanks