So now the baseline (one GPU) converges with:
- Valid: 0.133333333333, Test: 0.216, training took 1278.1s in 81 epochs

This platoon script:
- Best valid error: 0.144230769231, best test error: 0.196, training took 1725.5s in 111 epochs
So in terms of convergence it seems fine, but there is no time gain; it is actually slightly slower.
It is also still a bit rough around the edges: it doesn't terminate properly.
It now uses AverageSGD instead of EASGD.
Did you compare against a single node with the same batch size, or a batch size twice as big? The right comparison would be against the larger batch size, so the number of examples seen between updates is the same.
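To make that comparison concrete, here is the arithmetic (the worker count and batch size below are illustrative, not the values used in this PR):

```python
# With N synchronous workers each computing a gradient on a mini-batch of
# size B, one global update consumes N * B examples. A fair single-GPU
# baseline should therefore use batch size N * B.
n_workers = 2      # hypothetical worker count
batch_size = 16    # hypothetical per-worker batch size

effective_batch = n_workers * batch_size
print(effective_batch)  # 32 examples seen between updates, same as the cluster
```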
This code now achieves the following statistics on the DGX1:
The training is still unstable, and I believe it is because the validation frequency is set to 3, so it does validation every 3 epochs. The patience is 10, so it will basically stop 30 epochs after it found the best validation. I should set validFreq to 1, but I think that would reduce the time gain because of all these validations.
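For reference, here is a minimal sketch of that patience scheme (the stubs and constants are illustrative, not the actual training script):

```python
import random

valid_freq = 3    # validate every 3 epochs
patience = 10     # stop after 10 validations with no new best
max_epochs = 200  # illustrative cap

def train_one_epoch():
    pass                    # stand-in for the real training loop

def validate():
    return random.random()  # stand-in for computing the validation error

best_valid = float('inf')
bad_validations = 0

for epoch in range(1, max_epochs + 1):
    train_one_epoch()
    if epoch % valid_freq == 0:
        valid_err = validate()
        if valid_err < best_valid:
            best_valid, bad_validations = valid_err, 0
        else:
            bad_validations += 1
            if bad_validations >= patience:
                # with valid_freq=3, this fires ~30 epochs after the best
                break
```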
@lamblin @ballasn told me that it is good to merge.
thanks!
I just saw a few small doc fixes. Can you add the timings in the sync..._lstm/README file?
In the main README, there are old timings; can you move them to the examples/lstm/README so it is clear which example gives those numbers, and put links in the main README to the two examples' READMEs?
thanks
Rewrote the controller and the worker scripts so they work in synchronous mode with the all_reduce interface.
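For anyone reading along, this is the gist of the synchronous AverageSGD step. The sketch below uses mpi4py's all-reduce rather than Platoon's own interface, so the calls are illustrative of the scheme, not of the actual worker script:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

params = np.zeros(1000, dtype='float32')  # this worker's copy of the parameters

def local_sgd_step(p):
    # stand-in for one local SGD update on this worker's mini-batch
    return p - 0.01 * np.random.randn(*p.shape).astype('float32')

for step in range(100):
    params = local_sgd_step(params)
    # Synchronous averaging: sum every worker's parameters with all-reduce,
    # then divide by the worker count, so all copies stay identical.
    summed = np.empty_like(params)
    comm.Allreduce(params, summed, op=MPI.SUM)
    params = summed / n_workers
```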
There are two requests that a worker can make (a sketch of both handlers follows below):
First, each worker loads the full dataset and requests its split from the controller, then slices its dataset according to the received split.
Second, workers send their validation error, which the controller averages and checks against the best so far. Once the patience count runs out, it prints the best valid and matching test results and kills all the workers.
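Here is a hypothetical sketch of the two controller-side handlers; Platoon's real controller API differs, and all names below are illustrative:

```python
import numpy as np

class ControllerSketch(object):
    def __init__(self, n_workers, n_examples, patience=10):
        self.n_workers = n_workers
        self.n_examples = n_examples
        self.patience = patience
        self.best_valid = float('inf')
        self.bad_validations = 0
        self.pending = []  # validation errors received this round

    def handle_split_request(self, worker_id):
        # Request 1: return the slice of example indices this worker should
        # keep out of the dataset it loaded in full.
        bounds = np.linspace(0, self.n_examples, self.n_workers + 1, dtype=int)
        return int(bounds[worker_id]), int(bounds[worker_id + 1])

    def handle_valid_error(self, valid_err):
        # Request 2: average the workers' errors, track the best, and apply
        # patience; returning True tells the workers to stop.
        self.pending.append(valid_err)
        if len(self.pending) < self.n_workers:
            return False
        mean_err = sum(self.pending) / float(self.n_workers)
        self.pending = []
        if mean_err < self.best_valid:
            self.best_valid, self.bad_validations = mean_err, 0
        else:
            self.bad_validations += 1
        return self.bad_validations >= self.patience
```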
So far there is no scheme for saving or reloading the parameters.
gh-78