So now the baseline (one GPU) converges with:
- Valid: 0.133333333333, Test: 0.216, training took 1278.1s in 81 epochs

This platoon script:
- Best valid error: 0.144230769231, best test error: 0.196, training took 1725.5s in 111 epochs
So in terms of convergence it seems fine, but there is no time gain; it is actually slightly slower.
It is also still a bit rough around the edges: it doesn't terminate properly.
It now uses AverageSGD instead of EASGD.
Did you compare against a single node with the same batch size, or a batch size twice as big? The right comparison would be against the larger batch size, so the number of examples seen between updates is the same.
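To make that comparison concrete, here is the arithmetic (the worker count and batch size below are illustrative, not the values used in this PR):

```python
# With N synchronous workers each computing a gradient on a mini-batch of
# size B, one global update consumes N * B examples. A fair single-GPU
# baseline should therefore use batch size N * B.
n_workers = 2      # hypothetical worker count
batch_size = 16    # hypothetical per-worker batch size

effective_batch = n_workers * batch_size
print(effective_batch)  # 32 examples seen between updates, same as the cluster
```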
This code now achieves the following statistics on the DGX1:
The training is still unstable, and I believe it is because the validation frequency is set to 3, so it does validation every 3 epochs. The patience is 10, so it will basically stop 30 epochs after it found the best validation. I should set validFreq to 1, but I think that would reduce the time gain because of all these validations.
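For reference, here is a minimal sketch of that patience scheme (the stubs and constants are illustrative, not the actual training script):

```python
import random

valid_freq = 3    # validate every 3 epochs
patience = 10     # stop after 10 validations with no new best
max_epochs = 200  # illustrative cap

def train_one_epoch():
    pass                    # stand-in for the real training loop

def validate():
    return random.random()  # stand-in for computing the validation error

best_valid = float('inf')
bad_validations = 0

for epoch in range(1, max_epochs + 1):
    train_one_epoch()
    if epoch % valid_freq == 0:
        valid_err = validate()
        if valid_err < best_valid:
            best_valid, bad_validations = valid_err, 0
        else:
            bad_validations += 1
            if bad_validations >= patience:
                # with valid_freq=3, this fires ~30 epochs after the best
                break
```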
@lamblin @ballasn told me that it is good to merge.
thanks!
I just saw a few small doc fixes. Can you add the timings in the sync..._lstm/README file?
In the main README, there are old timings; can you move them to the examples/lstm/README so it is clear which example gives those numbers, and put links in the main README to the two examples' READMEs?
thanks
Rewrote the controller and the worker scripts so they work in synchronous mode with the all_reduce interface.
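For anyone reading along, this is the gist of the synchronous AverageSGD step. The sketch below uses mpi4py's all-reduce rather than Platoon's own interface, so the calls are illustrative of the scheme, not of the actual worker script:

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n_workers = comm.Get_size()

params = np.zeros(1000, dtype='float32')  # this worker's copy of the parameters

def local_sgd_step(p):
    # stand-in for one local SGD update on this worker's mini-batch
    return p - 0.01 * np.random.randn(*p.shape).astype('float32')

for step in range(100):
    params = local_sgd_step(params)
    # Synchronous averaging: sum every worker's parameters with all-reduce,
    # then divide by the worker count, so all copies stay identical.
    summed = np.empty_like(params)
    comm.Allreduce(params, summed, op=MPI.SUM)
    params = summed / n_workers
```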
There are two requests that a worker can make (a sketch of both handlers follows below):
First, each worker loads the full dataset and requests its split from the controller, then slices its dataset according to the received split.
Second, workers send their validation error, which the controller averages and checks against the best so far. Once the patience count runs out, it prints the best valid and matching test results and kills all the workers.
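Here is a hypothetical sketch of the two controller-side handlers; Platoon's real controller API differs, and all names below are illustrative:

```python
import numpy as np

class ControllerSketch(object):
    def __init__(self, n_workers, n_examples, patience=10):
        self.n_workers = n_workers
        self.n_examples = n_examples
        self.patience = patience
        self.best_valid = float('inf')
        self.bad_validations = 0
        self.pending = []  # validation errors received this round

    def handle_split_request(self, worker_id):
        # Request 1: return the slice of example indices this worker should
        # keep out of the dataset it loaded in full.
        bounds = np.linspace(0, self.n_examples, self.n_workers + 1, dtype=int)
        return int(bounds[worker_id]), int(bounds[worker_id + 1])

    def handle_valid_error(self, valid_err):
        # Request 2: average the workers' errors, track the best, and apply
        # patience; returning True tells the workers to stop.
        self.pending.append(valid_err)
        if len(self.pending) < self.n_workers:
            return False
        mean_err = sum(self.pending) / float(self.n_workers)
        self.pending = []
        if mean_err < self.best_valid:
            self.best_valid, self.bad_validations = mean_err, 0
        else:
            self.bad_validations += 1
        return self.bad_validations >= self.patience
```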
So far there is no scheme for saving or reloading the parameters.
gh-78