kevinzakka / recurrent-visual-attention

A PyTorch Implementation of "Recurrent Models of Visual Attention"
MIT License

improve validation accuracy without MC-sampling from 86% to 98.8-99.2% in 200 epochs #32

Closed malashinroman closed 4 years ago

malashinroman commented 4 years ago

The latest version from the repository gave about a 2.5% error rate with Monte Carlo test sampling. I think the Monte Carlo trick shouldn't be used, because it can compensate for a poor attention mechanism, and the attention mechanism is the main point of interest. Without MC sampling, the code gave me a ~14% error rate.
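For illustration, here is a minimal, self-contained sketch of the difference being discussed: evaluating with a single stochastic pass (M=1) versus Monte Carlo averaging over M passes at test time. This is not the repository's code; the `ToyStochasticClassifier` and `evaluate` names are made up for this example, and the toy model only mimics the randomness that sampled glimpse locations introduce in RAM.

```python
# Minimal sketch (not the repository's code) contrasting single-pass evaluation
# (M = 1) with Monte Carlo test-time averaging over M stochastic passes.
import torch
import torch.nn.functional as F


class ToyStochasticClassifier(torch.nn.Module):
    """Stand-in for a RAM-style model whose output depends on sampled locations."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.fc = torch.nn.Linear(28 * 28, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Inject noise to mimic the randomness of sampled glimpse trajectories:
        # repeated forward passes over the same image can disagree.
        noisy = x.flatten(1) + 0.1 * torch.randn_like(x.flatten(1))
        return self.fc(noisy)  # unnormalized log-probabilities


@torch.no_grad()
def evaluate(model, loader, M: int = 1) -> float:
    """Return accuracy; with M > 1, average class probabilities over M passes."""
    correct, total = 0, 0
    for images, labels in loader:
        probs = torch.zeros(images.size(0), model.fc.out_features)
        for _ in range(M):
            probs += F.softmax(model(images), dim=1)
        preds = (probs / M).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total


if __name__ == "__main__":
    model = ToyStochasticClassifier()
    fake_loader = [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))]
    print("M=1 accuracy :", evaluate(model, fake_loader, M=1))
    print("M=10 accuracy:", evaluate(model, fake_loader, M=10))
```

With M > 1, the averaged probabilities can mask locations that the attention policy chose poorly, which is why evaluating with M=1 gives a clearer picture of the attention mechanism itself.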

I made comparatively small changes to the code that allowed reaching a 1.2-0.8% error rate (depending on the random seed) without Monte Carlo sampling (M=1), using six 8x8 glimpses. This is very similar to (and even slightly better than) the accuracy reported in Mnih's paper. When the whole training set is used, the error rate on the test set can drop below 0.7%.
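For reference, a rough summary of the evaluation setup described above (six glimpses, 8x8 patches, no Monte Carlo averaging); the parameter names below are illustrative and may not match the repository's actual config options.

```python
# Hypothetical summary of the reported setup; the repository's actual
# config option names may differ.
reported_setup = {
    "num_glimpses": 6,  # six glimpses per image
    "patch_size": 8,    # 8x8 glimpse patches
    "M": 1,             # one trajectory at test time, i.e. no Monte Carlo averaging
}
```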

Here is a description of the changes I made:

With these changes, the best validation result is achieved after around 200 epochs of training.

kevinzakka commented 4 years ago

Wow, this is super helpful @malashinroman! I haven't touched this repo in years but the changes you've made all seem like super sensible design choices so thanks for that :)

kevinzakka commented 4 years ago

Do you think you could update the README with these results?

malashinroman commented 4 years ago

@kevinzakka, sorry for not replying for a while.

> I haven't touched this repo in years

What I've found is that your repository is the most popular PyTorch implementation of RAM, and to my knowledge not much has been done on hard attention mechanisms since the Mnih et al. paper. I believe people will still find it useful (as I have myself).

I see that you've already pushed changes to the README. Let me know if you need any info from me.