hhhmoan opened 8 years ago
@hhhmoan Hi! Thank you very much for spending time reading the code and pointing out mysterious aspects!
The code actually works WITHOUT the pretraining. I was just a little bit curious to see if pretraining could improve anything. By the way, I just realized that the implementation of the pretraining part is incomplete (it is not training the location network)...
On the other hand, I totally understand your suggestion about the stop_gradient for the location output. I will do some testing on that! We are still actively working on debugging. I will report our attempts to replicate the original results by Mnih et al. ASAP!
Thank you very much for your suggestion again!
^_^ To tell the truth, I failed to implement this paper on the translated cluttered MNIST data in TensorFlow. If you succeed and release your results, it would really help me.
Best wishes for your work!
I'm glad to hear that! We are working on it!
There are some ops in RAM that are not differentiable, and in the paper the reward is used in place of the gradient so that backpropagation is still possible. But in your code, the reward is part of the loss, so the reward acts like a normal loss and TF will compute gradients through the ops that are not differentiable. I see that you use tf.stop_gradient in some places. Does that mean you just stop the gradient of those ops? If so, I don't understand why the model works, because the parameters in the LocNet are frozen.
@Lzc6996 You are right! Our StopGrad operation is incorrect and we still haven't figured out how to resolve this issue.
We failed to notice that because the network still "works". It might be the case that the other layers can adapt to (compensate for) the terrible parameters in the LocNet by tuning their own parameters.
Qihong, have you worked out how to update the parameters in the LocNet yet?
Sorry, not yet. Our code is incorrect for some more fundamental reasons. I am still not exactly sure how to fix it. You can take a look at this repo: https://github.com/zhongwen/RAM
This RAM implementation beats ours on the 28x28 standard MNIST.
Thanks for your kind and prompt reply. Do you think the implementation you mentioned above realizes the parameter updates as described in Mnih's paper?
@JasonZhao001 I am not sure. I plan to replicate Mnih's results with that implementation. I would be more confident if the replication is successful.
@JasonZhao001 I also plan to visualize that implementation with TensorBoard.
Yeah, I found that Zhongwen's implementation does not include checkpointing or summaries (TensorBoard); it would be helpful if those were added.
There is also a question that still confuses me when I try to work with your code:
The "DRAW WINDOW" functions don't work on my machine, even when I set the control parameters to True. I put a "print" in the animate block and it prints during training as expected, but still no window shows. I wonder whether it's a problem only in my case, so could you tell me whether it works on your platform?
@JasonZhao001 That's strange. "draw" should work when you set draw to 1. Can you send me the error message? Thanks!
There is no error message; it just doesn't show the window as it should. So I suspect it may be a problem with my platform.
@JasonZhao001 I see. Let me know if you get more clues about what is going on. I am more than happy to help!
@QihongL Found the reason. Something was wrong with matplotlib on my platform; after I ran $ sudo pip uninstall matplotlib, it worked! It may be because I had installed two versions of matplotlib, and when I installed the second one I set it to ignore the existing one. Thanks a lot!
@JasonZhao001 Great!
@QihongL I have found the error in your gradient implementation. The gradient should flow only via mean_loc, not from samp_loc, because samp_loc gives you the location in the input image from which you should sample the next glimpse, and hence is non-differentiable. But when you define the loss function:
you back-propagate the gradient of the loss through the computation graph that mean_loc is a part of, so you calculate the gradient w.r.t. mean_loc. You don't calculate the gradient of the loss w.r.t. samp_loc.
EDIT So, if you comment out the line:
mean_loc = tf.stop_gradient(mean_loc)
and keep the line:
sample_loc = tf.stop_gradient(sample_loc)
things should work. Let me know if it works for you.
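For anyone following along, here is a minimal sketch of the suggested change, assuming TF1-style graph code; the shapes, the loc_sd value, and the weight names are illustrative, not taken from the repo:

```python
import tensorflow as tf

loc_sd = 0.11                                              # illustrative std-dev of the location policy
hidden_state = tf.placeholder(tf.float32, [None, 256])     # core RNN state (shape illustrative)
W_loc = tf.Variable(tf.truncated_normal([256, 2], stddev=0.1))
b_loc = tf.Variable(tf.zeros([2]))

# mean_loc is the output of the location network: keep its gradient path OPEN
mean_loc = tf.matmul(hidden_state, W_loc) + b_loc
# mean_loc = tf.stop_gradient(mean_loc)    # <- the line to comment out

# sampling is the non-differentiable step, so stop the gradient here only
sample_loc = mean_loc + tf.random_normal(tf.shape(mean_loc), mean=0.0, stddev=loc_sd)
sample_loc = tf.stop_gradient(sample_loc)
```

With the gradient stopped only at sample_loc, the REINFORCE term built from log P(sample_loc | mean_loc) can still update the location network through mean_loc.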
@GodOfProbability Hi! Thank you so much for pointing this out!
I also think this is causing the trouble. I think I tried commenting that line out before and it didn't work. I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).
@QihongL I did an experiment on a toy example, and stopping only the gradient from sample_loc improves the performance, not the other way around.
> I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).
I think it will not propagate the gradient over time, because only sample_loc interacts with the next time step, not mean_loc. Hence if you stop the gradient w.r.t. sample_loc, that is sufficient to stop the "bad" gradient from flowing across time. Furthermore, keep in mind that the mean_loc values are different objects predicted at every time step, and you start the back-propagation by finding the gradient w.r.t. mean_loc, so there is nothing coming from the non-differentiable part.
@GodOfProbability That's very interesting... I will try that! Thank you very much for the suggestion! I will let you know what I find out!
@GodOfProbability @QihongL Suppose we make the assumption that we don't sample but use mean_loc directly, similar to the soft attention in "Show, Attend and Tell". Then the questions are about these parts:
> "bad" gradient from flowing across time.
> then the gradients flow over time (across different glimpses).
@JasonZhao001 If you stop the gradient at sample_loc, the bad gradients will not flow, because only sample_loc interacts across time; by stopping the gradient flowing from sample_loc, you are effectively stopping the across-time gradient from flowing through mean_loc. However, there are gradients coming from the part of the loss that corresponds to the reward function, which should flow through mean_loc. That gradient reaches mean_loc through the log-likelihood of sample_loc (it comes from differentiating the Monte Carlo approximation of the gradient of the reward function). If time permits, you should do the experiments and let us know.
@GodOfProbability Yes, you are right! The parameters of the location-generation module rely on the derivative of log[P(sample_loc|mean_loc,sigma)] w.r.t. parameters_loc for their updates, which in turn goes through the derivative of mean_loc w.r.t. parameters_loc. I will do experiments on this later, try the assumption as well, and report my results then.
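To make that explicit, with the usual fixed-variance Gaussian location policy (sigma is the fixed standard deviation, loc_sd in the code), the relevant term is:

```latex
\log P(\ell_s \mid \ell_\mu, \sigma)
  = -\frac{\lVert \ell_s - \ell_\mu \rVert^2}{2\sigma^2} + \text{const},
\qquad
\frac{\partial}{\partial \ell_\mu} \log P(\ell_s \mid \ell_\mu, \sigma)
  = \frac{\ell_s - \ell_\mu}{\sigma^2}
```

where \ell_s is sample_loc and \ell_\mu is mean_loc; the chain rule then carries this through the derivative of mean_loc w.r.t. parameters_loc.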
@QihongL @GodOfProbability @jlindsey15 @Lzc6996 It works well when I comment out the line mean_loc = tf.stop_gradient(mean_loc), as Gopal described above. With bandwidth = 12, it converges to more than 96% accuracy at 600k time steps (early stop). I'm fairly sure it can reach Mnih's result with some tuning of the training parameters. By the way, if you do stop_gradient at mean_loc, TensorBoard shows that the parameters there never update during training.

I also have a possible explanation for why it still works in that case: the attention window with three scales, especially the 2nd and 3rd ones, can cover enough information (48 w.r.t. 60), and the bandwidth is large enough to recognize the slightly lower resolution of the 2nd and 3rd scales (1/2 and 1/4 respectively). So the result relies on the fully connected layers for classification, and it is effectively the same as recognition from fixed-location glimpses. You can try using only two scales (e.g. 12 and 24) or making the bandwidth smaller (e.g. 8 or 6); then it will not work so well. The same holds for the implementation at https://github.com/zhongwen/RAM Moreover, once you fix the problem, making the bandwidth smaller (e.g. 8) performs better, with higher accuracy and faster convergence. In my experiment it converged to 97% at 400k time steps (so I made an early stop).

Furthermore, I found that this original implementation does not apply the M-times sampling. I plan to try it, and any suggestions from you would help a lot! If I succeed, I will share the source code as well. Thanks!
@JasonZhao001 @GodOfProbability Hi Jason, I read your comment with great interest. I also solved the stop_gradient problem, but I cannot achieve performance as high as yours (in my case, about 94% in the translated case). Could you share your hyperparameters and learning strategies? Moreover, apart from the mean_loc issue, do you think the baseline in this code is correctly implemented? I think the baseline should also be learnable. Please give me your opinion! Thanks!
@jtkim-kaist The baseline technique is very important for location prediction. It is learnable via an extra term of the cost function, as shown in the source code below:
J = J - tf.reduce_sum(tf.square(R - b), 1)
Note that the parameters of the baseline part are learned separately from the other two parts. Of course, I have modified some of the hyperparameters to make it work better, but I know it's not the best yet; I'm still trying. If I succeed, I will post my implementation later.
@JasonZhao001 Thank you for your kind comment.
I also agree with you. However, in this code, the baseline is implemented as below,
baseline = tf.sigmoid(tf.matmul(hiddensState, Wb_h_b)+Bb_h_b)
and both Wb_h_b and Bb_h_b seem unable to learn due to the stop_gradient function.
When the stop_gradient function is off, the baseline depends on hidden_state, so it doesn't seem right, as you said (note that the parameters of the baseline part should be learned separately from the other two parts).
So, I think "baseline = tf.sigmoid(some variable independent of the model)" is more appropriate.
Please give me your opinion, thanks! (I'm also working on it.)
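A tiny sketch of the input-independent variant being suggested (the variable names are made up for illustration, not from the repo):

```python
import tensorflow as tf

# a baseline that does not depend on the hidden state: a single trainable scalar
baseline_var = tf.Variable(tf.zeros([1]), name="baseline_var")
baseline = tf.sigmoid(baseline_var)

# R stands in for the reward tensor; the baseline would then be trained only
# through a squared-error term like this, with tf.stop_gradient(baseline)
# used wherever it appears inside the REINFORCE term
R = tf.placeholder(tf.float32, [None, 1])
baseline_loss = tf.reduce_mean(tf.square(R - baseline))
```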
@jtkim-kaist
b shouldn't be updated from this term:
J = tf.concat(1, [tf.log(p_y + SMALL_NUM) * (onehot_labels_placeholder), tf.log(p_loc + SMALL_NUM) * (R - no_grad_b)])
where no_grad_b = tf.stop_gradient(b) prevents it from updating.
While b is updated via this term:
J = J - tf.reduce_sum(tf.square(R - b), 1)
where the gradient through b is not stopped.
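To spell out why that works, here is a stripped-down sketch of the two terms (the shapes and the scalar baseline are simplifications, not the repo's actual code):

```python
import tensorflow as tf

R = tf.placeholder(tf.float32, [None, 1])          # reward per example
log_p_loc = tf.placeholder(tf.float32, [None, 1])  # log-likelihood of the sampled locations
b = tf.sigmoid(tf.Variable(tf.zeros([1])))         # baseline (a single trainable scalar here)

no_grad_b = tf.stop_gradient(b)
# REINFORCE term: trains the location network; no_grad_b keeps it from updating b
reinforce_term = log_p_loc * (R - no_grad_b)
# baseline term: the only path through which b receives a gradient
baseline_term = -tf.square(R - b)

# J is maximized, so the cost to minimize is its negative
cost = -tf.reduce_mean(reinforce_term + baseline_term)
```

Running tf.gradients(cost, [b]) confirms that the only contribution to b's gradient comes from baseline_term, since the REINFORCE term sees only the stopped no_grad_b.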
@JasonZhao001 Thank you! I missed that part.
I look forward to your implementation. Have a good day!
@jtkim-kaist You are welcome :)
Hi folks, thank you so much for thinking about these issues! I really appreciate it! I have been traveling for PhD interviews recently, so I haven't had time to work on this project for a while. lol
Hi, sorry I haven't been active here either! It's a bit unclear to me whether the issue has been resolved or not -- what's the current status?
Hi guys, I'm working on another conference paper whose submission deadline is close (3/17), so I have no time for this right now. But after that I plan to clean up the code and upload it; a two-layer LSTM together with a CNN inside might be uploaded as well. By the way, the best test accuracy is 97.8% so far.
Hi, I uploaded my code, which addresses this issue, to my repository.
If you have some time, please visit https://github.com/jtkim-kaist/ram_modified. Thank you.
@jtkim-kaist Really appreciate it! I will look at your code ASAP!
You still have stop_gradient in that line. It must be causing problems.
@GodOfProbability Your comment is excellent. I am a Chinese PhD student struggling with the visual attention model.
> The gradient should flow only via mean_loc, not from samp_loc, because samp_loc gives you the location in the input image from which you should sample the next glimpse, and hence is non-differentiable.
sample_loc = tf.maximum(-1.0, tf.minimum(1.0, mean_loc + tf.random_normal(mean_loc.get_shape(), 0, loc_sd)))
Why do you say samp_loc must have stop_gradient??? I think that not passing the gradient through samp_loc means the network still doesn't learn how to put the sensor in the right place? @jtkim-kaist why do you say the baseline must have stop_gradient? I saw your code; the baseline is wrapped in tf.stop_gradient. Why?
@sharpstill Here are my answers to your questions:
@GodOfProbability My 2nd question follows from the 1st: if you STOP the gradient from sample_loc, then how does the network finally learn where the right place to extract the glimpse window is? Is it through back-propagation from mean_loc? Isn't sample_loc made from mean_loc? I have read the paper, but I don't understand how the REINFORCE algorithm works. It looks the same as standard back-propagation in deep learning, doesn't it? PS: Do you have correct program source code??? I have struggled with it for a long time; I find TensorFlow too hard to debug!
mean_loc is the mean of the Gaussian from which you sample samp_loc. So mean_loc makes samp_loc. For now you can treat the gradient as flowing from mean_loc. As I described in my previous comment, to train the policy network (which is essentially the network predicting mean_loc):
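The quantity being referred to is presumably the standard REINFORCE surrogate objective; roughly:

```latex
J_{\text{loc}}
  = \log \pi\!\left(\text{samp\_loc} \mid \text{mean\_loc}, \sigma\right) \cdot R,
\qquad
\pi(\cdot \mid \text{mean\_loc}, \sigma) = \mathcal{N}(\text{mean\_loc}, \sigma^2 I)
```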
you need to maximize the above quantity. Intuitively, if you get a positive reward and you sampled samp_loc close to mean_loc, you should encourage that situation more; and if you get a negative reward and still sampled samp_loc close to mean_loc, you should discourage it.
In general, you differentiate the above loss w.r.t. mean_loc only and don't differentiate w.r.t. samp_loc. If you just remove the stop_gradient from mean_loc and keep the stop_gradient at samp_loc, you should get better performance.
I plan to make a tutorial on the REINFORCE and also write this code from scratch in PyTorch after my college exams.
@GodOfProbability Thank you very much for your help; it's very kind of you. If you write a tutorial on REINFORCE, could you notify me by email? My email is sharpstill@163.com. What is your email or other way to contact you? We Chinese use the mobile app WeChat (an instant messaging tool). Do you use WeChat? Could I add you as a friend?
@sharpstill You can contact me on gopalsharma@cs.umass.edu. I am afraid, I don't use WeChat.
@GodOfProbability @jtkim-kaist
I downloaded the modified code from https://github.com/jtkim-kaist/ram_modified
This code seems to correct the issue we mentioned.
Did you try to show the image of the last location window?
I want to use this method to find the most discriminative part of the image.
I put the MNIST digit at a random location in a 100 x 100 image (using numpy's pad function), and I want to see whether the RAM model is able to find the right location.
But below is the last sample location (after 6,000 iterations); the red bounding box is the window. It still looks like a somewhat random position??? (PS: Maybe it is only the last glimpse location, and the whole sequence may influence the classification performance?)
Figure 1: last sample location (after 6,000 iterations); the red bounding box is the glimpse window
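For reference, a minimal sketch of the random placement described above, using only numpy (the function name is made up; the 100 x 100 size is from the comment):

```python
import numpy as np

def translate_digit(digit28, canvas=100, rng=np.random):
    """Place a 28x28 MNIST digit at a random position on a canvas x canvas zero image."""
    top = rng.randint(0, canvas - 28 + 1)
    left = rng.randint(0, canvas - 28 + 1)
    # asymmetric pad widths position the digit at (top, left)
    return np.pad(digit28,
                  ((top, canvas - 28 - top), (left, canvas - 28 - left)),
                  mode='constant')
```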
First of all, big thanks to all of you who provided very insightful comments!
In response to @sharpstill's comment above: it seems that the training is not long enough or the learning rate isn't appropriate, since I was able to get <10% error even before @jtkim-kaist's improvement. Nothing can be said conclusively without seeing the learning curve, though.
@jtkim-kaist: Have you tried replicating Mnih's results on the translated MNIST task (either 60x60 or 100x100)? If I remember correctly, you were able to get 2% error on the untranslated task, whereas the original paper reports 1.29% (with six 8x8 single-scale glimpses). Do you think this error difference is completely due to hyperparameter tuning or the choice of optimization method? I think the original paper used vanilla batch stochastic gradient descent, so alternatives such as ADAM shouldn't do worse.
@QihongL My result is about 2% error in 'translated mode' (I used your 'translated' mode, and I remember that mode constructs 60x60 MNIST). Note that I didn't fully train my model, so it could achieve higher accuracy. If not, I think the differences come from the optimization method (e.g. the initial learning rate is not mentioned in the paper, so we have to find it) and the dataset (definitely, the translated version of MNIST is affected by the random seed).
@sharpstill Your 'PS' is right. There is no guarantee in this model that the last glimpse sees the MNIST digit, but with training, the probability that the whole sequence sees the digit becomes high.
@jtkim-kaist I see. That sounds great! I am finally about to finish my crazy semester and I hope to have a look at your code ASAP.
Thank you very much!
@Hippogriff @QihongL @jtkim-kaist I am so glad to see that you have helped me a lot! A few questions about the REINFORCE algorithm still confuse me. Below is the RAM paper's gradient of the loss function:
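(The gradient in question, as given in Mnih et al. and reproduced here from memory, so double-check against the paper, is approximately:)

```latex
\nabla_\theta J(\theta)
  \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T}
    \nabla_\theta \log \pi\!\left(u_t^{\,i} \mid s_{1:t}^{\,i};\, \theta\right)
    \left(R^{\,i} - b_t\right)
```

where \pi denotes the policy, u_t^i the sampled action, s_{1:t}^i the interaction history, R^i the episode reward, and b_t the baseline.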
What does \pi become here? And the original REINFORCE paper, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning", gives the formula below:
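(Again from memory, so worth double-checking against Williams' paper, the update rule is:)

```latex
\Delta w_{ij} = \alpha_{ij}\,\bigl(r - b_{ij}\bigr)\, e_{ij},
\qquad
e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}}
```

where g_i is the probability (density) of unit i's output and e_{ij} is called the characteristic eligibility.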
Is the eligibility in this formula the same as tf.log(p_loc + SMALL_NUM) in the source code? What does eligibility mean? I still don't fully understand the algorithm. By the way, @Hippogriff, I could not find your formula in the paper. @Hippogriff, if you have written the tutorial about the REINFORCE algorithm, please tell me. Thank you all very much!
By the way, if I want to find multiple digits in an image, how do I modify the loss function?
First, in the function calc_reward, when you compute J, you use p_loc made from mean_locs and sample_locs, but both mean_locs and sample_locs have stop_gradient applied. So I think tf.log(p_loc + SMALL_NUM) * (R - no_grad_b) has no effect when computing the gradients. Also, why does this need pretraining? I never found that method in the paper.
Thanks for releasing your code. Can you clear up my doubts? And have you finished the experiment on the translated cluttered MNIST data at 100 x 100? If you have, please @ me. Thanks.