Question about the gradient and pretraining

hhhmoan commented 8 years ago

First, in the function calc_reward, when you compute J you use p_loc, which is built from mean_locs and sample_locs, but both mean_locs and sample_locs are wrapped in stop_gradient. So I think tf.log(p_loc + SMALL_NUM) * (R - no_grad_b) contributes nothing when the gradients are computed. Also, why is pretraining needed? I never found this method in the paper.

Thanks for releasing your code. Can you resolve my doubts? And have you finished this experiment on the translated cluttered MNIST data (100 x 100)? If you have, please @ me. Thanks.

qihongl commented 8 years ago

@hhhmoan Hi! Thank you very much for spending time reading the code and pointing out mysterious aspects!

The code actually works WITHOUT the pretraining. I was just a little bit curious to see if pretraining can improve anything. By the way, I just realized that the implementation of the pretraining part is incomplete (it is not training the location network)...

On the other hand, I totally understand your suggestion about the stop_gradient for the location output. I will do some testing about that! We are still actively working on debugging. I will report our attempts to replicate the original results by Mnih et al. ASAP!

Thank you very much for your suggestion again!

hhhmoan commented 8 years ago

^_^ To tell the truth, I failed to implement this paper on the translated cluttered MNIST data in TensorFlow. If you succeed and release your results, it would really help me.

My best wishes for your work.

qihongl commented 8 years ago

I'm glad to hear that! We are working on it!

Lzc6996 commented 7 years ago

Some ops in RAM are not differentiable, and the paper uses the reward in place of the gradient so that backpropagation can be done. But in the code your reward is in the loss, so the reward acts just like a normal loss and TF will compute the gradient, including through the ops that are not differentiable. I see that you use some tf.stop_gradient calls. Does that mean you just stop the gradient at these ops? If so, I don't know why the model works, because the params in the LocNet are frozen.

qihongl commented 7 years ago

@Lzc6996 You are right! Our StopGrad operation is incorrect and we still haven't figured out how to resolve this issue.

We failed to realize that because the network still "works". It might be the case that other layers can adapt (by tuning their parameters) to the terrible parameters in the LocNet.

JasonZhao001 commented 7 years ago

Qihong, have you worked out how to update the parameters in LocNet yet?

qihongl commented 7 years ago

Sorry, not yet. Our code is incorrect for some more fundamental reasons. I am still not exactly sure how to fix it. You can take a look at this repo: https://github.com/zhongwen/RAM

This RAM implementation beats ours on the 28x28 standard MNIST.

JasonZhao001 commented 7 years ago

Thanks for your kind and prompt reply. Do you think the implementation you mentioned above performs the parameter updates as Mnih's paper describes?

qihongl commented 7 years ago

@JasonZhao001 I am not sure. I plan to replicate Mnih's results with that implementation. I would be more certain if the replication is successful.

qihongl commented 7 years ago

@JasonZhao001 And I also plan to visualize that implementation with TensorBoard.

JasonZhao001 commented 7 years ago

Yeah, I found that Zhongwen's work leaves out checkpoints and summaries (TensorBoard); it would be helpful if they were added.
There is also a question that still confuses me when I try to work with your code: the "DRAW WINDOW" functions don't work on my machine, even though I have set the control parameters to True. I put a "print" in the animate block, and it prints as expected during training, but still no window shows up. I wonder whether the problem occurs only in my case, so can you tell me if it works on your platform now?

qihongl commented 7 years ago

@JasonZhao001 That's strange. "draw" should work when you set draw to 1. Can you send me the error message? Thanks!

JasonZhao001 commented 7 years ago

There is no error message; the window just doesn't show up as it should. So I suspect it may be a problem with my platform.

qihongl commented 7 years ago

@JasonZhao001 I see. Let me know if you get more clues about what is going on. I am more than happy to help!

JasonZhao001 commented 7 years ago

@QihongL I found the reason. Something was wrong with matplotlib on my platform; after I ran sudo pip uninstall matplotlib, it worked! It may be because I had installed two versions of matplotlib, and when I installed the second one, I chose to ignore the existing one. Thanks a lot!

qihongl commented 7 years ago

@JasonZhao001 Great!

Hippogriff commented 7 years ago

@QihongL I have found the error in your gradient implementation. The gradient should flow only via mean_loc, not from samp_loc, because samp_loc gives you the location in the input image from which you sample the next glimpse, and hence it is non-differentiable. But when you define the loss function:

[equation image not preserved]

you back-propagate the gradient of the loss through the computation graph that mean_loc is part of, hence you calculate the gradient w.r.t. mean_loc. You don't calculate the gradient of the loss w.r.t. samp_loc.

EDIT: So, if you comment out the line:

    mean_loc = tf.stop_gradient(mean_loc)

and keep the line:

    sample_loc = tf.stop_gradient(sample_loc)

things should work. Let me know if it works for you.
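A minimal sketch of the gradient this change lets through, assuming a one-dimensional Gaussian location policy with a fixed standard deviation. The function names and the learning rate are hypothetical, not taken from the repository. With the sampled location and the baseline held constant, the term log p(sample | mean) * (R - b) still depends on the mean, and its derivative is (sample - mean) / sd^2 * (R - b):

```python
def reinforce_grad_wrt_mean(sample, mean, sd, reward, baseline):
    """Gradient of log N(sample; mean, sd^2) * (reward - baseline) w.r.t. mean.

    `sample` and `baseline` are treated as constants, mirroring
    tf.stop_gradient(sample_loc) and no_grad_b in the thread.
    """
    return (sample - mean) / (sd * sd) * (reward - baseline)


def ascent_step(sample, mean, sd, reward, baseline, lr=0.01):
    """One gradient-ascent step on the mean (lr is a hypothetical value)."""
    grad = reinforce_grad_wrt_mean(sample, mean, sd, reward, baseline)
    return mean + lr * grad


if __name__ == "__main__":
    # Reward above the baseline: the mean moves toward the sampled location.
    print(ascent_step(sample=0.3, mean=0.1, sd=0.2, reward=1.0, baseline=0.5))
    # Reward below the baseline: the mean moves away from it.
    print(ascent_step(sample=0.3, mean=0.1, sd=0.2, reward=0.0, baseline=0.5))
```

If mean_loc is also wrapped in stop_gradient, this derivative is never computed, which is consistent with the location network not being trained.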

qihongl commented 7 years ago

@GodOfProbability Hi! Thank you so much for pointing this out!

I also think this is causing the trouble. I think I tried commenting that line out before and it didn't work. I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).

Hippogriff commented 7 years ago

@QihongL I did an experiment on a toy example, and stopping only the gradient from sample_loc improves the performance, not the other way around.

"I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses)."

I think it will not propagate the gradient over time, because only sample_loc interacts with the next time step, not mean_loc. Hence stopping the gradient w.r.t. sample_loc is sufficient to stop the "bad" gradient from flowing across time. Furthermore, keep in mind that the mean_loc values are different objects predicted at every time step, and you start the back-propagation by finding the gradient w.r.t. mean_loc, hence nothing comes from the non-differentiable part.

qihongl commented 7 years ago

@GodOfProbability That's very interesting... I will try that! Thank you very much for the suggestion! I will let you know what I find out!

JasonZhao001 commented 7 years ago

@GodOfProbability @QihongL Suppose we don't sample but use mean_loc directly, similar to the soft attention in "Show, Attend and Tell". Then the questions are:

Do you think it will work well?
Of course, the gradient of mean_loc will flow across time in this case. Would this kind of gradient be "bad", as you said ("bad" gradient from flowing across time; "the gradients flow over time (across different glimpses)")?

Hippogriff commented 7 years ago

@JasonZhao001 If you stop the gradient at sample_loc, the bad gradients will not flow, because only sample_loc interacts across time, and if you stop the gradient flowing from sample_loc, you are actually stopping the gradient (across time) from flowing through mean_loc. However, there are gradients coming from the part of the loss function that corresponds to the reward, which should flow through mean_loc. The gradient should flow through mean_loc from sample_loc (this gradient comes from differentiating the Monte Carlo approximation of the gradient of the reward function). If time permits, you should do the experiments and let us know.

JasonZhao001 commented 7 years ago

@GodOfProbability Yes, you are right! The parameters of the location generation module rely on the derivative of log[P(sample_loc|mean_loc,sigma)] w.r.t. parameters_loc to update, which comes down to the derivative of mean_loc w.r.t. parameters_loc. I will do experiments on it later, and I will try the assumption as well. I will report my results then.

JasonZhao001 commented 7 years ago

@QihongL @GodOfProbability @jlindsey15 @Lzc6996 It works well when I comment out the line mean_loc = tf.stop_gradient(mean_loc), as Gopal described above. With bandwidth = 12, it converges to more than 96% accuracy at 600k time steps (early stop). I'm sure it will reach Mnih's result with some tuning of the training parameters. By the way, if you stop_gradient at mean_loc, TensorBoard shows that the parameters there never update during training.

I also have a possible explanation for why it still works in that implementation. The attention window with three scales, especially the 2nd and 3rd ones, can cover enough information (48 w.r.t. 60), and the bandwidth is large enough to recognize the somewhat lower resolution of the 2nd and 3rd scales (1/2 and 1/4 times respectively). So the result relies on the fully connected layers for classification, and it is effectively the same as recognition from fixed-location glimpses. You can try using only two scales (e.g. 12 and 24) or making the bandwidth smaller (e.g. 8 or 6); then it will not work so well. The same holds for the work in https://github.com/zhongwen/RAM

Moreover, once the problem is fixed, making the bandwidth smaller (e.g. 8) gives higher accuracy and faster convergence. In my experiment it converged to 97% at 400k time steps (so I stopped early)! Furthermore, I found that this original implementation does not apply the M-times sampling. I plan to try it, and any suggestions from you will help a lot! If I succeed, I will share the source code as well. Thanks!

jtkim-kaist commented 7 years ago

@JasonZhao001 @GodOfProbability Hi Jason, I read your comment with great interest. I also fixed the stop_gradient problem, but cannot achieve performance as high as yours (in my case, about 94% in the translated case). Could you share your hyperparameters and learning strategies? Moreover, apart from the mean_loc issue, do you think the baseline in this code is correctly implemented? I think the baseline should also be learnable. Please give me your opinion! Thanks!

JasonZhao001 commented 7 years ago

@jtkim-kaist The baseline technique is very important for location prediction. It is learnable as an extra term of the cost function, as shown in the source code below:

    J = J - tf.reduce_sum(tf.square(R - b), 1)

Note that the parameters of the baseline part are learnt separately from the other two parts. Of course, I have modified some of the hyperparameters to make it work better, but I know it's not the best yet; I'm still trying. If I succeed, I will post my implementation later.

jtkim-kaist commented 7 years ago

@JasonZhao001 Thank you for your kind comment.

I also agree with you. However, in this code, the baseline is implemented as follows:

    baseline = tf.sigmoid(tf.matmul(hiddensState, Wb_h_b)+Bb_h_b)

and it seems that neither Wb_h_b nor Bb_h_b can learn, because of the stop_gradient function.

When the stop_gradient function is off, the baseline depends on hidden_state, which does not seem right given what you said ("Note that the parameters of the baseline part are learnt separately from the other two parts.").

So I think "baseline = tf.sigmoid(some variable independent of the model)" is more appropriate.

Please give me your opinion, thanks! (My work on this is also in progress.)

JasonZhao001 commented 7 years ago

@jtkim-kaist b shouldn't be updated from this term:

    J = tf.concat(1, [tf.log(p_y + SMALL_NUM) * (onehot_labels_placeholder), tf.log(p_loc + SMALL_NUM) * (R - no_grad_b)])

where no_grad_b = tf.stop_gradient(b) prevents it from being updated.

Instead, b is updated via this term:

    J = J - tf.reduce_sum(tf.square(R - b), 1)

where the gradient of b is not stopped.
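The two roles of b described here can be sketched as follows. This assumes a single scalar baseline and plain gradient ascent on J; in the repository the baseline is a function of the hidden state, and the learning rate and epoch count below are hypothetical. The squared-error term is the only path that moves b, and it pulls b toward the average reward, while the REINFORCE weight sees b only as a constant:

```python
def baseline_grad(reward, b):
    """d/db of the term -(reward - b)**2 in J: the only path that updates b."""
    return 2.0 * (reward - b)


def fit_baseline(rewards, b=0.0, lr=0.05, epochs=200):
    """Gradient ascent on -(R - b)^2, visiting the rewards one at a time."""
    for _ in range(epochs):
        for r in rewards:
            b += lr * baseline_grad(r, b)
    return b


def reinforce_weight(reward, b):
    """(R - no_grad_b): b enters as a constant, so no gradient reaches it here."""
    return reward - b


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 1.0]  # hypothetical per-episode rewards
    b = fit_baseline(rewards)
    print(b)                                     # settles near the mean reward
    print([reinforce_weight(r, b) for r in rewards])
```

With a constant step size the fitted value hovers around the mean reward rather than landing on it exactly.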

jtkim-kaist commented 7 years ago

@JasonZhao001 Thank you! I missed that part.

I look forward to your implementation. Have a good day!

JasonZhao001 commented 7 years ago

@jtkim-kaist You are welcome :)

qihongl commented 7 years ago

Hi, folks, thank you so much for thinking about these issues! I really appreciate it! I have been traveling for PhD interviews recently, so I haven't had time to work on this project for a while. lol

jlindsey15 commented 7 years ago

Hi, sorry I haven't been active here either! It's a bit unclear to me whether the issue has been resolved or not -- what's the current status?

JasonZhao001 commented 7 years ago

Hi guys, I'm working on another conference paper that is close to its submission deadline (3.17), so I have no time for this now. After that, I plan to clean up the code and upload it; a two-layer LSTM together with the CNN inside might be uploaded as well. By the way, the best test accuracy so far is 97.8%.

jtkim-kaist commented 7 years ago

Hi, I have uploaded my code, with this issue fixed, to my repository.

If you have some time, please visit https://github.com/jtkim-kaist/ram_modified. Thank you.

qihongl commented 7 years ago

@jtkim-kaist Really appreciate it! I will look at your code ASAP!

Hippogriff commented 7 years ago

You still have stop_gradient in the line. It must be causing a problem.

machanic commented 7 years ago

@GodOfProbability Your comment is excellent. I am a Chinese PhD student struggling with visual attention models.

Why do you say the gradient flows through time (through sample_loc), and why do you call it a "bad" gradient?

"The gradient should flow only via mean_loc not from samp_loc, because the samp_loc gives you the location in the input image from where you should be sampling the next image, and hence become non-differential."

I don't quite understand the author's error. Isn't samp_loc calculated from mean_loc? You see, in the get_next_input function:

    sample_loc = tf.maximum(-1.0, tf.minimum(1.0, mean_loc + tf.random_normal(mean_loc.get_shape(), 0, loc_sd)))

Why do you say samp_loc must have stop_gradient applied? I think that not passing the gradient through samp_loc means the network still doesn't learn how to put the sensor in the right place?

machanic commented 7 years ago

@jtkim-kaist Why do you say the baseline must have stop_gradient applied? I saw in your code that the baseline is wrapped in tf.stop_gradient. Why?

Hippogriff commented 7 years ago

@sharpstill Here are my answers to your questions:

Imagine you sample r from a normal distribution such that r ~ N(\mu, \sigma), and then you define a function f(r). Finally you want to find the gradient of f(r) w.r.t. \mu and \sigma. You can't do it, because you can't back-propagate through the sampled value. The same is happening here: you can't back-propagate through samp_loc, which plays the role of the r I described. Hence, you use the REINFORCE algorithm, which uses the likelihood ratio (please refer to the RAM paper for more detail).

Could you rephrase your second question? I am not able to understand it.
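The likelihood-ratio idea described in this comment can be sketched with a small Monte Carlo experiment. This is an illustration under simple assumptions (a one-dimensional Gaussian, a hand-picked test function, hypothetical function names), not the RAM training code. The estimator never differentiates f or the sampling step; it only needs d/d(mu) log p(r; mu, sigma) = (r - mu) / sigma^2:

```python
import random


def score_function_grad(f, mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of d/d(mu) E[f(r)] for r ~ N(mu, sigma^2).

    Uses the likelihood-ratio identity
        d/d(mu) E[f(r)] = E[f(r) * d/d(mu) log p(r; mu, sigma)]
    with d/d(mu) log p(r; mu, sigma) = (r - mu) / sigma^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        r = rng.gauss(mu, sigma)  # sampling step: never differentiated
        total += f(r) * (r - mu) / (sigma * sigma)
    return total / n_samples


if __name__ == "__main__":
    # For f(r) = r^2, E[f(r)] = mu^2 + sigma^2, so the exact gradient is 2 * mu.
    estimate = score_function_grad(lambda r: r * r, mu=1.0, sigma=0.5)
    print("estimate:", estimate, "exact:", 2.0 * 1.0)
```

The estimate is noisy, which is one reason a baseline is usually subtracted from the reward in practice.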

machanic commented 7 years ago

@GodOfProbability My second question follows from the first: if you stop the gradient from sample_loc, then how does the network eventually learn where the right place to extract the glimpse window is? Is it from back-propagation through mean_loc? Isn't sample_loc made from mean_loc? I have read the paper, but I don't understand how the REINFORCE algorithm works. It looks the same as standard back-propagation in deep learning, doesn't it? PS: Do you have correct program source code? I have struggled with it for a long time, and I find TensorFlow too hard to debug!

Hippogriff commented 7 years ago

mean_loc is the mean of the Gaussian from which you sample samp_loc. So that implies mean_loc makes samp_loc. For now you can treat the gradient as flowing from mean_loc. As I described in my previous comment, to train the policy network (that is, essentially the network predicting mean_loc):

[equation image not preserved]

you need to maximize the above quantity. Intuitively, in the above equation, if you get a positive reward and you sampled samp_loc close to mean_loc, you should encourage that situation more, and if you get a negative reward and you still sampled samp_loc close to mean_loc, you should discourage it.

In general, you differentiate the above loss w.r.t. mean_loc only and don't differentiate w.r.t. samp_loc. If you just remove the stop_gradient from mean_loc and keep the stop_gradient at samp_loc, you should get better performance.
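One way to see why the stop_gradient on samp_loc matters is a finite-difference check. The sketch below assumes samp_loc = mean_loc + noise with no clipping (the repository additionally clips to [-1, 1]), and all names are hypothetical. If the sample is frozen, the log-probability has a nonzero slope in the mean; if the sample is allowed to move together with the mean, the difference sample - mean is just the noise, the log-probability no longer depends on the mean, and the learning signal vanishes:

```python
import math


def log_prob(sample, mean, sd):
    """Log density of N(mean, sd^2) evaluated at `sample`."""
    z = (sample - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def grad_sample_stopped(mean, noise, sd, eps=1e-6):
    """Sample computed once and frozen, as with stop_gradient on samp_loc."""
    sample = mean + noise
    up = log_prob(sample, mean + eps, sd)
    down = log_prob(sample, mean - eps, sd)
    return (up - down) / (2.0 * eps)


def grad_sample_not_stopped(mean, noise, sd, eps=1e-6):
    """Sample recomputed from the perturbed mean, so it moves with the mean."""
    up = log_prob(mean + eps + noise, mean + eps, sd)
    down = log_prob(mean - eps + noise, mean - eps, sd)
    return (up - down) / (2.0 * eps)


if __name__ == "__main__":
    mean, noise, sd = 0.1, 0.2, 0.2  # hypothetical values
    print(grad_sample_stopped(mean, noise, sd))      # close to noise / sd**2
    print(grad_sample_not_stopped(mean, noise, sd))  # close to zero
```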

I plan to make a tutorial on the REINFORCE and also write this code from scratch in PyTorch after my college exams.

machanic commented 7 years ago

@GodOfProbability Thank you very much for your help. It's very kind of you. If you write a tutorial on REINFORCE, can you notify me by email? My email is sharpstill@163.com. What is your email or other contact? We Chinese use the mobile app WeChat (an instant messaging tool). Do you use WeChat? Can I add you as a friend?

Hippogriff commented 7 years ago

@sharpstill You can contact me on gopalsharma@cs.umass.edu. I am afraid, I don't use WeChat.

machanic commented 7 years ago

@GodOfProbability @jtkim-kaist I downloaded the modified code from https://github.com/jtkim-kaist/ram_modified and it seems to correct the issue we discussed. Did you try to show the image of the last location window?
I want to use this method to find the most discriminative part of the image. I put the MNIST digit at a random location in a 100 x 100 image (using numpy's pad function), and I want to see whether the RAM model is able to find the right location. But below is the last sample location (after 6000 iterations), where the red bounding box is the window. It still looks like a somewhat random position. (PS: Maybe it is only the last glimpse location, and the whole sequence influences the classification performance?)

Figure 1: last sample location (after 6000 iterations); the red bounding box is the glimpse window.

qihongl commented 7 years ago

First of all, big thanks to all of you who provided very insightful comments!

In response to @sharpstill's comment above: it seems that the training is not long enough or the learning rate isn't appropriate, since I was able to get <10% error even before @jtkim-kaist's improvement. Nothing can be said conclusively without seeing the learning curve, though.

@jtkim-kaist: Have you tried replicating Mnih's results on the translated MNIST task (either 60x60 or 100x100)? If I remember correctly, you were able to get 2% error on the untranslated task, whereas the original paper reports 1.29% (with six 8x8 single scaled glimpses). Do you think this error difference is completely due to hyperparameter tuning or the choice of optimization method? I think the original paper used vanilla batch stochastic gradient descent, so some alternatives, such as ADAM, shouldn't do worse.

jtkim-kaist commented 7 years ago

@QihongL My result shows about 2% error in 'translated' mode (I used your 'translated' mode, and as I remember that mode constructs 60x60 MNIST). Note that I didn't fully train my model, so it may be able to achieve higher accuracy. If not, I think the differences come from the optimization method (e.g. the initial learning rate is not mentioned in the paper, so we have to find it) and the dataset (the translated version of MNIST is definitely affected by the random seed).

jtkim-kaist commented 7 years ago

@sharpstill Your 'PS' is right. There is no guarantee in this model that the last glimpse sees the MNIST digit, but through training, the probability that the whole sequence sees the MNIST digit becomes high.

qihongl commented 7 years ago

@jtkim-kaist I see. That sounds great! I am finally about to finish my crazy semester and I hope to have a look at your code ASAP.

Thank you very much!

machanic commented 7 years ago

@Hippogriff @QihongL @jtkim-kaist I am so glad that you have helped me a lot! A few questions about the REINFORCE algorithm still confuse me. Below is the RAM paper's gradient of the loss function:

[equation image not preserved]

What does \pi stand for here? And in the original REINFORCE paper, Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, the formula is:

[equation image not preserved]

Is the eligibility in this formula the same as tf.log(p_loc + SMALL_NUM) in the source code? What does eligibility mean? I still don't fully understand the algorithm. By the way, @Hippogriff, I cannot find your formula in the paper: [image: ram3]. @Hippogriff, if you have written the tutorial about the REINFORCE algorithm, please tell me. Thank you all very much!
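For reference, the two formulas asked about here are usually written as below. This is a reconstruction from how the two papers are commonly cited, not a copy of what was originally posted in the thread, so the notation should be checked against the sources. In the RAM setting, \pi is the policy: the distribution over actions u_t (here the glimpse location, a Gaussian centred on mean_loc) given the interaction history s_{1:t}, with parameters \theta.

```latex
% RAM paper (Mnih et al.): sample approximation of the policy gradient,
% averaged over M episodes, with a baseline b_t subtracted from the return.
\nabla_\theta J
  \approx \frac{1}{M} \sum_{i=1}^{M} \sum_{t=1}^{T}
    \nabla_\theta \log \pi\!\left(u_t^i \mid s_{1:t}^i; \theta\right)
    \left(R^i - b_t\right)

% Williams' REINFORCE update for weight w_{ij}, with learning rate
% \alpha_{ij}, reinforcement r, baseline b_{ij}, and the
% "characteristic eligibility" e_{ij}.
\Delta w_{ij} = \alpha_{ij}\,\left(r - b_{ij}\right)\, e_{ij},
\qquad
e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}}
```

Under this reading, tf.log(p_loc + SMALL_NUM) plays the role of \ln g_i (equivalently \log \pi), and the eligibility is its gradient with respect to the weights, which automatic differentiation computes, rather than the log-probability itself.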

machanic commented 7 years ago

By the way, if I want to find multiple digits in an image, how should I modify the loss function?
