The formula I gave was for the loss function, but the formula given in the RAM paper is for the gradient; it is the same as mine, just with gradient symbols. Since TensorFlow uses automatic differentiation, we don't care about the gradient expression, we just specify the loss function. Sorry I haven't made the tutorial yet; I have been busy with other things.
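For illustration, here is a minimal toy sketch of that point (TF1-style, with names of my own choosing, not code from this repo): we only write down the loss, and the optimizer's autodiff produces the REINFORCE gradient that the RAM paper writes out by hand.

```python
import tensorflow as tf

# Toy illustration: a 3-armed bandit with a softmax policy.
logits = tf.Variable(tf.zeros([3]))          # policy parameters
action = tf.placeholder(tf.int32,   [None])  # actions sampled from the policy
reward = tf.placeholder(tf.float32, [None])  # R observed for each action
b      = tf.placeholder(tf.float32, [None])  # baseline value

log_pi   = tf.nn.log_softmax(logits)         # log pi(a) for every arm
log_pi_a = tf.gather(log_pi, action)         # log pi(a_t) for the sampled actions
loss     = -tf.reduce_mean(log_pi_a * (reward - b))  # REINFORCE written as a loss

# No hand-written gradient: minimize() differentiates the loss automatically.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```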
@hhhmoan @QihongL @Hippogriff @jtkim-kaist
HELP, does anyone know how to modify the code to support multiple-object recognition?
That is, when an image has multiple labels, for example when more than one MNIST digit is placed on the image.
I read the paper arXiv:1402.7755v2.
Hi, @jtkim-kaist,
I just played with your modification a little bit. I was able to get 2% error on the translated task (which is slightly worse than the original paper). However, more importantly, I couldn't replicate the result on the untranslated (28x28) version of the task either. Were you able to achieve 1% error? If so, what were the hyperparameter values?
If you haven't tried that, I guess there is something qualitatively different between our implementation and the original model.
@QihongL
Sorry for the very late answer (I have been very busy). In my experiments, the error was reduced to 1.81% with the same hyper-parameters. Furthermore, I found that the performance differed every time I trained the model, because of the randomized initialization, which cannot be fixed when using the GPU. (I found this to be a TensorFlow issue; it may or may not be fixed in the latest version.) Therefore, I guess that with a lucky initialization point the performance could reach 1%.
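For anyone hitting the same run-to-run variation, a small sketch of how one would at least pin down the initialization (TF1-style API; the variable below is only a stand-in of mine, not from the repo). Note that some GPU kernels in TF versions of that era were still non-deterministic, so this reduces, but does not eliminate, the variation on GPU.

```python
import numpy as np
import tensorflow as tf

SEED = 42
np.random.seed(SEED)        # fixes numpy-side randomness (e.g. data shuffling)
tf.set_random_seed(SEED)    # fixes the graph-level seed for TF initializers

# Example variable: its initial values are now reproducible across runs (on CPU).
w = tf.get_variable(
    "w", shape=[256, 10],
    initializer=tf.truncated_normal_initializer(stddev=0.1, seed=SEED))
```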
@jtkim-kaist don't worry about it! I totally understand! And thanks for the reply!
I see! When you achieved 1.81%, how long did it take (with the GPU)?
@Hippogriff I have a question about your loss function.
I understand that it encourages samples near the right positions (which lead to correct classification) and discourages samples near the wrong positions (which lead to incorrect classification). But in the code, the stddev of the sampling module is fixed to a constant. So does the stddev part of the PDF become meaningless?
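To make the fixed-stddev question concrete, here is a small numpy sketch (my own illustration, not the repo's code) of which part of the Gaussian PDF actually drops out when sigma is a constant:

```python
import numpy as np

# Gaussian location log-density with a fixed stddev.
def log_pdf(l, mu, sigma):
    return -0.5 * ((l - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def dlogpdf_dmu(l, mu, sigma):
    # Gradient w.r.t. the mean: the -log(sigma*sqrt(2*pi)) normalization term
    # vanishes because it is constant, but sigma does not become meaningless:
    # it still scales the gradient and sets the exploration noise when sampling.
    return (l - mu) / sigma ** 2

print(dlogpdf_dmu(0.3, 0.1, 0.2))   # 5.0 -> a smaller sigma gives a larger gradient
```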
@Hippogriff I have a question about the loss function: why is this term in the loss?
`J = J - tf.reduce_sum(tf.square(R - b), 1)`
This term is here to train the baseline in a supervised setting; it is basically a mean squared error. The baseline can be seen as a value function, and R is an unbiased estimator of the value function, so it is fine to use R to train the baseline.
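In other words (a minimal TF1-style sketch under my own naming, not the repo's exact code), that term is just a regression of the baseline onto the observed reward:

```python
import tensorflow as tf

core_state = tf.placeholder(tf.float32, [None, 256])  # e.g. the RNN core state
R          = tf.placeholder(tf.float32, [None, 1])    # 1 if classified correctly, else 0

# The baseline b is a small head on the core state.
b = tf.layers.dense(core_state, 1, name="baseline")

# Mean squared error: since E[R | state] is the value of that state, R is an
# unbiased regression target for b.
baseline_loss = tf.reduce_mean(tf.square(R - b))
train_op = tf.train.AdamOptimizer(1e-3).minimize(baseline_loss)
```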
First, in the function calc_reward, when you compute J, you use p_loc, which is built from mean_locs and sample_locs, but both mean_locs and sample_locs are passed through stop_gradient. So I think tf.log(p_loc + SMALL_NUM) * (R - no_grad_b) has no effect when computing the gradients. Also, why does this need pre-training? I never found that method in the paper.
Thanks for releasing your code. Can you help clear up my doubts? Also, have you finished the experiment on the translated cluttered MNIST data (100 x 100)? If you have, please @ me. Thanks.
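I can't answer for this repo's calc_reward, but for reference, here is one common way this term is wired so that it does contribute gradients (a TF1-style sketch with my own names and an assumed fixed stddev, not a claim about this repo): only the sampled locations are wrapped in stop_gradient, while the log-probability is written in terms of the differentiable mean_locs. If both tensors really are stopped, the term would indeed pass no gradient to the location network.

```python
import numpy as np
import tensorflow as tf

SMALL_NUM = 1e-10
sigma = 0.22                                          # assumed fixed policy stddev

core_out  = tf.placeholder(tf.float32, [None, 256])   # RNN output per glimpse
R         = tf.placeholder(tf.float32, [None, 1])     # reward
no_grad_b = tf.placeholder(tf.float32, [None, 1])     # baseline, already detached

mean_locs   = tf.layers.dense(core_out, 2, activation=tf.nn.tanh, name="loc_net")
sample_locs = tf.stop_gradient(                       # only the *sample* is a constant
    mean_locs + tf.random_normal(tf.shape(mean_locs), stddev=sigma))

# The density is evaluated with the differentiable mean_locs, so the REINFORCE
# term below still back-propagates into the location network.
p_loc = tf.exp(-tf.square(sample_locs - mean_locs) / (2.0 * sigma ** 2)) / \
        (sigma * np.sqrt(2.0 * np.pi))
J_reinforce = tf.reduce_sum(tf.log(p_loc + SMALL_NUM) * (R - no_grad_b), 1)
```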