The formula I gave was for the loss function, but the formula given in the RAM paper is for the gradient; it is the same as mine, just with gradient symbols. Since TensorFlow uses automatic differentiation, we don't care about the gradient expression, we just specify the loss function. Sorry I haven't made the tutorial yet; I have been busy with other things.
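For illustration, here is a minimal toy sketch of that point (TF1-style, with names of my own choosing, not code from this repo): we only write down the loss, and the optimizer's autodiff produces the REINFORCE gradient that the RAM paper writes out by hand.

```python
import tensorflow as tf

# Toy illustration: a 3-armed bandit with a softmax policy.
logits = tf.Variable(tf.zeros([3]))          # policy parameters
action = tf.placeholder(tf.int32,   [None])  # actions sampled from the policy
reward = tf.placeholder(tf.float32, [None])  # R observed for each action
b      = tf.placeholder(tf.float32, [None])  # baseline value

log_pi   = tf.nn.log_softmax(logits)         # log pi(a) for every arm
log_pi_a = tf.gather(log_pi, action)         # log pi(a_t) for the sampled actions
loss     = -tf.reduce_mean(log_pi_a * (reward - b))  # REINFORCE written as a loss

# No hand-written gradient: minimize() differentiates the loss automatically.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
```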
@hhhmoan @QihongL @Hippogriff @jtkim-kaist
HELP, does anyone know how to modify the code to support multiple-object recognition?
That is, when an image has multiple labels, for example when more than one MNIST digit is placed on the image.
I read the paper arXiv:1402.7755v2.
Hi, @jtkim-kaist,
I just played with your modification a little bit. I was able to get 2% error on the translated task (which is slightly worse than the original paper). However, more importantly, I couldn't replicate the result on the untranslated (28x28) version of the task either. Were you able to achieve 1% error? If so, what were the hyperparameter values?
If you haven't tried that, I guess there is something qualitatively different between our implementation and the original model.
@QihongL
Sorry for the very late answer (I have been very busy). In my experiments, the error was reduced to 1.81% with the same hyper-parameters. Furthermore, I found that the performance differed every time I trained the model, because of the randomized initialization, which cannot be fixed when using the GPU. (I found this to be a TensorFlow issue; it may or may not be fixed in the latest version.) Therefore, I guess that with a lucky initialization point the performance could reach 1%.
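For anyone hitting the same run-to-run variation, a small sketch of how one would at least pin down the initialization (TF1-style API; the variable below is only a stand-in of mine, not from the repo). Note that some GPU kernels in TF versions of that era were still non-deterministic, so this reduces, but does not eliminate, the variation on GPU.

```python
import numpy as np
import tensorflow as tf

SEED = 42
np.random.seed(SEED)        # fixes numpy-side randomness (e.g. data shuffling)
tf.set_random_seed(SEED)    # fixes the graph-level seed for TF initializers

# Example variable: its initial values are now reproducible across runs (on CPU).
w = tf.get_variable(
    "w", shape=[256, 10],
    initializer=tf.truncated_normal_initializer(stddev=0.1, seed=SEED))
```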
@jtkim-kaist don't worry about it! I totally understand! And thanks for the reply!
I see! When you achieved 1.81%, how long did it take (with the GPU)?
@Hippogriff I have a question about your loss function.
I understand that it encourages samples near the right positions (which lead to correct classification) and discourages samples near the wrong positions (which lead to incorrect classification). But in the code, the stddev of the sampling module is fixed to a constant. So does the stddev part of the PDF become meaningless?
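To make the fixed-stddev question concrete, here is a small numpy sketch (my own illustration, not the repo's code) of which part of the Gaussian PDF actually drops out when sigma is a constant:

```python
import numpy as np

# Gaussian location log-density with a fixed stddev.
def log_pdf(l, mu, sigma):
    return -0.5 * ((l - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def dlogpdf_dmu(l, mu, sigma):
    # Gradient w.r.t. the mean: the -log(sigma*sqrt(2*pi)) normalization term
    # vanishes because it is constant, but sigma does not become meaningless:
    # it still scales the gradient and sets the exploration noise when sampling.
    return (l - mu) / sigma ** 2

print(dlogpdf_dmu(0.3, 0.1, 0.2))   # 5.0 -> a smaller sigma gives a larger gradient
```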
@Hippogriff I have a question about the loss function: why is this term in the loss?
`J = J - tf.reduce_sum(tf.square(R - b), 1)`
This term is here to train the baseline in a supervised setting; it is basically a mean squared error. The baseline can be seen as a value function, and R is an unbiased estimator of the value function, so it is fine to use R to train the baseline.
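In other words (a minimal TF1-style sketch under my own naming, not the repo's exact code), that term is just a regression of the baseline onto the observed reward:

```python
import tensorflow as tf

core_state = tf.placeholder(tf.float32, [None, 256])  # e.g. the RNN core state
R          = tf.placeholder(tf.float32, [None, 1])    # 1 if classified correctly, else 0

# The baseline b is a small head on the core state.
b = tf.layers.dense(core_state, 1, name="baseline")

# Mean squared error: since E[R | state] is the value of that state, R is an
# unbiased regression target for b.
baseline_loss = tf.reduce_mean(tf.square(R - b))
train_op = tf.train.AdamOptimizer(1e-3).minimize(baseline_loss)
```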
First, in the function calc_reward, when you compute J, you use p_loc, which is built from mean_locs and sample_locs, but both mean_locs and sample_locs are passed through stop_gradient. So I think tf.log(p_loc + SMALL_NUM) * (R - no_grad_b) has no effect when computing the gradients. Also, why does this need pre-training? I never found that method in the paper.
Thanks for releasing your code. Can you help clear up my doubts? Also, have you finished the experiment on the translated cluttered MNIST data (100 x 100)? If you have, please @ me. Thanks.
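I can't answer for this repo's calc_reward, but for reference, here is one common way this term is wired so that it does contribute gradients (a TF1-style sketch with my own names and an assumed fixed stddev, not a claim about this repo): only the sampled locations are wrapped in stop_gradient, while the log-probability is written in terms of the differentiable mean_locs. If both tensors really are stopped, the term would indeed pass no gradient to the location network.

```python
import numpy as np
import tensorflow as tf

SMALL_NUM = 1e-10
sigma = 0.22                                          # assumed fixed policy stddev

core_out  = tf.placeholder(tf.float32, [None, 256])   # RNN output per glimpse
R         = tf.placeholder(tf.float32, [None, 1])     # reward
no_grad_b = tf.placeholder(tf.float32, [None, 1])     # baseline, already detached

mean_locs   = tf.layers.dense(core_out, 2, activation=tf.nn.tanh, name="loc_net")
sample_locs = tf.stop_gradient(                       # only the *sample* is a constant
    mean_locs + tf.random_normal(tf.shape(mean_locs), stddev=sigma))

# The density is evaluated with the differentiable mean_locs, so the REINFORCE
# term below still back-propagates into the location network.
p_loc = tf.exp(-tf.square(sample_locs - mean_locs) / (2.0 * sigma ** 2)) / \
        (sigma * np.sqrt(2.0 * np.pi))
J_reinforce = tf.reduce_sum(tf.log(p_loc + SMALL_NUM) * (R - no_grad_b), 1)
```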