Closed jazzsaxmafia closed 9 years ago
We only sample half the time. In other cases, we simple use the expected value as explained in the bottom half of page 5 of the arxiv paper.
Ah so sampling for half of the time, and use deterministic attention for rest of the time. Now it is clear. Thank you.
No problem.
Hello, while analyzing the source code, I found the process of getting alpha_sample by stochastic hard attention quiet not clear, mainly because of the variable 'h_sampling_mask'
The sampling part of the code is (in capgen.py, line 409), alpha_sample = h_sampling_mask * trng.multinomial(pvals=alpha,dtype=theano.config.floatX)\
When h_sampling_mask is 1, alpha_sample would be the sampling result of the multinomial distribution. When h_sampling_mask is 0, however, alpha_sample would be simply alpha.
I though, according to the paper, alpha_sample should be simply alpha_sample = trng.multinomial(pvals=alpha,dtype=theano.config.floatX) which is equivalent to setting h_sampling_mask 1.
Why is "h_sampling_mask" needed?