Hi,
I recently read this work; it's a really good idea and a nice piece of work! But I have a question: what happens if we change the function _expected_with_replacement to:
@K.tf.custom_gradient
def _expected_with_replacement(weights, attention, features):
    """Approximate the expectation as if the samples were i.i.d. from the
    attention distribution.

    The gradient is simply scaled wrt the sampled attention probability to
    account for samples that are unlikely to be chosen.
    """
    # Compute the expectation
    wf = expand_many(weights, [-1] * (K.ndim(features) - 2))
    F = K.sum(wf * features, axis=1)

    # Compute the gradient
    def gradient(grad):
        grad = K.expand_dims(grad, 1)

        # Gradient wrt the attention
        ga = grad * features
        ga = K.sum(ga, axis=list(range(2, K.ndim(ga))))
        ga = ga * weights / attention

        # Gradient wrt the features
        gf = wf * grad

        return [None, ga, gf]

    return F, gradient
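For context, here is a minimal sketch (my own toy setup with scalar features and a single draw, not code from the repository) of what the `ga = ga * weights / attention` line is doing: dividing by the sampled attention probability is an importance-sampling correction, which keeps the sampled gradient an unbiased estimate of the gradient of the full (no-sampling) expectation.

```python
import numpy as np

# Hypothetical toy setup: N items, attention distribution p, scalar features f.
rng = np.random.default_rng(0)
N = 5
p = rng.dirichlet(np.ones(N))   # attention distribution over the N items
f = rng.standard_normal(N)      # one scalar feature per item

# Full (no-sampling) output and its gradient wrt the attention:
# F = sum_i p_i * f_i  =>  dF/dp_i = f_i
true_grad = f.copy()

# Sampled estimator with a single draw (k = 1, weight w = 1):
# drawing index j contributes (w / p_j) * f_j to the gradient of p_j,
# which is exactly the "ga = ga * weights / attention" correction.
# Its exact expectation, enumerated over all possible draws:
est_grad = np.zeros(N)
for j in range(N):
    g = np.zeros(N)
    g[j] = (1.0 / p[j]) * f[j]  # importance-sampling correction
    est_grad += p[j] * g        # weight this draw by its probability

print(np.allclose(est_grad, true_grad))  # the estimator is unbiased
```

Removing the `weights / attention` scaling would therefore bias the gradient wrt the attention, which is why I am curious how much it matters in practice.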
That means we would no longer use the back-propagation method described in the paper, yet end-to-end training would still be possible.
Would the experimental results become worse in this condition?
Thanks!