expected_with_replacement

Hi,

I recently read this work, it's really a good idea and work! but I have a question, if we change function _expected_with_replacement:

@K.tf.custom_gradient
def _expected_with_replacement(weights, attention, features):
    """Approximate the expectation as if the samples were i.i.d. from the
    attention distribtution.
    The gradient is simply scaled wrt to the sampled attention probablity to
    account for samples that are unlikely to be chosen.
    """
    # Compute the expectation
    wf = expand_many(weights, [-1] * (K.ndim(features) - 2))
    F = K.sum(wf * features, axis=1)

    # Compute the gradient
    def gradient(grad):
        grad = K.expand_dims(grad, 1)

        # Gradient wrt to the attention
        ga = grad * features
        ga = K.sum(ga, axis=list(range(2, K.ndim(ga))))
        ga = ga * weights / attention

        # Gradient wrt to the features
        gf = wf * grad

        return [None, ga, gf]

    return F, gradient

to:

def _expected_with_replacement(weights, attention, features):
    wf = expand_many(weights, [-1] * (K.ndim(features) - 2))
    F = K.sum(wf * features, axis=1)

that means, we don not use the back-propagation method mentioned in the paper. In this case, end-to-end training can also be done.

Will the experimental results become worse in this condition?

Thanks!

idiap / attention-sampling

expected_with_replacement #20