Closed adbrebs closed 9 years ago
Sounds like you might need more than a Brick, but maybe a custom Theano op.
Wait, what makes you say that? The actual gradient should only be computed with respect to the rows selected by `indices`. It's just that if you ask Theano for the gradient w.r.t. the entire weight matrix, `W`, it will allocate a tensor of zeros (the size of `W`), fill in the rows with non-zero gradient, and return that to you.
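To illustrate the shape of what Theano returns here, this is a NumPy sketch (with made-up sizes, not actual Theano code) of a gradient w.r.t. the full `W`: a dense array that is zero everywhere except the rows picked out by `indices`:

```python
import numpy as np

D, d, k = 10, 4, 3            # vocab size, embedding dim, rows actually used
W = np.random.randn(D, d)
indices = np.array([1, 5, 5])  # duplicate indices are accumulated
upstream = np.ones((k, d))     # gradient flowing into W[indices]

# What grad(cost, W) looks like: a full (D, d) tensor of zeros with the
# upstream gradient scattered (and summed) into the rows that were used.
grad_W = np.zeros_like(W)
np.add.at(grad_W, indices, upstream)

touched = np.unique(indices)
untouched = np.setdiff1d(np.arange(D), touched)
assert grad_W[touched].any()
assert not grad_W[untouched].any()
```

Note that the duplicate index 5 receives the sum of both upstream rows, which matches how indexed gradients accumulate.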
If I delegate the computation of the gradients to a `TrainingAlgorithm` such as:

```python
algorithm = GradientDescent(
    cost=cost,  # with a lookup table in the graph
    step_rule=...,
    params=ComputationGraph(cost).parameters)
```

then `ComputationGraph(cost).parameters` contains the full lookup matrix `W`, whereas it should only be `W[indices]`.
Why? You need a shared variable to take the gradients with respect to; returning `W[indices]` as a parameter wouldn't work.
Hmm, `LookupTable`'s params is `W`. So when you let a `TrainingAlgorithm` compute the gradient as I described, the gradient will be computed for all the weights of `W` (idem for the updates), which is obviously a waste of computation because we already know that only the `indices` rows have non-zero gradients. Am I missing something?
Btw, this is confirmed by my experiments: when I increase the vocabulary size of my `LookupTable`, the backpropagation takes longer, which should not happen.
Bart is right, though: this is a fundamental problem with Theano, not Blocks.
It would be point 3 from https://github.com/Theano/Theano/issues/2219. There's a way around it in Blocks, but I never committed it because I think the fix should be made in Theano. If you need it short-term, give this code a try:
```python
import theano
from theano import tensor

from blocks.algorithms import GradientDescent


class SubtensorGradientFix(GradientDescent):
    """Gradient descent with support for indexed gradients.

    Parameters
    ----------
    subtensor_params : dict
        A dictionary of shared variables to the subtensors which contribute
        to the gradient.

    """
    def __init__(self, cost, params, subtensor_params=None, *args, **kwargs):
        self.subtensor_params = subtensor_params
        # Take gradients w.r.t. the subtensor (e.g. W[indices]) instead of
        # the full shared variable wherever a substitute is given.
        params = [param
                  if param not in subtensor_params else subtensor_params[param]
                  for param in params]
        super(SubtensorGradientFix, self).__init__(cost=cost, params=params,
                                                   *args, **kwargs)

    def initialize(self):
        all_updates = self.updates
        # Note: the gradients are computed in the same order in which
        # the parameters were given. Keep it like that to ensure
        # reproducibility.
        reverse_subtensor_params = {value: key for key, value
                                    in self.subtensor_params.items()}
        for param in self.params:
            if param in reverse_subtensor_params:
                # Write the step back into only the selected rows of the
                # full shared variable.
                all_updates.append((
                    reverse_subtensor_params[param],
                    tensor.inc_subtensor(param, -self.steps[param])
                ))
            else:
                all_updates.append((param, param - self.steps[param]))
        all_updates += self.step_rule_updates
        self._function = theano.function(self.inputs, [], updates=all_updates)
```
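For intuition, the key line above is the `inc_subtensor` update, which adds the (negated) step into only the selected rows of the full matrix. A NumPy analogue of what that update does (illustrative names and sizes only; the step is assumed to already include the learning rate, as it would after the step rule):

```python
import numpy as np

D, d, k = 8, 3, 2
W = np.ones((D, d))             # the full lookup matrix
indices = np.array([2, 6])      # rows used in this minibatch
step = np.full((k, d), 0.05)    # step computed for W[indices] only

# Analogue of tensor.inc_subtensor(W[indices], -step): subtract the step
# from the indexed rows, leaving every other row untouched.
W_new = W.copy()
np.add.at(W_new, indices, -step)

assert np.allclose(W_new[2], 0.95)   # updated row
assert np.allclose(W_new[0], 1.0)    # untouched row unchanged
```

Only `k` rows are written, so the cost of the update no longer depends on the vocabulary size `D`.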
There's a typo in there, "aram".
Woops, fixed. I swear it worked though... Maybe this was an older version of the code? Anyway, should give you a place to start.
Thank you both, I will try that then!
With Bart's code and a small hack in the `LookupTable` code (to save `W[indices]`), it works just as expected, which means 10x speedups in my experiments on CPU with lookup tables of 100000 rows.
With the current implementation of `LookupTable`, the gradient is computed with respect to all the weights of the lookup table. However, we only need to compute the gradient with respect to the weights selected by the `indices` required when calling `apply(self, indices)`. If the lookup table has dimensions (D, d) and we retrieve k rows, the current gradient costs O(Dd), whereas it should only cost O(kd).
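To put rough numbers on that (illustrative sizes only, using the 100000-row vocabulary mentioned above and assumed values for d and k):

```python
# Back-of-the-envelope cost of a dense vs. an indexed lookup-table
# gradient/update: D*d entries touched vs. only k*d entries needed.
D, d, k = 100_000, 100, 32   # vocab size, embedding dim, rows per batch

dense_cost = D * d      # entries touched by the full-matrix gradient
sparse_cost = k * d     # entries with non-zero gradient

ratio = dense_cost / sparse_cost
print(ratio)  # 3125.0: the dense version does thousands of times more work
```

The asymptotic gap grows linearly with the vocabulary size, which is consistent with the observation above that backpropagation slows down as the vocabulary grows.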