mcf06 / theano_ctc

Theano bindings for Baidu's CTC library.
BSD 3-Clause "New" or "Revised" License

Trouble with enable/disable gradient calculation mechanism #5

Closed: DingKe closed this issue 8 years ago

DingKe commented 8 years ago

I tried to build two graphs with theano_ctc: one for training (which needs the gradient) and one for testing (which does not). If I define the training graph before the testing one, I encounter a segmentation fault at run time.

After some debugging, I found that defining the testing graph disables the gradient calculation (via a shared variable), which accidentally disables the training graph's gradient calculation as well. At run time the training graph then gets a NULL pointer when it tries to fetch the gradient, and the segmentation fault occurs.

As a workaround, I have to make local_GpuCtc_no_grad (and local_CpuCtc_no_grad) set the variable computeGradient to 1 all the time, i.e. disable the no-gradient optimization.

So a more robust mechanism is probably needed to decide when the gradient calculation can be skipped, but I don't have a clue how to do it.
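For reference, a minimal sketch of what that workaround amounts to, assuming local_GpuCtc_no_grad is a Theano local optimizer tracking the GpuCtc Op (the GpuCtc import path here is hypothetical):

from theano.gof import local_optimizer
from theano_ctc.ctc import GpuCtc  # hypothetical import path for the Op

@local_optimizer([GpuCtc])
def local_GpuCtc_no_grad(node):
    # Workaround: never rewrite the node into a no-gradient variant, so
    # computeGradient effectively stays at 1 for every graph and one
    # graph's definition can no longer flip a flag shared by another.
    return False  # leave the Apply node untouched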

mcf06 commented 8 years ago

I think I've figured out a way to handle this. Consider this code:

import numpy as np
import theano
import theano.tensor as T
from theano_ctc import ctc_cost

# Shared variables (hence the empty input lists below); shapes are illustrative
tsActs = theano.shared(np.zeros((10, 2, 5), dtype=np.float32))  # time x batch x classes
tsLabels = theano.shared(np.zeros((2, 3), dtype=np.int32))      # batch x label length
tsActT = theano.shared(np.full(2, 10, dtype=np.int32))          # per-sequence time steps

# CTC cost
tCost = ctc_cost(tsActs, tsLabels, tsActT)

# Gradient of CTC cost
tGrad = T.grad(T.mean(tCost), tsActs)

# Create train (with gradient for SGD) and test (no gradient) functions
train = theano.function([], [tCost, tGrad])
test = theano.function([], [tCost])

A single GpuCtc Op is generated by the call to ctc_cost(). However, when the train and test functions are compiled, there are different Apply nodes referencing this Op.
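For example, standard Theano introspection shows one Apply node per compiled function, both referencing the same Op (continuing the snippet above):

for name, f in [("train", train), ("test", test)]:
    # list the GpuCtc Apply nodes in each compiled graph
    nodes = [n for n in f.maker.fgraph.toposort()
             if type(n.op).__name__ == "GpuCtc"]
    print(name, nodes)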

I've changed the optimization to set the computeGradient input to 1 when the Apply node has an output client for the gradient, and to 0 when it doesn't.
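Roughly, the per-node decision looks like this (a sketch, not the repository's exact code; the GpuCtc import path is hypothetical and the rebuild step depends on the Op's make_node signature):

from theano.gof import local_optimizer
from theano_ctc.ctc import GpuCtc  # hypothetical import path

@local_optimizer([GpuCtc])
def local_GpuCtc_grad_flag(node):
    # Per-Apply-node decision rather than a process-wide shared flag:
    # does anything in this compiled graph consume the gradient output?
    needs_grad = len(node.outputs[1].clients) > 0
    # Rebuild the Apply with its computeGradient input set to
    # int(needs_grad); the construction details are omitted here.
    ...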

nouiz commented 8 years ago

The way we normally handle a case like this would be:

- make_node() always builds the Apply node with the gradient computation enabled, since at that point there is no way to know whether the gradient will be needed;
- a graph optimization then replaces any node whose gradient output has no clients with a variant that skips the gradient computation.

This way will work well with theano.grad. The optimization is simple to build.

mcf06 commented 8 years ago

Good suggestion, thanks. I've refactored the code a fair bit. In the new version, make_node() always assumes the gradient will be computed, since there's no way to tell when the Apply node is first created. The optimization then substitutes a cost-only Op and removes the (unused) gradient output from the optimized Apply node.
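A condensed sketch of that scheme, with illustrative Op names and signatures (the real Ops wrap Baidu's CTC kernels and take more inputs; perform()/c_code() are omitted):

import theano
import theano.tensor as T
from theano.gof import Op, Apply, Optimizer

class CtcWithGrad(Op):
    # make_node() always allocates both outputs, since at creation time
    # it cannot know whether the gradient will be used.
    def make_node(self, acts, labels, act_lens):
        return Apply(self, [acts, labels, act_lens],
                     [T.fvector(), acts.type()])  # cost, grad wrt acts

class CtcCostOnly(Op):
    # Cheaper variant substituted when the gradient is never consumed.
    def make_node(self, acts, labels, act_lens):
        return Apply(self, [acts, labels, act_lens], [T.fvector()])

class DropUnusedCtcGrad(Optimizer):
    def apply(self, fgraph):
        for node in fgraph.toposort():
            if isinstance(node.op, CtcWithGrad):
                cost, grad = node.outputs
                if not grad.clients:  # gradient output has no consumers
                    fgraph.replace(cost, CtcCostOnly()(*node.inputs),
                                   reason="drop unused CTC gradient")

In practice the optimizer would be registered with theano.compile.optdb so it runs while each function is being compiled, which is what lets the train and test graphs end up with different Ops.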