skaae / Lasagne-CTC

CTC for lasagne
Apache License 2.0

What does the comment mean? #1

Open ghost opened 9 years ago

ghost commented 9 years ago

The comment on pseudo_cost() says: "This cost should have the same gradient but hopefully theano will use a more stable implementation of it." What does this actually mean? Is this implementation currently unstable?

skaae commented 9 years ago

Not entirely sure. You'll have to ask the original author mentioned in the docs.

mpezeshki commented 9 years ago

As far as I remember, the only difference is normalization. So one can get the gradients from pseudo_cost and use cost for monitoring purposes.

kshmelkov commented 9 years ago

Honestly, I can't understand the purpose of that function; I couldn't find any equivalent in the original code. As far as I understand, pseudo_cost computes the gradients manually instead of relying on Theano's autodiff. It seems that in the example from @mohammadpz's repo only cost is used, for both training and monitoring.

pbrakel commented 9 years ago

Hey, sorry for the unclear comment. I think I wrote that more as a note to myself; it refers to the fact that I feared the gradient might still be unstable without the skip_softmax option (which turned out to be true). The pseudo_cost function computes the gradient manually first and then combines it with the input values to obtain a score that fools Theano into retrieving that gradient again.

As an example, the gradient of the categorical cross entropy is something like $-t/y$, and after multiplying it with the softmax derivative it becomes $y-t$, where $y$ is the softmax output and $t$ is your desired label in one-hot coding. The first of these two gradients is numerically quite risky due to the possible division by zero, so it would be nice if we could skip it and get to $y-t$ directly.

Knowing this, we can simply compute $y-t$ by hand, but we still need to give Theano some cost to take the gradient of, such that it will apply the chain rule and multiply that gradient with the other derivatives it computes. By substituting $y-t$ with some matrix/vector $a$ that we consider constant (i.e., we don't try to propagate gradients through it), we can write $L=sum(a*o)$, where $o$ is the output before it goes into the softmax. Theano will conclude that the gradient of this cost wrt $o$ is $a=y-t$, even though $L$ will most likely be very different from the actual cross entropy and doesn't even need to be positive.
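A minimal Theano sketch of that substitution for the softmax/cross-entropy case (the variable names are mine, not from the repo; `disconnected_grad` plays the role of treating $a$ as a constant):

```python
import theano
import theano.tensor as T

o = T.matrix('o')  # pre-softmax outputs
t = T.matrix('t')  # one-hot targets

y = T.nnet.softmax(o)

# Naive cross entropy: its gradient wrt o passes through -t/y first,
# which risks division by (near) zero when y underflows.
ce = -T.sum(t * T.log(y))

# The trick: compute the combined gradient y - t by hand, mark it as a
# constant, and multiply it with o. By the chain rule Theano then reports
# d(pseudo)/d(o) = y - t, even though `pseudo` itself is not the cross entropy.
a = theano.gradient.disconnected_grad(y - t)
pseudo = T.sum(a * o)

g_naive = T.grad(ce, o)       # goes through -t/y and the softmax Jacobian
g_pseudo = T.grad(pseudo, o)  # exactly y - t
```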

This is what pseudo_cost tries to do for CTC, because the original cost was numerically unstable. If my reasoning is wrong, please let me know, but so far we've gotten decent results with this CTC implementation. I fully admit it's not the most beautiful solution, and it would probably be nicer to write a Theano op that does this, but I haven't found the time for that yet.

kshmelkov commented 9 years ago

It makes much more sense now, thank you. However, I don't see how this is specific to the CTC cost. If it is related only to softmax/cross-entropy, it must be a problem for almost any convnet implementation. Are you suggesting that Theano's backpropagation of the categorical cross entropy is numerically unstable in general?

pbrakel commented 9 years ago

I remember some implementations of it being more reliable than others. If I remember correctly, the one taking indices seems more stable than the one that expects one-hot coding. The problem is also that our batch version of CTC needs to propagate zeros in the log domain, which leads to computations that can produce things like inf - inf or inf * 0.
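A tiny NumPy illustration of where those NaNs come from (not code from the repo, just the arithmetic):

```python
import numpy as np

# A probability of zero becomes -inf in the log domain. That is fine on its
# own, but intermediate expressions involving it can turn into NaN:
log_zero = np.log(0.0)                   # -inf (with a RuntimeWarning)

print(log_zero - log_zero)               # -inf - (-inf)  -> nan
print(np.inf * np.exp(log_zero))         # inf * 0        -> nan
print(np.logaddexp(log_zero, log_zero))  # handled safely -> -inf
```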

kshmelkov commented 9 years ago

Well, I have done some experiments on my tasks. I agree that pseudo_cost behaves somewhat more stably, but I couldn't find a pattern (i.e., the effect is inconsistent). For my tasks, rmsprop and adadelta are stable enough even when using cost.

Anyway, I suggest that this should be solved at the Theano level. As I said, log(softmax(.)) is a very common function, so it has to be treated correctly. I have done some googling; this problem has already been noticed and reported upstream a few times: Theano/Theano#2944, Theano/Theano#2781, mila-udem/blocks#654. It also seems that Theano contains a related optimization, but I don't understand its semantics (it is buried in the cuDNN code). Somebody mentioned very different stability depending on mode=FAST_RUN vs. FAST_COMPILE (which makes sense if it is just a graph optimization).
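To make the log(softmax(.)) point concrete, here is what that kind of rewrite buys you, sketched in plain NumPy (this is not the Theano optimization itself):

```python
import numpy as np

def log_softmax_naive(x):
    # log(softmax(x)) computed literally: exp(x) overflows for large x and
    # the log of a tiny softmax value underflows to log(0) = -inf.
    e = np.exp(x)
    return np.log(e / e.sum())

def log_softmax_stable(x):
    # The rewritten form: shift by the max and use log-sum-exp, so nothing
    # overflows or passes through log(0).
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

x = np.array([1000.0, 0.0, -1000.0])
print(log_softmax_naive(x))   # [  nan  -inf  -inf]
print(log_softmax_stable(x))  # [    0. -1000. -2000.]
```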

What I took away from these discussions is that Theano can optimize log(softmax(.)) (on the CPU as well), but sometimes doesn't, presumably because of a scan sitting between the two operators. @pbrakel, might that be the case in CTC?