Closed: tristandeleu closed this issue 9 years ago
Somehow this seems to be due to the normalization with `theano.tensor.norm` in the `cosine_similarity`, and more precisely to the use of `T.abs_` in the computation of `T.norm`. I switched to a manual computation of the norm (with `T.sqrt(T.sum(x * x))`) and it worked fine, even in `FAST_COMPILE` mode. This may be specific to the combination of the normalization with `T.norm` and `scan`, as it did not happen when unrolling the network through time instead of using `scan`.
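For reference, here is a minimal sketch of that workaround, assuming a standard cosine similarity between two vectors; the function name, the `eps` stabilizer, and the test harness are illustrative assumptions, not the exact code from this model:

```python
import theano
import theano.tensor as T

def cosine_similarity(x, y, eps=1e-6):
    # Manual norm via T.sqrt(T.sum(x * x)) instead of the built-in
    # norm (which goes through T.abs_ and produced NaN gradients
    # under FAST_COMPILE, per the report above).
    norm_x = T.sqrt(T.sum(x * x))
    norm_y = T.sqrt(T.sum(y * y))
    # eps is an illustrative guard against division by zero.
    return T.dot(x, y) / (norm_x * norm_y + eps)

x = T.vector('x')
y = T.vector('y')
sim = cosine_similarity(x, y)
grad = T.grad(sim, wrt=x)
f = theano.function([x, y], [sim, grad], mode='FAST_COMPILE')
```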
During the computation of the gradient with backpropagation, it sometimes outputs NaN values when compiling in `FAST_COMPILE` mode. When we compute the gradient of the cost wrt `W_wr_add` (or `b_wr_add`) with backpropagation, it outputs NaNs at the last step of the gradient computation of the cost wrt the hidden state for the first step. It seems to come from the initialization of `h_0` as a zero vector (no issue with a uniform Glorot initialization). It also seems to be specific to the `rectify` activation used for the `add` vector (no issue with other activations like `identity` or `tanh`). Finally, it works as intended in `FAST_RUN` mode with zero initialization and `rectify` activation.
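A minimal sketch of a setup matching that description follows: a `scan`-based recurrence with a zero-initialized `h_0` and a `rectify` activation, whose gradient is compiled in `FAST_COMPILE` mode. The weight names `W` and `U`, the shapes, and the cost are hypothetical stand-ins for the model's actual parameters:

```python
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX
rectify = lambda z: T.maximum(z, 0)

X = T.matrix('X')  # (time steps, features); shape is illustrative
W = theano.shared(np.random.uniform(-0.1, 0.1, (5, 5)).astype(floatX))
U = theano.shared(np.random.uniform(-0.1, 0.1, (5, 5)).astype(floatX))
h_0 = T.zeros((5,))  # the zero initialization said to trigger the NaNs

def step(x_t, h_tm1):
    # rectify activation on the recurrent update, as in the report
    return rectify(T.dot(x_t, W) + T.dot(h_tm1, U))

h, _ = theano.scan(step, sequences=X, outputs_info=h_0)
cost = h[-1].sum()  # hypothetical scalar cost for the gradient
grads = T.grad(cost, wrt=[W, U])
f = theano.function([X], grads, mode='FAST_COMPILE')
```

Swapping `h_0` for a Glorot-style uniform initialization, or `rectify` for `T.tanh`, would exercise the cases reported above as working.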