[Open] FrancescoFrontino opened this issue 5 years ago
I encountered some problems using the Masking layer. Instead of skipping the padded timesteps, the network computes gradients that come out as NaN. More specifically, I padded the sequences with the value -1.0 using the pad_sequences function implemented in Keras, and then trained the model with the train_on_batch method.
Have you faced these kinds of problems before?
Could this be an explanation for the problem? "If any downstream layer does not support masking yet receives such an input mask, an exception will be raised." -- Keras documentation
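Here is a minimal sketch of the setup I describe, assuming a tf.keras model with a Masking layer in front of an LSTM (the architecture, data, and labels below are made up for illustration; in my real run the returned loss is NaN):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

# Hypothetical variable-length sequences with one feature per timestep.
seqs = [[[0.5], [0.2]], [[0.1], [0.4], [0.3]]]
x = pad_sequences(seqs, padding='post', value=-1.0, dtype='float32')
y = np.array([0.0, 1.0], dtype='float32')

model = Sequential([
    # mask_value must match the padding value exactly
    Masking(mask_value=-1.0, input_shape=(None, 1)),
    LSTM(8),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

loss = model.train_on_batch(x, y)  # scalar loss for this batch
print(loss)
```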
Hi, thanks for the comment! Do you have a reproducible example? I've never used pad_sequences myself.
In any case (when it's working), the Masking layer multiplies the loss by a 0/1 mask, provided every layer above it propagates the mask. So if any of the outputs is NaN, the end result after summation would still be NaN.
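You can see why with plain NumPy: multiplying by a 0/1 mask does not neutralize a NaN, because 0 * NaN is still NaN, so a single bad timestep poisons the summed loss (a toy illustration of the arithmetic, not the actual Keras internals):

```python
import numpy as np

per_step_loss = np.array([0.3, np.nan, 0.1])
mask = np.array([1.0, 0.0, 1.0])  # 0 marks the padded timestep

# 0 * NaN == NaN, so the masked sum is still NaN
print((per_step_loss * mask).sum())  # -> nan
```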