sgugger closed 4 years ago
So in both cases, the dropout should be applied to the matrix of hidden states (I think S4TF returns a list of hidden states rather than a single batchSize x sequenceLength x hiddenDimension matrix).
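To make the shape issue concrete, here is a rough sketch in plain Python, not S4TF code. It assumes the recurrent layer returns one [batch][hidden] state per time step; the helper names `stack_hidden_states` and `dropout` are hypothetical, and the dropout shown is the usual inverted variant (kept values are rescaled by 1 / (1 - p)), which may differ from what the model actually uses.

```python
import random

def stack_hidden_states(states):
    """Turn a list of per-step [batch][hidden] states into [batch][seq][hidden].

    Assumption: `states` has one entry per time step, as the comment above
    suspects the RNN returns.
    """
    batch_size = len(states[0])
    return [[step[b] for step in states] for b in range(batch_size)]

def dropout(x, p, rng, training=True):
    """Inverted dropout over a nested [batch][seq][hidden] list.

    Each element is zeroed with probability p; survivors are divided by
    (1 - p) so the expected value is unchanged. No-op outside training.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return x
    keep = 1.0 - p
    return [
        [[v / keep if rng.random() < keep else 0.0 for v in hidden] for hidden in seq]
        for seq in x
    ]

# 3 time steps, batch of 2, hidden size 2 (made-up numbers for illustration).
states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]
stacked = stack_hidden_states(states)          # shape: 2 x 3 x 2
dropped = dropout(stacked, 0.5, random.Random(0))
```

The point is only that dropout is applied once, to the stacked batch x sequence x hidden structure, and not separately inside the recurrence.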
Dropout added in https://github.com/tensorflow/swift-models/pull/550.
Currently, dropout is not applied to the embeddings in the encoder and the decoder because it breaks AD (at least that's what the comments say). See here and there.