Smoothing the activations at the output of the transformer

Hey there, I was wondering whether you ran into any issues with smoothing the speaker activations predicted by the Transformer model. An encoder-only Transformer tends to output speaker activations that are less smooth over time than those produced by recurrent models such as Bi-LSTMs. Did you use any tricks to smooth the output activations of the Transformer, or was this not an issue at all?
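To make the question concrete, here is a minimal sketch of the kind of post-hoc smoothing I have in mind: a median filter applied along the time axis of the frame-level activations. This is only an illustration under my own assumptions, not a description of what this repository does. The function name `median_smooth`, the `(frames, speakers)` array layout, and the default `kernel_size` are hypothetical.

```python
import numpy as np


def median_smooth(activations, kernel_size=11):
    """Median-filter per-frame speaker activations along the time axis.

    activations: array of shape (T, S), one column per speaker (assumed layout).
    kernel_size: odd window length in frames (hypothetical default).
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    pad = kernel_size // 2
    # Repeat the edge frames so the output keeps the same length T.
    padded = np.pad(activations, ((pad, pad), (0, 0)), mode="edge")
    # Windows over time: the result has shape (T, S, kernel_size).
    windows = np.lib.stride_tricks.sliding_window_view(
        padded, kernel_size, axis=0
    )
    return np.median(windows, axis=-1)


# Toy input: one speaker, five frames, with a single-frame spike.
probs = np.array([[0.1], [0.1], [0.9], [0.1], [0.1]])
smoothed = median_smooth(probs, kernel_size=3)
decisions = smoothed > 0.5  # threshold after smoothing
```

A median filter removes isolated single-frame flips without blurring segment boundaries as much as a moving average would, which is why I would try it first. The window length trades off jitter removal against losing very short speaker turns.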

hitachi-speech / EEND

Smoothing the activations at the output of the transformer #42