hitachi-speech / EEND

End-to-End Neural Diarization
MIT License

Question about shuffle #13

Closed · takenori-y closed this issue 3 years ago

takenori-y commented 3 years ago

In the implementation, acoustic features rather than embeddings are shuffled during training. Is that OK? The positional encoding for the Transformer-based encoder seems to become meaningless in that case.
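For concreteness, "shuffling acoustic features" here means permuting the time frames of an utterance, with the frame-wise diarization labels permuted identically so each frame keeps its label. A minimal NumPy sketch of the idea (`shuffle_frames` is a hypothetical helper written for illustration, not the repository's actual dataloader code):

```python
import numpy as np

def shuffle_frames(features, labels, rng=None):
    """Permute the time frames of an utterance, applying the same
    random permutation to the features and the frame-wise labels.

    features: (T, D) array of acoustic features
    labels:   (T, S) array of per-frame speaker activity labels
    """
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(features))
    return features[order], labels[order]

# Example: 500 frames of 23-dim features, 2 speakers
feats = np.random.randn(500, 23)
labels = np.random.randint(0, 2, size=(500, 2))
feats_shuf, labels_shuf = shuffle_frames(feats, labels)
```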

shota-horiguchi commented 3 years ago

Thank you for your interest! We create the positional encoding here, but it is actually not used in our model, so it is fine to shuffle acoustic features in the dataloader. As we reported in our ASRU 2019 paper, we did not use positional encodings:

The architecture of the encoder block is depicted in Fig. 2. This configuration of the encoder block is almost the same as the one in the Speech-Transformer introduced in [44], but without positional encoding.

I'm sorry for the confusion.
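The reason this works: without positional encoding, a Transformer encoder is permutation-equivariant, since self-attention and the position-wise feed-forward layers do not depend on frame order. Shuffling the input frames just shuffles the output frames by the same permutation, so frame-order shuffling loses nothing as long as the labels are shuffled identically. A small sketch demonstrating this property (using PyTorch's generic `nn.TransformerEncoder` purely for illustration, not the repository's own Chainer implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Encoder with NO positional encoding: just self-attention + FFN layers.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
).eval()  # eval() disables dropout so outputs are deterministic

x = torch.randn(1, 10, 64)   # (batch, frames, features)
perm = torch.randperm(10)    # a random frame permutation

with torch.no_grad():
    out_a = encoder(x[:, perm])   # shuffle frames, then encode
    out_b = encoder(x)[:, perm]   # encode, then shuffle the outputs

# Both orderings should agree up to numerical tolerance -> expect True
print(torch.allclose(out_a, out_b, atol=1e-5))
```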

takenori-y commented 3 years ago

Ah, I see. Thank you for answering my question.