🚀 The feature
Thanks for your amazing contributions.
As far as I understand, the Transformer encoder employed in torchaudio does not expose attention scores in its outputs. If I am mistaken, please let me know and ignore this thread.
The attention weights computed at the following line could be saved and then returned by the `return` statement at line 326.
https://github.com/pytorch/audio/blob/1717edaa8cddf5068df97e30404d85654f0b55f4/torchaudio/models/wav2vec2/components.py#L317
Instead, the current implementation discards them and returns only the output representations. Line 326: `return output, None`
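For concreteness, below is a minimal sketch of the kind of change this would take. It is a generic multi-head self-attention, not the actual torchaudio module, and the `need_weights` flag is a hypothetical name; the real `SelfAttention` in components.py additionally handles masking, dropout, and bias terms.

```python
from typing import Optional, Tuple

import torch
from torch import Tensor, nn


class SelfAttention(nn.Module):
    """Minimal multi-head self-attention that can return its attention
    weights instead of a hard-coded ``None``. A sketch only."""

    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(
        self, x: Tensor, need_weights: bool = False  # hypothetical flag
    ) -> Tuple[Tensor, Optional[Tensor]]:
        batch, seq, embed_dim = x.shape
        q = self.q_proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        # These are the attention scores the issue is about.
        weights = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        output = (weights @ v).transpose(1, 2).reshape(batch, seq, embed_dim)
        # Return the weights instead of None when the caller asks for them.
        return self.out_proj(output), weights if need_weights else None
```

With `need_weights=False` (the default) the method still returns `output, None`, so existing callers would be unaffected.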
Motivation, pitch
The attention scores of the Transformer encoder are very valuable for designing more advanced models. The Hugging Face implementation exposes them through a configuration flag (`output_attentions`), which lets researchers explore new directions such as model predictions that take attention scores into account, or loss functions defined over the attention scores.
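For reference, this is roughly how it looks with the Hugging Face `transformers` API (the checkpoint name is only an example):

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_attentions=True)

# One tensor per encoder layer, each of shape
# (batch, num_heads, frames, frames).
print(len(out.attentions), out.attentions[0].shape)
```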
Alternatives
No response
Additional context
No response

Thanks for the suggestion, and I think this is a good addition. We need to think about how to actually enable this, preferably while keeping backward compatibility.
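One way to enable this while keeping backward compatibility might be to gate the extra output behind a flag that defaults to off, so existing callers see exactly the same return values. A sketch only, with hypothetical names (`output_attentions`, `need_weights`), not the actual torchaudio API:

```python
from typing import List, Optional, Tuple

from torch import Tensor, nn


class Encoder(nn.Module):
    """Sketch of an encoder stack that can optionally collect the
    per-layer attention weights."""

    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers  # each layer returns (output, weights_or_none)

    def forward(
        self, x: Tensor, output_attentions: bool = False
    ) -> Tuple[Tensor, Optional[List[Tensor]]]:
        attentions: Optional[List[Tensor]] = [] if output_attentions else None
        for layer in self.layers:
            x, weights = layer(x, need_weights=output_attentions)
            if output_attentions:
                attentions.append(weights)
        # With the default flag this returns (x, None), matching the
        # current behavior.
        return x, attentions
```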