🚀 The feature
Thanks for your amazing contributions.
As far as I understand, the Transformer encoder employed in torchaudio does not expose attention scores in its outputs. If I am mistaken, please let me know and ignore this thread.
The attention weights computed at the following line could be saved and then returned by the `return` statement at line 326.
https://github.com/pytorch/audio/blob/1717edaa8cddf5068df97e30404d85654f0b55f4/torchaudio/models/wav2vec2/components.py#L317
Instead, the current implementation discards them and returns only the output representations. Line 326: `return output, None`
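For concreteness, below is a minimal sketch of the kind of change this would take. It is a generic multi-head self-attention, not the actual torchaudio module, and the `need_weights` flag is a hypothetical name; the real `SelfAttention` in components.py additionally handles masking, dropout, and bias terms.

```python
from typing import Optional, Tuple

import torch
from torch import Tensor, nn


class SelfAttention(nn.Module):
    """Minimal multi-head self-attention that can return its attention
    weights instead of a hard-coded ``None``. A sketch only."""

    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(
        self, x: Tensor, need_weights: bool = False  # hypothetical flag
    ) -> Tuple[Tensor, Optional[Tensor]]:
        batch, seq, embed_dim = x.shape
        q = self.q_proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
        # These are the attention scores the issue is about.
        weights = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        output = (weights @ v).transpose(1, 2).reshape(batch, seq, embed_dim)
        # Return the weights instead of None when the caller asks for them.
        return self.out_proj(output), weights if need_weights else None
```

With `need_weights=False` (the default) the method still returns `output, None`, so existing callers would be unaffected.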
Motivation, pitch
The attention scores of the Transformer encoder are very valuable for designing more advanced models. The Hugging Face implementation exposes them through a configuration flag (`output_attentions`), which lets researchers explore new directions such as model predictions that take attention scores into account, or loss functions defined over the attention scores.
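For reference, this is roughly how it looks with the Hugging Face `transformers` API (the checkpoint name is only an example):

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio
with torch.no_grad():
    out = model(waveform, output_attentions=True)

# One tensor per encoder layer, each of shape
# (batch, num_heads, frames, frames).
print(len(out.attentions), out.attentions[0].shape)
```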
Alternatives
No response
Additional context
No response

Thanks for the suggestion, and I think this is a good addition. We need to think about how to actually enable this, preferably while keeping backward compatibility.
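One way to enable this while keeping backward compatibility might be to gate the extra output behind a flag that defaults to off, so existing callers see exactly the same return values. A sketch only, with hypothetical names (`output_attentions`, `need_weights`), not the actual torchaudio API:

```python
from typing import List, Optional, Tuple

from torch import Tensor, nn


class Encoder(nn.Module):
    """Sketch of an encoder stack that can optionally collect the
    per-layer attention weights."""

    def __init__(self, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers  # each layer returns (output, weights_or_none)

    def forward(
        self, x: Tensor, output_attentions: bool = False
    ) -> Tuple[Tensor, Optional[List[Tensor]]]:
        attentions: Optional[List[Tensor]] = [] if output_attentions else None
        for layer in self.layers:
            x, weights = layer(x, need_weights=output_attentions)
            if output_attentions:
                attentions.append(weights)
        # With the default flag this returns (x, None), matching the
        # current behavior.
        return x, attentions
```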