🌟 New model addition

Simultaneous Speech-to-text Translation using Monotonic Multihead Attention (MMA). Is anybody currently working on implementing this model? I am also unsure whether it can be supported by Hugging Face's systems: inference relies on frameworks like SimulEval to simulate streaming input, which may not be compatible with Hugging Face's current inference system.

Model description

MMA (Ma et al., 2019) extends the monotonic attention mechanism to multiple attention heads. It has been used to handle streaming text/speech inputs, mostly for translation.
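To make the compatibility concern more concrete, here is a rough sketch of how I understand the read/write behaviour at inference time: each head keeps its own monotonic pointer into the source, steps forward until its stop probability crosses a threshold, and a target token is written only once every head has stopped. This is a simplification under my own assumptions, not the Fairseq implementation and not the SimulEval API. The `stop_prob` callable and the function name `simulate_mma_policy` are hypothetical stand-ins for the learned selection probabilities.

```python
from typing import Callable, List, Tuple


def simulate_mma_policy(
    stop_prob: Callable[[int, int, int], float],  # hypothetical: (head, target_step, source_index) -> probability
    num_heads: int,
    source_len: int,
    target_len: int,
    threshold: float = 0.5,
) -> List[Tuple[str, int]]:
    """Return the READ/WRITE action sequence of a simplified hard monotonic multihead policy."""
    actions: List[Tuple[str, int]] = []
    positions = [0] * num_heads  # each head's pointer only ever moves forward
    num_read = 0                 # number of source steps consumed so far

    for t in range(target_len):
        for h in range(num_heads):
            j = positions[h]
            while True:
                # Read more source input until index j is available.
                while num_read <= j and num_read < source_len:
                    actions.append(("READ", num_read))
                    num_read += 1
                # Stop at the end of the source, or when this head chooses index j.
                if j >= source_len - 1 or stop_prob(h, t, j) >= threshold:
                    break
                j += 1
            positions[h] = j
        # All heads have stopped, so a target token can be emitted.
        actions.append(("WRITE", t))
    return actions


# Toy stand-in: head h stops once the source index reaches t + h.
toy_actions = simulate_mma_policy(
    stop_prob=lambda h, t, j: 1.0 if j >= t + h else 0.0,
    num_heads=2,
    source_len=4,
    target_len=3,
)
print(toy_actions)
```

The interleaving of READ and WRITE actions is the part that seems hard to express with a standard `generate`-style call, which assumes the full input is available before decoding starts.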

Open source status

[x] the model implementation is available: Fairseq Implementation is available here
[ ] the model weights are available: (give details)
[ ] who are the authors: (mention them, if possible by @gh-username): Xutai Ma (@xutaima), Juan Pino, James Cross, Liezl Puzon, Jiatao Gu

Inference framework : Facebook Research SimulEval

huggingface / transformers

Support for Monotonic Multihead Attention-based Simultaneous Speech-to-text Translation #15491
