Open loretoparisi opened 8 months ago
Thank you for the well put together issue!
This doesn't seem exceptionally difficult, although we would need to add GroupNorm support to Ratchet first! I'll open that as a separate issue.
Hi, looking at the wav2vec2 params, I think a LayerNorm could cut it for the implementation.
In the model config, GroupNorm is used in the following manner: nn.GroupNorm(num_groups=self.out_conv_dim, num_channels=self.out_conv_dim..., where out_conv_dim == in_conv_dim == 512, which means 512 groups of 1 channel each, so every channel is normalized on its own.
I think a permutation of dims and LayerNorm can help. I am working on #132, but this hack could work for now 🤔
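To make the workaround above concrete, here is a minimal sketch in plain Python (no PyTorch, no Ratchet) of why it can work. It assumes a single sample laid out as [channels][time] and ignores the learned affine parameters. The helper names (normalize, group_norm, layer_norm_over_time) are made up for illustration and are not part of any library. When num_groups equals the number of channels, each group holds exactly one channel, so GroupNorm reduces to normalizing each channel over the time axis, which is what a LayerNorm over the last axis computes.

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list (biased variance)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def group_norm(x, num_groups, eps=1e-5):
    """GroupNorm without affine parameters for one sample x of shape [channels][time]."""
    channels = len(x)
    assert channels % num_groups == 0, "channels must be divisible by num_groups"
    per_group = channels // num_groups
    steps = len(x[0])
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        # Statistics are shared by every channel and time step inside the group.
        flat = normalize([v for row in group for v in row], eps)
        for i in range(per_group):
            out.append(flat[i * steps:(i + 1) * steps])
    return out

def layer_norm_over_time(x, eps=1e-5):
    """LayerNorm over the last (time) axis, without affine parameters."""
    return [normalize(row, eps) for row in x]

# Toy input: 3 channels, 4 time steps (hypothetical values).
x = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [-1.0, 0.0, 1.0, 5.0]]

a = group_norm(x, num_groups=len(x))  # one channel per group, as in the config
b = layer_norm_over_time(x)
same = all(
    math.isclose(p, q, abs_tol=1e-9)
    for row_a, row_b in zip(a, b)
    for p, q in zip(row_a, row_b)
)
print(same)
```

One caveat for a real port: GroupNorm's learned weight and bias are per channel, while a LayerNorm over the time axis would attach its affine parameters to time steps, so the scale and shift would have to be applied separately per channel after the normalization.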
GroupNorm was completed in #192 by @AmineDiro
Add support for Wav2vec2 / Connectionist Temporal Classification (CTC) phoneme models (Wav2Vec2ForCTC HuggingFace CTC model class)
Motivation: DistilWhisperLargeV2 shows impressive results, as far as I can see from the provided Space with the Next.js web app. The perfect companion to the Whisper transcription model is the Wav2Vec2 phoneme model. One example of running Whisper + Wav2vec2 together is WhisperX, which enables fast automatic speech recognition with word-level timestamps plus speaker diarization.
Other solutions: The wav2vec2-service provides a wav2vec2 implementation for fast CPU inference via ONNX.