huggingface / ratchet

A cross-platform browser ML framework.
https://ratchet.sh
MIT License

Support to Wav2vec2 #131

Open loretoparisi opened 4 months ago

loretoparisi commented 4 months ago

Add support for Wav2Vec2 / Connectionist Temporal Classification (CTC) phoneme models (the Hugging Face Wav2Vec2ForCTC model class).

Motivation: DistilWhisperLargeV2 shows impressive results, as far as I can see from the provided Space with the Next.js web app. The natural companion to the Whisper transcription model is the Wav2Vec2 phoneme model. One example that runs Whisper together with Wav2Vec2 is WhisperX, which enables fast automatic speech recognition with word-level timestamps plus speaker diarization.

Other solutions: wav2vec2-service provides a wav2vec2 implementation for fast CPU inference via ONNX.

FL33TW00D commented 4 months ago

Thank you for the well put together issue!

This doesn't seem exceptionally difficult, although we would need to add GroupNorm support to Ratchet first! I'll open that as a separate issue.

AmineDiro commented 3 months ago

Hi, looking at the wav2vec2 params, I think a LayerNorm could be enough for the implementation. In the model config, GroupNorm is used in the following manner: nn.GroupNorm(num_groups=self.out_conv_dim, num_channels=self.out_conv_dim..., where out_conv_dim==in_conv_dim==512, which means one channel per group. I think a permutation of dims followed by a LayerNorm can help. I am working on #132, but this hack could work for now 🤔
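
To make the equivalence concrete, here is a minimal pure-Python sketch (not Ratchet or PyTorch code). It assumes a single sample laid out as (channels, time) and leaves out the learnable affine parameters. The helper names group_norm, layer_norm_last_axis and max_abs_diff are hypothetical and exist only for this illustration. When num_groups equals the number of channels, every group contains exactly one channel, so group normalization reduces to normalizing each channel over the time axis, which is what a layer norm over the last axis computes.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Group-normalize a (channels, time) nested list, without affine parameters."""
    channels = len(x)
    assert channels % num_groups == 0, "channels must be divisible by num_groups"
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        rows = x[g * per_group:(g + 1) * per_group]
        # Statistics are shared by every value in the group (biased variance).
        values = [v for row in rows for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in row] for row in rows])
    return out


def layer_norm_last_axis(x, eps=1e-5):
    """Normalize each row over its last (time) axis, without affine parameters."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        scale = 1.0 / math.sqrt(var + eps)
        out.append([(v - mean) * scale for v in row])
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Toy input: 3 channels, 4 time steps (the real feature extractor uses 512 channels).
x = [[1.0, 2.0, 4.0, 8.0], [0.5, -1.5, 3.0, 2.0], [10.0, 10.5, 9.0, 11.0]]

per_channel = group_norm(x, num_groups=len(x))  # one channel per group
single_group = group_norm(x, num_groups=1)      # all channels in one group
row_wise = layer_norm_last_axis(x)

print(max_abs_diff(per_channel, row_wise))
print(max_abs_diff(single_group, row_wise))
```

The first difference should be zero up to floating-point error, while the single-group case differs because its statistics are pooled across channels. One caveat for a real port: PyTorch's GroupNorm keeps its learnable weight and bias per channel, whereas a LayerNorm over the time axis would attach them to time steps, so the per-channel scale and shift would have to be applied separately.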

FL33TW00D commented 2 months ago

GroupNorm was completed in #192 by @AmineDiro