shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
MIT License

Why not use torchaudio.compliance.kaldi.fbank??? #63

BBC-Esq closed 4 months ago

BBC-Esq commented 5 months ago

I was curious whether you've considered using this instead for the spectrogram feature extraction?

https://pytorch.org/audio/stable/generated/torchaudio.compliance.kaldi.fbank.html#torchaudio.compliance.kaldi.fbank

Apparently, faster-whisper has a seminal pull request that is using it and claims it's way better:

https://github.com/SYSTRAN/faster-whisper/pull/856

shashikg commented 4 months ago

WhisperS2T already uses torch audio and has its own batched implementation for feature extraction (this was one of the optimisations already added in WhisperS2T): https://github.com/shashikg/WhisperS2T/blob/e7f7e6dbfdc7f3a39454feb9dd262fd3653add8c/whisper_s2t/audio.py#L140-L156
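
For readers who have not seen the idea before, here is a minimal sketch of what batched log-mel feature extraction with torch can look like. It is not the code at the link above. The function names `mel_filterbank` and `batched_log_mel` are hypothetical, the triangular HTK-style filters are an assumption (Whisper ships precomputed filters), and the per-item clipping and scaling constants follow the usual Whisper-style recipe as an assumption.

```python
import math

import torch


def mel_filterbank(n_mels=80, n_fft=400, sample_rate=16000):
    """Triangular HTK-style mel filters, shape (n_mels, n_fft // 2 + 1).

    Assumption: a simple stand-in for the precomputed filters a real
    Whisper pipeline would load.
    """
    nyquist = sample_rate / 2
    mel_max = 2595.0 * math.log10(1.0 + nyquist / 700.0)
    # n_mels triangles need n_mels + 2 edge points, evenly spaced in mel.
    mel_edges = torch.linspace(0.0, mel_max, n_mels + 2)
    hz_edges = 700.0 * (torch.pow(10.0, mel_edges / 2595.0) - 1.0)

    fft_freqs = torch.linspace(0.0, nyquist, n_fft // 2 + 1)
    lower = hz_edges[:-2].unsqueeze(1)
    center = hz_edges[1:-1].unsqueeze(1)
    upper = hz_edges[2:].unsqueeze(1)

    rising = (fft_freqs - lower) / (center - lower)
    falling = (upper - fft_freqs) / (upper - center)
    return torch.clamp(torch.minimum(rising, falling), min=0.0)


def batched_log_mel(waveforms, n_fft=400, hop_length=160, n_mels=80):
    """Map (batch, samples) float waveforms to (batch, n_mels, frames)."""
    window = torch.hann_window(n_fft, device=waveforms.device)
    # torch.stft accepts a 2-D input, so the whole batch is one call.
    stft = torch.stft(
        waveforms, n_fft, hop_length, window=window, return_complex=True
    )
    power = stft[..., :-1].abs() ** 2  # drop the trailing frame

    filters = mel_filterbank(n_mels, n_fft).to(waveforms.device)
    mel = filters @ power  # broadcasts over the batch dimension

    log_mel = torch.clamp(mel, min=1e-10).log10()
    # Clip each item to 8 log10 units below its own peak, then rescale.
    peak = log_mel.amax(dim=(1, 2), keepdim=True)
    log_mel = torch.maximum(log_mel, peak - 8.0)
    return (log_mel + 4.0) / 4.0


if __name__ == "__main__":
    audio = torch.randn(3, 16000)  # three 1-second clips at 16 kHz
    features = batched_log_mel(audio)
    print(features.shape)
```

The point of batching is that the STFT and the mel projection each run once for all clips instead of once per clip, which matters most on a GPU.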