BBC-Esq closed this issue 4 months ago
WhisperS2T already uses torch audio and has its own batched implementation for feature extraction (this was one of the optimisations already added in WhisperS2T): https://github.com/shashikg/WhisperS2T/blob/e7f7e6dbfdc7f3a39454feb9dd262fd3653add8c/whisper_s2t/audio.py#L140-L156
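For anyone reading along, here is a rough sketch of what batched feature extraction can look like in plain torch. This is not the WhisperS2T code linked above; the function names `mel_filterbank` and `batched_log_mel` are made up for illustration, and the parameters (16 kHz audio, `n_fft=400`, `hop_length=160`, 80 mel bins, HTK-style mel formula) are assumptions chosen to resemble a typical Whisper front end.

```python
import math
import torch


def mel_filterbank(n_freqs: int, n_mels: int, sample_rate: int) -> torch.Tensor:
    """Triangular mel filters (HTK mel formula), shape (n_mels, n_freqs)."""
    max_mel = 2595.0 * math.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_pts = torch.linspace(0.0, max_mel, n_mels + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)    # filter edges in Hz
    freqs = torch.linspace(0.0, sample_rate / 2, n_freqs)  # STFT bin centres in Hz
    # Rising and falling slope of each triangle, evaluated at every bin.
    lower = (freqs[None, :] - hz_pts[:-2, None]) / (hz_pts[1:-1, None] - hz_pts[:-2, None])
    upper = (hz_pts[2:, None] - freqs[None, :]) / (hz_pts[2:, None] - hz_pts[1:-1, None])
    return torch.clamp(torch.minimum(lower, upper), min=0.0)


def batched_log_mel(
    waveforms: torch.Tensor,  # (batch, samples), all clips padded to the same length
    sample_rate: int = 16000,
    n_fft: int = 400,
    hop_length: int = 160,
    n_mels: int = 80,
) -> torch.Tensor:
    """Return log-mel features of shape (batch, n_mels, frames)."""
    window = torch.hann_window(n_fft, device=waveforms.device)
    # One STFT call handles the whole batch: output is (batch, n_fft // 2 + 1, frames).
    spec = torch.stft(
        waveforms, n_fft=n_fft, hop_length=hop_length,
        window=window, return_complex=True,
    )
    power = spec.abs() ** 2
    fb = mel_filterbank(n_fft // 2 + 1, n_mels, sample_rate).to(power.device)
    mel = fb @ power  # matmul broadcasts the filterbank over the batch dimension
    return torch.log10(torch.clamp(mel, min=1e-10))


batch = torch.randn(4, 16000)  # placeholder: four 1-second clips of noise
features = batched_log_mel(batch)
```

The point of batching is that the STFT and the mel projection each run once for all clips instead of once per clip, which matters most on GPU.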
I was curious whether you've considered using this instead for the spectrogram feature extraction?
https://pytorch.org/audio/stable/generated/torchaudio.compliance.kaldi.fbank.html#torchaudio.compliance.kaldi.fbank
Apparently, faster-whisper has a seminal pull request that uses it and claims it's way better:
https://github.com/SYSTRAN/faster-whisper/pull/856