Why not use torchaudio.compliance.kaldi.fbank?

shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines

MIT License

305 stars, 31 forks

Closed: BBC-Esq closed this issue 4 months ago

BBC-Esq commented 5 months ago

I was curious whether you've considered using torchaudio.compliance.kaldi.fbank instead for the spectrogram-related feature extraction.

Apparently, faster-whisper has a notable pull request that uses it and claims it's much better:

shashikg commented 4 months ago

WhisperS2T already uses torchaudio and has its own batched implementation of feature extraction (this was one of the optimisations already added to WhisperS2T): https://github.com/shashikg/WhisperS2T/blob/e7f7e6dbfdc7f3a39454feb9dd262fd3653add8c/whisper_s2t/audio.py#L140-L156