How to extract intermediate audio features with Whisper?

Hi, the Whisper model used in this codebase is based on the original implementation; however, for our purposes, we use only the encoder part of the network (here).

The "Whisper features" are extracted inside the corresponding model architectures, e.g. here.

We first prepare the waveform (to ensure it has the correct length), then convert it to a mel-spectrogram, and finally pass that spectrogram to Whisper's encoder. The encoder output is our front-end, which can later be concatenated with other front-ends such as MFCC or LFCC.
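
For illustration, the three steps above could look roughly like the sketch below. It is not the code from this repository: it assumes the `openai-whisper` package and a 16 kHz mono waveform, and `fix_length` and `whisper_encoder_features` are hypothetical helper names chosen for the example. Zero-padding is also an assumption here; the repository may prepare the waveform differently.

```python
import numpy as np

SAMPLE_RATE = 16_000            # Whisper expects 16 kHz mono audio
N_SAMPLES = 30 * SAMPLE_RATE    # the encoder works on 30-second windows


def fix_length(waveform, n_samples: int = N_SAMPLES) -> np.ndarray:
    """Trim or zero-pad a 1-D waveform to exactly n_samples (hypothetical helper)."""
    waveform = np.asarray(waveform, dtype=np.float32)[:n_samples]
    return np.pad(waveform, (0, n_samples - len(waveform)))


def whisper_encoder_features(waveform, model_name: str = "tiny"):
    """Return encoder activations for one waveform (hypothetical helper)."""
    # Imported lazily so that fix_length stays usable without these packages.
    import torch
    import whisper  # assumption: the openai-whisper package is installed

    model = whisper.load_model(model_name)
    audio = torch.from_numpy(fix_length(waveform))

    # Step 2: waveform -> log-mel spectrogram of shape (n_mels, n_frames).
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels)

    # Step 3: run only the encoder; the decoder is never used.
    with torch.no_grad():
        features = model.encoder(mel.unsqueeze(0).to(model.device))
    return features  # shape: (1, n_audio_ctx, n_audio_state)
```

Calling `whisper_encoder_features(waveform)` on a NumPy array of samples returns a tensor that can be treated as the front-end output. Combining it with MFCC or LFCC features would additionally require matching the time and feature dimensions, which this sketch does not cover.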

Piotr

piotrkawa / deepfake-whisper-features

How to extract intermediate audio features with Whisper? #16