vade / OpenAI-Whisper-CoreML

OpenAI's Whisper ported to CoreML

Generate Log Mel Spectrograms with vDSP natively #1

Open vade opened 1 year ago

vade commented 1 year ago

Today, we rely on RosaKit as a dependency, which is helpful but not great for portability. Furthermore, RosaKit's SIMD-accelerated STFT produces incorrect results, so we are using the un-accelerated variant.

Our codebase has stubs for generating the Log Mel Spectrogram, but for whatever reason our STFT / FFT implementation is producing weird output:

https://www.reddit.com/r/DSP/comments/10dpyzx/help_with_stft_log_mel_spectrogram/

vade commented 1 year ago

Ok, here's a deep dive on the state of the Log Mel Spectrogram code.

The following are the custom implementations

See processData(audio: [Int16]) -> [Float] for our custom code path. See processDataRosa(audio: [Int16]) -> [Float] for the RosaKit path.

The RosaKit path doesn't exactly use RosaKit's existing Mel, which makes some assumptions we can't make (it bakes in some post-processing that differs).

Debugging output

I've wired up a function to export one Whisper chunk (3000 × 80 Float32), normalized, so you can inspect the output data in, say, Photoshop. Just change the path in

private func encode(audio: [Int16]) throws -> MLMultiArray around line 408

I've also exported a Log Mel Spectrogram computed by Python / PyTorch in the exact same format, and have a method to run inference with that by changing the produced Mel in the encode(audio:) function above.
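
(In case it helps anyone reproduce the Photoshop trick, the export itself is nothing fancy; it is roughly the following, where the function name and signature are made up for illustration and are not the repo's actual helper.)

import Accelerate
import Foundation

// Hypothetical helper: min/max-normalize a mel buffer to 0...1 and dump it as raw
// Float32 so it can be opened as an 80 x 3000 grayscale image for eyeballing.
func dumpNormalizedMel(_ mel: [Float], to url: URL) throws {
    let n = vDSP_Length(mel.count)
    var minV: Float = 0
    var maxV: Float = 0
    vDSP_minv(mel, 1, &minV, n)
    vDSP_maxv(mel, 1, &maxV, n)

    var offset = -minV
    var scale: Float = 1 / max(maxV - minV, .leastNormalMagnitude)
    var shifted = [Float](repeating: 0, count: mel.count)
    var normalized = [Float](repeating: 0, count: mel.count)
    vDSP_vsadd(mel, 1, &offset, &shifted, 1, n)        // mel - min
    vDSP_vsmul(shifted, 1, &scale, &normalized, 1, n)  // / (max - min)

    try normalized.withUnsafeBufferPointer { try Data(buffer: $0).write(to: url) }
}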

Given a 30s audio clip (see attached), you can export 3 possible Mels (precomputed from Python, working from RosaKit, and the f'ed up one I'm making with vDSP).

30s Audio:

A045_C001_0603BW_analyzed-Mono 16kHz Float PCM.wav.zip

VimalMollyn commented 1 year ago

Hi! Thanks for this amazing work! I was able to port this implementation of the log mel spectrogram from python to swift (with vDSP) for one of my projects, so I should be able to help out with this issue! Will get back to you soon!

vade commented 1 year ago

Oh amazing @VimalMollyn - I really appreciate any insight. I'm not an audio guy, and I know from some reading that Apple's vDSP / FFT implementation does some funky interleaving of complex output. I expect a newb mistake, but my newb eyes can't find it :)

Thank you again!

vade commented 1 year ago

OOOOOOOkkaaay.

So after a long horrifying look into this, I think for now this is off the table unless someone with way more free time on their hands wants to step up.

A - My updated vDSP FFT implementation matches PyTorch when the FFT length is a power of two, but only then (a sketch of that path is below).

B - Internally, the Python FFTs use a DFT, which only works on power-of-two inputs. It seems Accelerate APIs like zrip can work on any length, but it's unclear to me whether that's just a byproduct of the API, as the results don't match other implementations?

C - When the length is not a power of two, NumPy / PyTorch use an iterative approach, incrementing in smaller power-of-two 'chunks', integrating the results, and then interpolating the results?

D - vDSP now has a DFT routine which seems to match some of the semantics of (some of?) the underlying PocketFFT implementation and its idiosyncrasies, i.e. supported strides and real/complex handling, which could drop into the logic of NumPy above.
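
For reference, here's roughly what the power-of-two path in A looks like with zrip (a sketch, not this repo's code; the function name is made up). The comparison against numpy.fft.rfft is the part that's easy to get wrong, because vDSP's forward real FFT returns values scaled by 2 and packs the Nyquist term into imagp[0].

import Accelerate

// Sketch only: forward real FFT of one power-of-two frame via vDSP_fft_zrip,
// unpacked so the squared magnitudes can be compared with numpy.fft.rfft.
func powerSpectrum(_ frame: [Float]) -> [Float] {
    let n = frame.count                                 // assumed to be a power of two
    let log2n = vDSP_Length(n.trailingZeroBitCount)
    guard let setup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2)) else { return [] }
    defer { vDSP_destroy_fftsetup(setup) }

    let halfN = n / 2
    var realp = [Float](repeating: 0, count: halfN)
    var imagp = [Float](repeating: 0, count: halfN)
    var power = [Float](repeating: 0, count: halfN + 1) // bins 0 ... n/2, like rfft

    realp.withUnsafeMutableBufferPointer { realBuf in
        imagp.withUnsafeMutableBufferPointer { imagBuf in
            var split = DSPSplitComplex(realp: realBuf.baseAddress!, imagp: imagBuf.baseAddress!)

            // Pack the real signal as interleaved complex: even samples -> real, odd -> imag.
            frame.withUnsafeBufferPointer { buf in
                buf.baseAddress!.withMemoryRebound(to: DSPComplex.self, capacity: halfN) {
                    vDSP_ctoz($0, 2, &split, 1, vDSP_Length(halfN))
                }
            }

            vDSP_fft_zrip(setup, &split, 1, log2n, FFTDirection(FFT_FORWARD))

            // vDSP's forward real FFT is scaled by 2; undo that to match numpy.fft.rfft.
            var scale: Float = 0.5
            vDSP_vsmul(split.realp, 1, &scale, split.realp, 1, vDSP_Length(halfN))
            vDSP_vsmul(split.imagp, 1, &scale, split.imagp, 1, vDSP_Length(halfN))

            // The Nyquist bin is packed into imagp[0]; pull it out before taking magnitudes.
            let nyquist = split.imagp[0]
            split.imagp[0] = 0

            vDSP_zvmags(&split, 1, &power, 1, vDSP_Length(halfN)) // |X[k]|^2 for k = 0 ..< n/2
            power[halfN] = nyquist * nyquist
        }
    }
    return power
}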

Looking closely at https://github.com/numpy/numpy/tree/main/numpy/fft the strategy seems fairly complicated, and matching all of the logic and the numeric output seems like a lot of work.

NumPy internally uses PocketFFT (pocketfft.c), which RosaKit also uses and wraps in Swift / Obj-C. PyTorch uses PocketFFT too, ported to use the Python C API as much as possible and return Python structs. This is why RosaKit, NumPy, and PyTorch all agree on FFT output.

They all use PocketFFT internally, which doesn't use SIMD / vDSP acceleration.

Sadface.jpg.gif or whatever goes here.

Some maniacs are making projects like this: https://github.com/marton78/pffft which are custom FFT implementations with their own SIMD implementations.

vade commented 1 year ago

In theory, a nice to have would be to just:

vade commented 1 year ago

Oh, I should state: the current state of our Mel work with vDSP is much, much closer than before, albeit with numerically incorrect output, which I think is due to the above PocketFFT vs vDSP zrip difference with non-power-of-two inputs.

Python: rawMel-rosa-normalized

vDSP w zrip: rawMel-normalized

vade commented 1 year ago

If I, or anyone else, actually does any of the above work to match PocketFFT output with vDSP's DFT code, ideally it should be packaged into MatFT, which is a NumPy-like library for Swift built on the Accelerate framework; it would be great to centralize that code where it belongs and just import MatFT.

hollance commented 1 year ago

Hey do you have a short script for the Python STFT that's being done? I can see your Swift code but not the Python code that it's supposed to correspond to.

vade commented 1 year ago

The Python STFT / Mel code is just the following taken from the OpenAI Whisper code base:

https://github.com/openai/whisper/blob/main/whisper/audio.py#L115

def log_mel_spectrogram(audio: Union[str, np.ndarray, torch.Tensor], n_mels: int = N_MELS):
    """
    Compute the log-Mel spectrogram of
    Parameters
    ----------
    audio: Union[str, np.ndarray, torch.Tensor], shape = (*)
        The path to audio or either a NumPy array or Tensor containing the audio waveform in 16 kHz
    n_mels: int
        The number of Mel-frequency filters, only 80 is supported
    Returns
    -------
    torch.Tensor, shape = (80, n_frames)
        A Tensor that contains the Mel spectrogram
    """
    if not torch.is_tensor(audio):
        if isinstance(audio, str):
            audio = load_audio(audio)
        audio = torch.from_numpy(audio)

    window = torch.hann_window(N_FFT).to(audio.device)
    stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
    magnitudes = stft[..., :-1].abs() ** 2

    filters = mel_filters(audio.device, n_mels)
    mel_spec = filters @ magnitudes

    log_spec = torch.clamp(mel_spec, min=1e-10).log10()
    log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
    log_spec = (log_spec + 4.0) / 4.0
    return log_spec

Thanks @hollance for looking into this at all!
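
As an aside, the tail end of that function (the clamp, log10, dynamic-range floor, and rescale) maps fairly directly onto vDSP / vForce. A sketch of just that part, not the repo's actual code:

import Accelerate

// Sketch: the last three lines of log_mel_spectrogram above, for a flat [Float]
// mel spectrogram (mel filters already applied).
func whisperLogMelPostprocess(_ melSpec: [Float]) -> [Float] {
    let count = vDSP_Length(melSpec.count)
    var n = Int32(melSpec.count)
    var hi = Float.greatestFiniteMagnitude

    // torch.clamp(mel_spec, min=1e-10)
    var lo: Float = 1e-10
    var clamped = [Float](repeating: 0, count: melSpec.count)
    vDSP_vclip(melSpec, 1, &lo, &hi, &clamped, 1, count)

    // .log10()
    var logSpec = [Float](repeating: 0, count: melSpec.count)
    vvlog10f(&logSpec, clamped, &n)

    // torch.maximum(log_spec, log_spec.max() - 8.0)
    var maxVal: Float = 0
    vDSP_maxv(logSpec, 1, &maxVal, count)
    var floorVal = maxVal - 8.0
    var floored = [Float](repeating: 0, count: melSpec.count)
    vDSP_vclip(logSpec, 1, &floorVal, &hi, &floored, 1, count)

    // (log_spec + 4.0) / 4.0  ==  log_spec * 0.25 + 1.0
    var scale: Float = 0.25
    var offset: Float = 1.0
    var result = [Float](repeating: 0, count: melSpec.count)
    vDSP_vsmsa(floored, 1, &scale, &offset, &result, 1, count)
    return result
}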

vade commented 1 year ago

I also have a test Google Colab to go through and debug results:

https://colab.research.google.com/drive/1r9ghakH8__jGqGiYHC2DXtKaW_ozdSrV?usp=sharing

vade commented 1 year ago

Here is the audio file I have been testing with, and the precomputed mel filters taken from Whisper which are loaded in the Colab:

mel_filters.npz.zip

A045_C001_0603BW_analyzed-Mono 16kHz Float PCM.wav.zip

hollance commented 1 year ago

Ah I see, the issue is non-power-of-two FFT frame lengths. That's annoying. I wonder why they chose 400 samples for Whisper. (I should really read the paper one of these days hehe.)

Normally you can do this by choosing an FFT length that is the next power of two and then padding the remaining values outside the window with zeros. So the window is still 400 samples, followed by 112 zeros to pad the length to 512.

The problem then is that you get 512 frequency bins instead of 400. However, here the goal is to get mel bins, so the number of frequency bins returned by the FFT is not very important — in theory it could still work. You just need slightly different mel filterbanks that map the desired frequency range over 512 bins instead of 400.

vade commented 1 year ago

Yea, I was thinking the same. I naively tried an experiment padding with zeros to the next power-of-two FFT size, ran vDSP zrip on the result, and compared to a PyTorch / NumPy rFFT; it was clear even on the FFT side the results were numerically very different, but perhaps I was doing something incorrect or my reasoning was flawed.

Thank you for looking into this. I was thinking that ditching the exact numerical requirements in favor of just getting a 'close enough, but fast as hell' mel spectrogram might be the better strategy.

vade commented 1 year ago

Yea, I was very curious why that sample count was used as well :)

hollance commented 1 year ago

I tried implementing it, with some amount of success but not 100% yet. ;-)

The idea is to always do the FFT using 512 samples (power of two), but to read the audio in chunks of 400 samples. The hop length remains the same.

The differences are:

import librosa
filters = librosa.filters.mel(sr=16000, n_fft=512, n_mels=80)
filters.tofile("mel_filters.data")

How this works: when you zero-pad the input to the FFT, it gives the same output but spread out over more frequency bins. It's a way to smooth out (interpolate) the frequency spectrum. By using mel filterbanks that are a bit wider than before, we still map the same frequencies to the 80 mel bins. So in theory this should give the same result as the original.
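
Roughly what that looks like on the Swift side (a sketch under the assumptions above; the helper names, and loading the librosa-exported filters as a flat row-major [Float], are my own and not anything that exists in the repo):

import Accelerate

// Sketch of the framing described above: a 400-sample Hann window, zero-padded to
// 512, then the 512-point mel filters applied as a matrix multiply.
// vDSP_HANN_DENORM should correspond to torch.hann_window's plain 0.5 * (1 - cos)
// window, but that equivalence is worth double-checking numerically.
let nFFT = 400            // Whisper's analysis window
let paddedLength = 512    // next power of two
let nMels = 80

var window = [Float](repeating: 0, count: nFFT)
vDSP_hann_window(&window, vDSP_Length(nFFT), Int32(vDSP_HANN_DENORM))

// One analysis frame: 400 windowed samples followed by 112 zeros.
func paddedFrame(from samples: [Float], start: Int) -> [Float] {
    var frame = [Float](repeating: 0, count: paddedLength)   // trailing samples stay zero
    let chunk = Array(samples[start ..< min(start + nFFT, samples.count)])
    vDSP_vmul(chunk, 1, window, 1, &frame, 1, vDSP_Length(chunk.count))
    return frame
}

// melSpec (80 x nFrames) = filters (80 x 257) * powerSpec (257 x nFrames),
// with `filters` being the librosa export above loaded as a flat row-major [Float]
// and `powerSpec` laid out frequency-major (257 rows of nFrames values each).
func applyMelFilters(filters: [Float], powerSpec: [Float], nFrames: Int) -> [Float] {
    let nBins = paddedLength / 2 + 1   // 257
    var melSpec = [Float](repeating: 0, count: nMels * nFrames)
    vDSP_mmul(filters, 1, powerSpec, 1, &melSpec, 1,
              vDSP_Length(nMels), vDSP_Length(nFrames), vDSP_Length(nBins))
    return melSpec
}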

The bad news is that even though the generated mel spectrogram looks quite like the original, there still seems to be a scaling issue of some kind. Not sure where this comes from yet.

EDIT: By "scaling issue" I mean that the frequencies appear to be in the correct places, but the magnitudes are off. It's not a constant factor. The issue isn't so bad once you take the log10 but it's obviously not optimal.

EDIT 2: I experimented a bit with scaling the magnitudes before applying the mel filters. No idea what the right scaling factor should be, but: with 512 elements in the FFT, the magnitudes will be a bit larger than with 400 elements, so perhaps we need to compensate for this by multiplying with 400.0 / 512.0. However, the mean value of the original hann window is 0.5 while the new mean is 0.39 because we added zero padding. That would require multiplying by 0.5 / 0.39 to compensate. As it turns out, these two multipliers pretty much cancel each other out. So maybe no multiplying is needed after all!
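
(Quick sanity check on those two factors, assuming the window-mean model is the right way to think about both of them: the mean of the zero-padded window is 0.5 × 400/512 ≈ 0.3906, so the combined compensation would be (400/512) × (0.5/0.3906) = (400/512) × (512/400) = 1. In other words, they cancel exactly, not just approximately.)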

The mel frequencies won't be exactly the same as in the original STFT, I guess. But they should still be valid. (One way to verify would be to save the Swift STFT output and then use Python to apply the inverse transform to turn it back into audio. It should be nearly identical to the original audio input.)

vade commented 1 year ago

Thanks @hollance - this is all super helpful. I'll give that a go and see what I come up with. It's interesting - when I calculated the zero-padded FFT result from vDSP, I was getting similar non-linear scaling factors per bin, which I assumed was 'wrong', but perhaps, like you mentioned, it's just the magnitude differences.

Thanks again, I'll give it a go!

alahel01 commented 1 year ago

For what it's worth, I implemented the same logic using FFTW instead of the Accelerate framework. FFTW allows non-power-of-two lengths. I did some (non-exhaustive) profiling, and FFTW is faster than vDSP on both my M1 and a bunch of iPhones and iPads.

ldhios commented 2 months ago

Hello, I am currently working on real-time audio and have run into issues with vDSP acceleration for the STFT. I can only use RosaKit's STFT, but it is very slow. Has there been any progress on the Swift STFT with vDSP?

ldhios commented 2 months ago

> Hi! Thanks for this amazing work! I was able to port this implementation of the log mel spectrogram from python to swift (with vDSP) for one of my projects, so I should be able to help out with this issue! Will get back to you soon!

Hello, is there any progress now?