This is the most important figure:

DDSP

On the left is the DDSP autoencoder model, with two performance improvements applied:
the decoder is a dilated CNN instead of an RNN;
the pitch-detection model is SPICE instead of CREPE.
50% of DDSP's runtime is now spent in non-trainable components (the processor_group).
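As a minimal sketch of why a dilated CNN can stand in for an RNN decoder: with dilation doubling per layer, the receptive field grows exponentially with depth while every layer stays parallel across time. The function names below are illustrative, not taken from the DDSP codebase.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D convolution with the given dilation (zero-padded on the left)."""
    k = len(kernel)
    span = (k - 1) * dilation          # how far back this layer looks
    xp = np.concatenate([np.zeros(span), x])
    out = np.zeros_like(x)
    for t in range(len(x)):
        # taps at t, t - dilation, t - 2*dilation, ...
        taps = xp[span + t - np.arange(k) * dilation]
        out[t] = taps @ kernel
    return out

def receptive_field(kernel_size, dilations):
    """Total number of past samples a stack of dilated layers can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations: 8 layers of kernel size 2 already see 256 samples.
dilations = [2 ** i for i in range(8)]
print(receptive_field(2, dilations))  # -> 256
```

An RNN must step through those 256 samples sequentially; the dilated stack computes all time steps of each layer at once, which is the performance argument for the swap.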
RAVE
On the right is a RAVE-like model: a dilated-CNN encoder and decoder sandwiched between a 16-band PQMF analysis and synthesis. The noise generator is not included here, as including it led the network to rely on the noise generator alone to match the target spectrum instead of using the waveform generator. (This still needs investigating.)
The only major non-trainable components are the PQMF analysis ("preprocessor" in the plot) and synthesis; these can probably be sped up further as well.
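The point of the 16-band decomposition is that the encoder and decoder run on sequences 16x shorter than the raw waveform. Below is a crude stand-in that shows only this resampling effect: it does the polyphase split without any filtering, so its "bands" are phases rather than frequency bands. A real PQMF additionally convolves with a cosine-modulated prototype filter; nothing here is from the RAVE codebase.

```python
import numpy as np

BANDS = 16  # matches the 16-band PQMF in the model

def analysis(x):
    """Split a waveform into BANDS channels, each BANDS times shorter.

    A real PQMF filters with a cosine-modulated prototype first; this
    stand-in only does the polyphase split.
    """
    assert len(x) % BANDS == 0
    return x.reshape(-1, BANDS).T          # shape: (BANDS, len(x) // BANDS)

def synthesis(sub):
    """Exact inverse of analysis (perfect reconstruction trivially holds)."""
    return sub.T.reshape(-1)

x = np.arange(32, dtype=float)
sub = analysis(x)
print(sub.shape)                           # -> (16, 2): 16 channels, 16x shorter
print(np.allclose(synthesis(sub), x))      # -> True
```

The CNN then only has to process the short multi-channel representation, which is where most of RAVE's speed over sample-rate models comes from; the PQMF filtering itself is the fixed cost flagged above.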
The conclusion is that we should focus on speeding up RAVE, if only because the goal is accelerating neural-network synthesis and RAVE has more to gain from accelerating the ML parts of the model.