This is the most important figure:

DDSP

On the left is the DDSP autoencoder model, with two performance improvements applied:
the decoder is a dilated CNN instead of an RNN;
the pitch-detection model is SPICE instead of CREPE.
50% of DDSP's runtime is now spent in non-trainable components (the processor_group).
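As a minimal sketch of why a dilated CNN can stand in for an RNN decoder: with dilation doubling per layer, the receptive field grows exponentially with depth while every layer stays parallel across time. The function names below are illustrative, not taken from the DDSP codebase.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Causal 1-D convolution with the given dilation (zero-padded on the left)."""
    k = len(kernel)
    span = (k - 1) * dilation          # how far back this layer looks
    xp = np.concatenate([np.zeros(span), x])
    out = np.zeros_like(x)
    for t in range(len(x)):
        # taps at t, t - dilation, t - 2*dilation, ...
        taps = xp[span + t - np.arange(k) * dilation]
        out[t] = taps @ kernel
    return out

def receptive_field(kernel_size, dilations):
    """Total number of past samples a stack of dilated layers can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations: 8 layers of kernel size 2 already see 256 samples.
dilations = [2 ** i for i in range(8)]
print(receptive_field(2, dilations))  # -> 256
```

An RNN must step through those 256 samples sequentially; the dilated stack computes all time steps of each layer at once, which is the performance argument for the swap.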
RAVE
On the right is a RAVE-like model: a dilated-CNN encoder and decoder sandwiched between a 16-band PQMF analysis and synthesis. The noise generator is not included here, as including it led the network to rely on the noise generator alone to match the target spectrum instead of using the waveform generator. (This still needs investigating.)
The only major non-trainable components are the PQMF analysis ("preprocessor" in the plot) and synthesis; these can probably be sped up further as well.
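The point of the 16-band decomposition is that the encoder and decoder run on sequences 16x shorter than the raw waveform. Below is a crude stand-in that shows only this resampling effect: it does the polyphase split without any filtering, so its "bands" are phases rather than frequency bands. A real PQMF additionally convolves with a cosine-modulated prototype filter; nothing here is from the RAVE codebase.

```python
import numpy as np

BANDS = 16  # matches the 16-band PQMF in the model

def analysis(x):
    """Split a waveform into BANDS channels, each BANDS times shorter.

    A real PQMF filters with a cosine-modulated prototype first; this
    stand-in only does the polyphase split.
    """
    assert len(x) % BANDS == 0
    return x.reshape(-1, BANDS).T          # shape: (BANDS, len(x) // BANDS)

def synthesis(sub):
    """Exact inverse of analysis (perfect reconstruction trivially holds)."""
    return sub.T.reshape(-1)

x = np.arange(32, dtype=float)
sub = analysis(x)
print(sub.shape)                           # -> (16, 2): 16 channels, 16x shorter
print(np.allclose(synthesis(sub), x))      # -> True
```

The CNN then only has to process the short multi-channel representation, which is where most of RAVE's speed over sample-rate models comes from; the PQMF filtering itself is the fixed cost flagged above.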
The conclusion is that we should focus on speeding up RAVE, if only because the goal is accelerating neural-network synthesis and RAVE has more to gain from accelerating the ML parts of the model.