lmnt-com / diffwave

DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
Apache License 2.0

Other feature representations besides mel-spect #28

Closed Irislucent closed 2 years ago

Irislucent commented 2 years ago

I'm doing music-related research, and the mel-spectrogram doesn't seem to be the best data representation for the task I'm working on, so I'm considering switching to CQT. I trained DiffWave on music mel-spectrograms and it yielded very impressive results. I'm wondering whether it makes sense to use other input representations besides mel-spectrograms, such as CQT? (Assuming the representation carries enough information.)

sharvil commented 2 years ago

I haven't tried CQT inputs with DiffWave, but I have tried learnt representations. Those experiments were successful so I'd be surprised if CQT didn't work out.

If possible, please consider submitting a PR to add a CQT preprocessing step. I'm sure others working with music would appreciate it. :)

Irislucent commented 2 years ago

Sure I will! But I haven't got any meaningful results yet; training takes quite a lot of time, and that's wearing down my confidence.

Irislucent commented 2 years ago

I'm curious, when you tried those learnt representations, did you change any hyperparameters, or even the model, to make it work? Did you encounter any difference from training with mel-spectrograms?

sharvil commented 2 years ago

I didn't change any hyperparameters. I was using a quantized learnt representation (from a VQ-VAE), which is quite different from mel spectrograms. Since adjacent quantized frames are typically discontinuous, I added a convnet to try and smooth out the conditioning signal before it's sent to the rest of the network. That was the only change I remember making.
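The smoothing idea could look something like the sketch below: a small stack of 1-D convolutions applied to the conditioning frames before they enter the rest of the network. This is a hypothetical module, not the code actually used in that experiment; the kernel size, depth, and channel count are assumptions.

```python
# Hypothetical smoothing convnet for discontinuous quantized conditioning
# frames. Kernel size, depth, and activation are assumptions, not the
# actual configuration used in the experiment described above.
import torch
import torch.nn as nn


class ConditioningSmoother(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # x: [batch, channels, frames] conditioning signal.
        # Padding keeps the frame count unchanged, so the smoothed output
        # can be dropped in wherever the raw conditioning was used.
        return self.net(x)
```

Because the output shape matches the input, the module can sit between the (dequantized) VQ-VAE codes and the existing conditioning pathway without other architectural changes.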

The experiment was successful in the sense that DiffWave was able to act as a decoder for the quantized inputs. Unfortunately, my VQ-VAE model was poorly tuned so the audio quality was worse than with mel spectrograms.