Closed voodoohop closed 4 years ago
Hi, thanks for the in-depth study and posting all the resources.
This is actually expected behavior at the moment. As we said in the paper, when training on a full dataset like NSynth the f0 encoder model can get a small loss and learn to generate audio that a CREPE model classifies as having the right f0, but it does not currently estimate the correct f0 internally. It often falls into the local minimum of predicting an integer multiple of f0 and then doing its best to match the data by manipulating the harmonic distribution. This problem is even more exacerbated when fitting a single datapoint, since you lose the stochasticity of SGD that can help the optimization escape such minima.
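A quick numpy illustration of why an integer multiple of f0 is such a stubborn local minimum (the 220 Hz fundamental and the harmonic counts here are arbitrary example values, not from the thread): every partial of an octave-up estimate lands exactly on an even-numbered partial of the true f0, so the model can match much of the spectrum by reshaping its harmonic distribution, and the reconstruction loss gives little gradient signal to fix the f0 itself.

```python
import numpy as np

f0 = 220.0            # hypothetical true fundamental (Hz)
n_harmonics = 8       # partials per harmonic stack

# Partial frequencies of the true f0 and of an octave-error estimate (2*f0).
true_partials = f0 * np.arange(1, n_harmonics + 1)
octave_partials = (2 * f0) * np.arange(1, n_harmonics + 1)

# The partials both stacks can produce: the octave-up stack covers every
# even-numbered partial of the true tone, and only misses the odd ones.
overlap = np.intersect1d(true_partials, octave_partials)
print(overlap)
```

Within the range of the true stack, the shared partials are 440, 880, 1320 and 1760 Hz, i.e. half the spectrum is reachable without ever correcting f0.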
We have some follow-up work that overcomes these challenges, and we are working on getting it prepared for a conference submission next month, at which time I'll clean it up and submit it to the repo. Sorry for the delay, or if the original paper was misleading, but I think there are actually several ways to tackle this challenge, and we should hopefully have them robust and added soon.
Understood. That's good to know and thanks for all the amazing work. I'm really excited about the developments. Should I close this issue for now?
Yah, and I look forward to posting more when I have it :).
Description
I am having trouble training models that don't rely on an f0 estimate from the CREPE pitch estimator. In my tests, whenever fundamental frequency estimation is part of the differentiable graph, I cannot get any convergence of the additive synthesizer at all.
To reproduce it, I create a batch consisting of one sample generated with the additive synth as in the synths and effects tutorial notebook. I then try overfitting an autoencoder on that one sample, with code adapted from the training on one sample notebook.
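The target sample described above can be sketched in plain numpy. This is not the ddsp implementation, just a minimal stand-in additive synth (a fixed harmonic distribution weighting sinusoids at integer multiples of f0); the f0, amplitudes, sample rate, and duration are illustrative values, not the ones from the notebook.

```python
import numpy as np

def additive_synth(f0_hz, harmonic_amps, sample_rate=16000, duration=1.0):
    """Sum of sinusoids at integer multiples of f0, weighted per harmonic.

    A plain-numpy sketch of additive synthesis, not the ddsp synth itself.
    """
    t = np.arange(int(sample_rate * duration)) / sample_rate
    harmonics = np.arange(1, len(harmonic_amps) + 1)
    # One sinusoid per harmonic: shape (n_harmonics, n_samples).
    phases = 2 * np.pi * f0_hz * harmonics[:, None] * t[None, :]
    return (np.asarray(harmonic_amps)[:, None] * np.sin(phases)).sum(axis=0)

# A single synthetic target, analogous to the one-sample batch in the notebook.
audio = additive_synth(f0_hz=440.0, harmonic_amps=[1.0, 0.5, 0.25, 0.125])
print(audio.shape)  # (16000,)
```

Because the target is generated by the same family of synthesizer the decoder uses, a decoder with the correct f0 should be able to reconstruct it essentially perfectly.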
The decoder uses an additive synthesizer too, so in theory it should easily reconstruct the sample. Here is a Colab notebook that demonstrates the behavior. To make the model converge, replace
f0_encoder=f0_encoder
with
f0_encoder=None

Results
Original Audio
Reconstruction with an f0 encoder (3000 training steps)
After the first few training steps, the loss stops improving, plateauing around 18 to 19.
Reconstruction with f0 from Crepe (100 training steps)
The model converges immediately, with the loss dropping to about 3 in a short time.
Things I have tried
This happens even when just trying to fit one sample. I also tried fitting multiple samples, without success.
To Reproduce
Colab notebook