magenta / ddsp

DDSP: Differentiable Digital Signal Processing
https://magenta.tensorflow.org/ddsp
Apache License 2.0

training an autoencoder without z #66

Closed james20141606 closed 4 years ago

james20141606 commented 4 years ago

I noticed that in both the `ddsp/training/gin/models/ae.gin` and `ddsp/training/gin/models/ae_abs.gin` settings, the model uses z as the latent space. I tried replacing `Autoencoder.decoder = @decoders.ZRnnFcDecoder()` with `Autoencoder.decoder = @decoders.RnnFcDecoder()` to drop z and test the model's performance; is that the right way to do it? I found that without z, using `ae_abs.gin` (which jointly learns an encoder for f0), the loss becomes NaN after around 2000 steps. Could this be caused by the missing z latent?
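For reference, the swap described above would look roughly like this in the gin file (a sketch only; the decoder's other bindings, such as its input keys, may also need updating depending on your ddsp version):

```
# models/ae.gin (modified): decode from f0/loudness only, without z.
Autoencoder.decoder = @decoders.RnnFcDecoder()
```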

jesseengel commented 4 years ago

We have trained autoencoder models without latents, but you may need to adjust some things. You also want to make sure that the encoder is not producing latents if the decoder is not using them.
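One way to ensure the encoder produces no unused latents (a sketch; `Autoencoder.encoder = None` is how the latent-free solo-instrument configs handle it, but for `ae_abs.gin` the jointly learned f0 encoder must be kept, so only the z branch should be removed):

```
# Latent-free setup: drop the encoder entirely if nothing is learned from audio.
Autoencoder.encoder = None
```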

james20141606 commented 4 years ago

Thanks for the reply! I only changed the decoder; I think the encoder part has no influence on the result (even if it still produces a latent that the decoder never uses, no gradient flows through it, so it only affects training speed). I also have a somewhat unrelated question about loudness generation. I know DDSP computes loudness through fixed rules; what if we want to train a neural network to generate loudness instead? I found that an ordinary network might fail.

jesseengel commented 4 years ago

The loudness signals for most audio are actually very well behaved, so it should be fine to separately train a network on it (assuming you have enough data).


james20141606 commented 4 years ago

I agree that generating loudness from audio using neural networks should not be so hard. But what about trying to generate loudness information from some other kind of signal instead of audio?

jesseengel commented 4 years ago

Yup, that should be fine (my original suggestion was actually to model the loudness autoregressively), and in fact we have some research going on in a related direction currently.


james20141606 commented 4 years ago

Wow, sounds interesting! I look forward to your next paper! Do you mean a model like WaveNet might be useful for modeling loudness/f0 from a non-audio signal? I am afraid that would make the model heavier and harder to train.

jesseengel commented 4 years ago

You can also train small autoregressive models like a simple RNN, since the signal is much simpler than an audio waveform. I'm going to close this issue for now.
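To illustrate the scale of model this suggests, here is a hypothetical NumPy sketch (not part of the DDSP codebase; weights are random stand-ins for trained parameters) of a tiny autoregressive RNN that generates a loudness sequence frame by frame, feeding each prediction back in as the next input:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16  # a small RNN suffices: loudness is far simpler than a waveform

# Randomly initialized weights stand in for trained parameters.
W_in = rng.normal(scale=0.1, size=(HIDDEN, 1))
W_h = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.1, size=(1, HIDDEN))

def step(x, h):
    """One RNN step: update the hidden state, emit the next loudness value."""
    h = np.tanh(W_in @ x + W_h @ h)
    return W_out @ h, h

def generate(seed_loudness, n_frames):
    """Autoregressively roll out n_frames of loudness from a seed value (dB)."""
    x = np.array([[seed_loudness]])
    h = np.zeros((HIDDEN, 1))
    frames = []
    for _ in range(n_frames):
        x, h = step(x, h)
        frames.append(float(x[0, 0]))
    return frames

loudness = generate(seed_loudness=-30.0, n_frames=250)
```

Conditioning on a non-audio signal would just mean concatenating that signal's features onto the input `x` at each step; the autoregressive rollout stays the same.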