magenta / ddsp

DDSP: Differentiable Digital Signal Processing
https://magenta.tensorflow.org/ddsp
Apache License 2.0

can we just reconstruct the waveform from fundamental frequency and loudness? #40

Closed · james20141606 closed this issue 4 years ago

james20141606 commented 4 years ago

Hey, I just got a reconstruction result that seems too good to be true. I can tell the idea behind the model is strong, but it still amazes me. I used your demo autoencoder to reconstruct audio of the human voice and the result is really good. But I don't understand how this can be achieved using only f0 and loudness information. For example, the vowels 'a' and 'e' are definitely different; how is that reflected in f0 and loudness? I thought there might be some difference between musical instruments and the human voice, and I just can't see how these two features are enough.
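
A minimal sketch of extracting those two conditioning features, assuming `ddsp.spectral_ops` exposes `compute_f0` and `compute_loudness` with the signatures shown below (an assumption; check the codebase for the exact API and defaults):

```python
# Sketch only: extract the f0 and loudness features the demo autoencoder conditions on.
# Function names/signatures are assumed from the public ddsp codebase.
import numpy as np
import ddsp.spectral_ops as spectral_ops

sample_rate = 16000  # assumed demo default
frame_rate = 250     # assumed frames per second used by the colabs

# Stand-in for real audio; replace with a loaded mono waveform.
audio = np.random.uniform(-1.0, 1.0, sample_rate * 4).astype(np.float32)

# f0 is estimated with CREPE under the hood; loudness is a perceptually weighted log power.
f0_hz, f0_confidence = spectral_ops.compute_f0(audio, sample_rate, frame_rate)
loudness_db = spectral_ops.compute_loudness(audio, sample_rate, frame_rate)

# These two time series are the only conditioning the demo decoder sees.
print(f0_hz.shape, loudness_db.shape)
```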

By the way, if I want to add z as a latent space besides f0 and loudness, how can I tell the model to use it? I think you mentioned in the paper that z may correspond to timbre information, but I couldn't find it in timbre_transfer.ipynb. Can you achieve timbre transfer without z?

james20141606 commented 4 years ago

Another thing that seems odd: when I compute the Pearson correlation between the spectrograms of the original and reconstructed waveforms, I find the correlation coefficients fall in a very narrow range. Why is the model so stable at reconstructing the waveform and its corresponding spectrogram?
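
For context, one way such a correlation could be computed (a sketch, assuming librosa and scipy; the exact procedure used above is not specified in the thread):

```python
# Sketch: Pearson correlation between log-magnitude spectrograms of the
# original and reconstructed audio. Any STFT implementation would do.
import numpy as np
import librosa
from scipy.stats import pearsonr

def spectrogram_correlation(original, reconstruction, n_fft=2048, hop_length=512):
    """Pearson r between flattened log-magnitude spectrograms."""
    s_orig = np.log(np.abs(librosa.stft(original, n_fft=n_fft, hop_length=hop_length)) + 1e-6)
    s_rec = np.log(np.abs(librosa.stft(reconstruction, n_fft=n_fft, hop_length=hop_length)) + 1e-6)
    n = min(s_orig.shape[1], s_rec.shape[1])  # guard against off-by-one frame counts
    r, _ = pearsonr(s_orig[:, :n].ravel(), s_rec[:, :n].ravel())
    return r
```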

james20141606 commented 4 years ago

Another question I am really curious about: if we'd like to do human voice reconstruction from multiple sources (different people), should we consider timbre and include z in the model? Also, since the model does such a good job at waveform reconstruction, have you considered using it for TTS? Could we use an encoder to generate features like f0 and loudness from text or some other signal, and then generate the waveform?

jesseengel commented 4 years ago

Hi, glad it's working for you. I'd be happy to hear an example reconstruction if you want to share one. My guess is that the model is probably overfitting quite a lot to a small dataset. In that case, a given segment of loudness and f0 corresponds to a specific phoneme because the dataset doesn't have enough variation. For a large dataset, there will be one-to-many mappings that the model can't handle without more conditioning (latent or labels). We don't use the latent "z" variables in the models in the timbre_transfer and train_autoencoder colabs, but the encoders and decoders are in the code base and used in models/nsynth_ae.gin as an example.
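
A rough sketch of wiring the z encoder in, modeled on models/nsynth_ae.gin; the class names and arguments below are assumptions based on the ddsp.training codebase, so check the gin file for the settings actually used:

```python
# Sketch: an autoencoder setup that infers a latent z in addition to f0 and loudness.
# Class names and kwargs are assumed from ddsp.training; verify against models/nsynth_ae.gin.
from ddsp.training import decoders, encoders

# Encoder that infers a time-distributed latent z from MFCCs of the audio.
encoder = encoders.MfccTimeDistributedRnnEncoder(
    rnn_channels=512, rnn_type='gru', z_dims=16, z_time_steps=125)

# Decoder that consumes z alongside the scaled f0 and loudness features.
decoder = decoders.RnnFcDecoder(
    rnn_channels=512, rnn_type='gru', ch=512, layers_per_stack=3,
    input_keys=('ld_scaled', 'f0_scaled', 'z'),
    output_splits=(('amps', 1),
                   ('harmonic_distribution', 60),
                   ('noise_magnitudes', 65)))
```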

My intuition is that the model should work well for TTS (the sinusoidal model it's based on is used in audio codecs, so we know it should be able to fit it), but you just need to add grapheme or phoneme conditioning.
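
An illustrative sketch of that kind of conditioning (plain Keras, not DDSP's API; all layer sizes and the phoneme inventory are placeholders): embed phoneme IDs and concatenate them with the f0 and loudness features before the decoder.

```python
# Sketch: phoneme conditioning concatenated with f0/loudness ahead of a decoder.
import tensorflow as tf

n_phonemes = 64  # hypothetical phoneme inventory size
phoneme_ids = tf.keras.Input(shape=(None,), dtype=tf.int32)  # [batch, time]
f0_scaled = tf.keras.Input(shape=(None, 1))                  # [batch, time, 1]
loudness_scaled = tf.keras.Input(shape=(None, 1))            # [batch, time, 1]

phoneme_emb = tf.keras.layers.Embedding(n_phonemes, 64)(phoneme_ids)
conditioning = tf.keras.layers.Concatenate(axis=-1)(
    [phoneme_emb, f0_scaled, loudness_scaled])

# A decoder (here a GRU stand-in) would map this conditioning to synthesizer
# controls such as harmonic amplitudes and noise magnitudes.
controls = tf.keras.layers.GRU(512, return_sequences=True)(conditioning)
model = tf.keras.Model([phoneme_ids, f0_scaled, loudness_scaled], controls)
```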

james20141606 commented 4 years ago

Thanks a lot for your reply!

  1. I put the reconstruction result analysis here: https://drive.google.com/file/d/1DgjxlMLd-hYtYq4_O99oclqgfliL3Cqx/view
  2. For the overfitting issue: I use the SHTOOKA dataset, which contains around 1 hour and 30 minutes of audio. I think that is not too small for the model to overfit? I am still amazed that the model handles the data so well, since I have tried the Parrotron model for spectrogram reconstruction on the SHTOOKA dataset and it could not converge…
  3. I am not sure I understood "more conditioning (latent or labels)" here:

    For a large dataset, there will be one-to-many mappings that the model can't handle without more conditioning (latent or labels).

Do you mean we can add conditioning besides z, f0, and loudness? You also mentioned that I could add grapheme or phoneme conditioning for a TTS task. Do you mean using an encoder to extract phoneme, grapheme, or other conditioning, concatenating it with z, f0, and loudness (do we even have f0 and loudness in a TTS task?), and then feeding them to the decoder?

  4. I am also curious whether I can further improve the result by adding z conditioning and using ResNet instead of the CREPE model, or will that make training harder? Have you tried more complicated models like a VAE or GAN using DDSP?

jesseengel commented 4 years ago

There are a lot of options to try; we only have results from our published work. If you want control over the output, you need to condition on variables that you know how to control. For instance, most TTS systems use only phonemes or text as conditioning, and then let the network figure out what to do with them. You can try to figure out how to interpret z, but it is not trained to be interpretable as is.

james20141606 commented 4 years ago

Thanks for your reply! By conditioning, do you mean the features after the encoder part? If we want more conditioning, do you mean we could use some network to encode phonemes or graphemes as conditioning? Should I try to make the conditioning similar for similar words? Is there a rule to follow (for finding proper conditioning)?

jesseengel commented 4 years ago

The Tacotron papers (https://google.github.io/tacotron/) have extensively investigated different types of TTS conditioning. I suggest you check out some of their work.