vincentherrmann / pytorch-wavenet

An implementation of WaveNet with fast generation
MIT License

Why is the output of WaveNet generate() always the same? #11

Open littleTwelve opened 6 years ago

littleTwelve commented 6 years ago

When I use the code to train a model, training seems to go well. However, when I use the trained model to generate data, I get a sequence of numbers that all have the same value. For example, if I input a 1*5000 vector [2,99,34,...,45,27,33] and then call generate(), I get [2,99,34,...,45,27,33,33,33,33,...,33,33,33]. As you can see, every generated number is the same, and what is stranger still is that they all equal the last number of the input. I can't find what's wrong with the code, and I would appreciate it if someone could give me some advice.
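For readers following along, here is a minimal sketch of what an autoregressive generation loop does in general: predict one sample from the most recent window, append it, and repeat. It is not the repository's generate(); `toy_model` is a hypothetical stand-in for a trained network, and picking the argmax (rather than sampling) is an assumption made to keep the output deterministic.

```python
def toy_model(window, classes):
    """Hypothetical stand-in for a trained network: returns one score per class.

    It puts all its weight on (last sample + 1) % classes, so the expected
    continuation is easy to verify by eye.
    """
    scores = [0.0] * classes
    scores[(window[-1] + 1) % classes] = 1.0
    return scores


def generate(model, seed, num_samples, receptive_field, classes=256):
    """Autoregressive generation: predict one sample, append it, repeat."""
    generated = list(seed)
    for _ in range(num_samples):
        window = generated[-receptive_field:]      # most recent context only
        scores = model(window, classes)
        next_sample = max(range(classes), key=scores.__getitem__)  # argmax
        generated.append(next_sample)              # feed the prediction back in
    return generated


out = generate(toy_model, seed=[2, 99, 34], num_samples=4, receptive_field=2)
print(out)  # [2, 99, 34, 35, 36, 37, 38]
```

If a loop of this shape keeps returning the last seed value, one plausible place to look is how the window is prepared before it reaches the network, since the loop itself only copies predictions back into the sequence.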

vincentherrmann commented 6 years ago

Have you tried the generate_fast() method? I think there is probably a bug in the generate() function. I will try to fix it, but you shouldn't really be using it anyway since it's painfully slow.

littleTwelve commented 6 years ago

Thank you for your reply! I will try it later. Actually, I rewrote your code based on my own understanding of it, so I use generate() simply because I can follow it clearly; I don't understand generate_fast() well yet. You said there is probably a bug in the generate() function. Although generate() doesn't give me what I want, I have no idea what's wrong with it. Could you explain the bug more clearly?

vincentherrmann commented 6 years ago

The problem in the generate() function was simply that I didn't one-hot encode the input. I have fixed it now, but let me know if there's something I can do to help your understanding!
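As an illustration of the one-hot encoding step mentioned here, the sketch below turns a list of integer samples into a classes x length grid, which is the per-item layout implied by the (N, C, L) convention discussed in this thread. The helper `one_hot` is hypothetical and written in plain Python for clarity; it is not the code from the repository.

```python
def one_hot(samples, classes):
    """Encode integer samples as a classes x length grid of 0.0/1.0 values."""
    encoded = [[0.0] * len(samples) for _ in range(classes)]
    for t, s in enumerate(samples):
        if not 0 <= s < classes:
            raise ValueError(f"sample {s} at position {t} is outside [0, {classes})")
        encoded[s][t] = 1.0
    return encoded


window = [2, 0, 3, 3]
x = one_hot(window, classes=4)
for row in x:
    print(row)
# [0.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0]
# [1.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 1.0, 1.0]
```

In PyTorch, torch.nn.functional.one_hot does the same job on integer tensors, although it places the class axis last, so the result would need a transpose to match a (N, C, L) layout.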

littleTwelve commented 6 years ago

Thanks again! I just found out that I have a big misunderstanding of WaveNet, and I'm trying to correct it. So I'm afraid it will be a day or two before I can discuss the generate_fast() method with you. Sorry about that.

littleTwelve commented 6 years ago

I wonder why you need dilate() in wavenet_modules.py instead of just using the 'dilation' parameter of nn.Conv1d?

littleTwelve commented 6 years ago

In your code, there is '(N, C, L), where N is the input dilation', but in nn.Conv1d 'N' is the batch size, so why is N the input dilation?

vincentherrmann commented 6 years ago

Here I answered the question regarding the dilate() function. The convolution is executed in parallel for every index in the first dimension, which in the wavenet architecture is both the dilation and the batch number. So, to be exact, N = dilation * minibatch_count.
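One way to see why the dilation can live in the first dimension: a dilated convolution over a single sequence produces the same values as an ordinary convolution run separately over the interleaved subsequences, and those subsequences can be stacked like extra batch entries. Below is a minimal single-channel sketch in plain Python. The helpers `conv_k2`, `dilate_1d`, and `undilate_1d` are made up for illustration and are not the repository's dilate(); a kernel size of 2 and a length divisible by the dilation are assumed.

```python
def conv_k2(x, w0, w1, dilation=1):
    """Kernel-size-2 convolution without padding: y[t] = w0*x[t] + w1*x[t+dilation]."""
    return [w0 * x[t] + w1 * x[t + dilation] for t in range(len(x) - dilation)]


def dilate_1d(x, dilation):
    """Split x into `dilation` interleaved subsequences (one row per phase)."""
    assert len(x) % dilation == 0, "length must be divisible by the dilation"
    return [x[p::dilation] for p in range(dilation)]


def undilate_1d(rows):
    """Interleave the rows back into a single sequence."""
    return [value for step in zip(*rows) for value in step]


x = [3, 1, 4, 1, 5, 9, 2, 6]
w0, w1, d = 2, -1, 2

direct = conv_k2(x, w0, w1, dilation=d)

# Same result: treat the d phases as separate "batch" rows,
# run an ordinary (dilation 1) convolution on each, then interleave.
rows = dilate_1d(x, d)
via_rows = undilate_1d([conv_k2(row, w0, w1) for row in rows])

print(direct)    # [2, 1, 3, -7, 8, 12]
print(via_rows)  # [2, 1, 3, -7, 8, 12]
```

With a real minibatch, each of the original batch entries contributes `d` such rows, which matches the N = dilation * minibatch_count description above.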

littleTwelve commented 6 years ago

Thanks! I also have a question about the item length. In your code, item_length = receptive_field + output_length - 1, and I noticed that your output_length is always a small number like 32, 48, or 16. What I used to do at the training stage was set item_length to a large number, for example 21600 (because I seem to remember DeepMind mentioning in their paper that they need 2 minutes of data to generate 1 second of data), which may correspond to a very large output_length or a deeper wavenet in your code. Then I use only the output, whose length is 17507 (if receptive_field is 4093, then 21600 - 4093 = 17507), to compute the cross entropy. Is my idea reasonable?
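The arithmetic behind these numbers can be checked with a short sketch. The function names are hypothetical, and the configuration of 4 blocks with 10 layers each, kernel size 2, and dilations doubling within a block is an assumption chosen because it reproduces the receptive field of 4093 quoted above; the thread does not state the actual settings.

```python
def receptive_field(blocks, layers, kernel_size=2):
    """Receptive field of stacked dilated convolutions.

    Assumes each block uses dilations 1, 2, 4, ..., 2**(layers - 1).
    """
    per_block = (kernel_size - 1) * (2 ** layers - 1)
    return blocks * per_block + 1


def output_length(item_length, rf):
    """Invert item_length = receptive_field + output_length - 1."""
    return item_length - rf + 1


rf = receptive_field(blocks=4, layers=10)
print(rf)                        # 4093
print(output_length(21600, rf))  # 17508
print(rf + 32 - 1)               # 4124, item length for output_length = 32
```

Under the formula as quoted, an item of 21600 samples would yield 17508 outputs, one more than the 17507 mentioned above, because the last sample of the receptive field already produces the first output.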

vincentherrmann commented 6 years ago

Intuitively it makes sense for the output_length to have the same order of magnitude as the receptive field of the model. Currently I use an output length of 4096 most of the time (you can see the stuff I'm working on in the parallel branch). If the output length is longer the computation time increases linearly and it would be better to use bigger mini batches instead. I'm not sure which passage of the paper you're referring to, though...

littleTwelve commented 6 years ago

Wow! You are awesome! I happen to be learning how to build a conditional wavenet over the next few months, so I think I will be bothering you a lot. Could you tell me what your conditioning input is? Types of music, or something else?

vincentherrmann commented 6 years ago

I'm trying to make the model learn the structure of a piece/song and condition the wavenet on a local time embedding. Hopefully this allows generating longer and more musically interesting sequences. It's a bit complicated; if it works, I will write a blog post about it.

littleTwelve commented 6 years ago

That's great! I am looking forward to it.

littleTwelve commented 6 years ago

It seems that the problem I mentioned two days ago has nothing to do with the generating function. I can use your code to generate a sine wave very well. With my own dataset, however, I get nothing but a straight line, even though the training loss is 1e-08.

littleTwelve commented 6 years ago

Are there any tricks for training a wavenet? No matter how I change the wavenet's parameters, I get nothing but a straight line. Do you have any suggestions? @vincentherrmann

HTT1995 commented 5 years ago

I had the same problem, not only with the generate function but also with the training result. I always get raw audio output such as [20 20 20 20 20 20 20 20...], and I don't know why. I have checked my code very carefully, but it still doesn't work. I would really appreciate it if someone could help me.

littleTwelve commented 5 years ago

I think you could try increasing the values of mu, residue, and skip, for example mu=64, skip=64, and residue=512. That is how I solved my problem.
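If mu here refers to the mu-law companding commonly used to quantize audio for WaveNet (an assumption; the comment does not say), then mu sets the number of output classes, mu + 1, and therefore how finely the signal is resolved. The sketch below shows the standard mu-law formula in plain Python. The function names and the rounding convention are illustrative choices, not taken from the repository.

```python
import math


def mu_law_encode(x, mu=255):
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid input range
    compressed = math.copysign(1.0, x) * math.log1p(mu * abs(x)) / math.log1p(mu)
    return int((compressed + 1) / 2 * mu + 0.5)  # shift to [0, mu] and round


def mu_law_decode(q, mu=255):
    """Approximate inverse of mu_law_encode."""
    compressed = 2 * q / mu - 1  # back to [-1, 1]
    return math.copysign(1.0, compressed) * ((1 + mu) ** abs(compressed) - 1) / mu


print(mu_law_encode(-1.0), mu_law_encode(0.0), mu_law_encode(1.0))  # 0 128 255
```

A larger mu gives more classes and a smaller quantization step, at the cost of a wider output layer.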

HTT1995 commented 5 years ago

Thank you for your advice. I have tried different combinations of these parameters, but it doesn't seem to work. The input [batchsize*classes*length] goes through all the conv layers, and then I find that the columns of the one_hot output [batchsize*classes*output_length] are all very similar, so after de_one_hot the raw audio output [batchsize*1*output_length] is the same everywhere. Do you have any other suggestions?
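To make the described symptom concrete, here is a small sketch of decoding by taking the highest-scoring class at each time step. The name de_one_hot comes from the comment above, but this implementation is a hypothetical reconstruction, and the score values are invented for illustration.

```python
def de_one_hot(scores):
    """Pick the highest-scoring class at each time step.

    `scores` is a classes x length grid (one row per class).
    """
    classes = len(scores)
    length = len(scores[0])
    return [max(range(classes), key=lambda c: scores[c][t]) for t in range(length)]


# Columns differ slightly, but class 2 always wins.
scores = [
    [0.10, 0.12, 0.09],
    [0.20, 0.18, 0.21],
    [0.70, 0.70, 0.70],
]
decoded = de_one_hot(scores)
print(decoded)            # [2, 2, 2]
print(len(set(decoded)))  # 1 distinct value, i.e. a flat line
```

Taking the argmax keeps only the winning class, so near-identical columns decode to a constant signal. Counting the distinct values in the training targets in the same way may be a quick check on whether the data itself is nearly constant, though the thread does not establish that as the cause.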

ZXY1231 commented 3 years ago

> Thank you for your advice. I have tried different combinations of these parameters, but it doesn't seem to work. The input [batchsize*classes*length] goes through all the conv layers, and then I find that the columns of the one_hot output [batchsize*classes*output_length] are all very similar, so after de_one_hot the raw audio output [batchsize*1*output_length] is the same everywhere. Do you have any other suggestions?

Hello, I ran into the same problem as you. Did you manage to solve it?