vincentherrmann / pytorch-wavenet

An implementation of WaveNet with fast generation
MIT License

Question about size of receptive field vs input length from dataset #16

Closed (ironflood closed this 6 years ago)

ironflood commented 6 years ago

Hi @vincentherrmann. Thanks a lot for sharing; I'm learning a great deal from your code! This isn't an issue, only a question about the code.

Visually, in the code you describe the model input (= receptive_field?) and the target as:

           |----receptive_field----|
                                 |--output_length--|
 example:  | | | | | | | | | | | | | | | | | | | | |
 target:                           | | | | | | | | | |

You also said a few days ago in a similar thread:

The item_length is the number of samples the network gets as input during training, and output_length is the number of consecutive samples the network outputs. If we were to output only one sample (output_length = 1), then we would have item_length = model.receptive_field. But for each additional output sample, we also need an additional input sample, so item_length = model.receptive_field + (output_length - 1). Of course during generation we have to set output_length = 1.
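For example, plugging in the hypothetical values receptive_field = 1024 and output_length = 16 (illustrative numbers, not necessarily this repo's defaults), the arithmetic works out as:

    # Illustrative numbers, not necessarily the repo's defaults.
    receptive_field = 1024
    output_length = 16

    # One extra input sample is needed per additional output sample:
    item_length = receptive_field + (output_length - 1)  # 1024 + 15 = 1039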

However, from my observation, the target data of output_length=16 shares the same values as the end of the input sequence generated by WavenetDataset, apart from the last value. Shouldn't the target sequence instead be the data that follows the input? Put the other way around, I don't understand why the target sequence overlaps the last output_length-1 values of the input; it should be the future data to be predicted, of length output_length. And shouldn't the one_hot input sequence be of length model.receptive_field?

To keep it visual, like in the code, this is what I observe:

           |---------one hot input by dataset--------|
           |----receptive_field----|
                                     |--target_length--|
 example:  | | | | | | | | | | | | | | | | | | | | | | |
 target:                             | | | | | | | | | |
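A minimal slicing sketch that reproduces exactly this picture (the array and variable names are hypothetical, not the actual WavenetDataset code):

    import numpy as np

    # Hypothetical quantized audio signal; names are illustrative.
    audio = np.random.randint(0, 256, size=100_000)
    receptive_field = 1024
    output_length = 16
    item_length = receptive_field + output_length - 1

    example = audio[:item_length]                    # network input, 1039 samples
    target = audio[receptive_field:item_length + 1]  # 16 samples to predict

    # The target repeats the last output_length - 1 samples of the input and
    # adds exactly one new sample at the end, which matches my observation.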

Any pointers would be greatly appreciated :) If I had to guess, I'd say I'm missing something in the training loop: maybe it slides a window of size receptive_field over the input to predict the next sample one position at a time?

vincentherrmann commented 6 years ago

I think you're guessing correctly. WaveNet can only ever predict one sample ahead, so all but the last target sample also have to be available as input. The default case, as explained in the paper, is output_length=1. Using a larger output_length is just a trick for more efficient training, because we can reuse most of the values already computed in the hidden layers when predicting the next sample. It looks something like this (o is the predicted sample; a rough training-step sketch follows the diagram):

           |----receptive_field----|o
             |----receptive_field----|o
               |----receptive_field----|o
                 |----receptive_field----|o
                   |----receptive_field----|o
                     |----receptive_field----|o
                       |----receptive_field----|o
                         |----receptive_field----|o
                           |----receptive_field----|o
                                 |--output_length--|
 example:  | | | | | | | | | | | | | | | | | | | | |
 target:                           | | | | | | | | | |
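In training, this means a single forward pass produces all output_length predictions at once, and the loss is computed against the whole target sequence. Here is a rough sketch of such a training step, assuming a model that maps a one-hot input of length receptive_field + output_length - 1 to 256-way logits at each of the output_length positions (the shapes and names are illustrative, not the exact API of this repo):

    import torch
    import torch.nn.functional as F

    def training_step(model, optimizer, one_hot_input, target):
        # one_hot_input: (batch, 256, receptive_field + output_length - 1)
        # target:        (batch, output_length), class indices of the next samples
        optimizer.zero_grad()
        logits = model(one_hot_input)  # (batch, 256, output_length)
        # Cross entropy over all output positions at once; every position is
        # still a one-sample-ahead prediction, just computed in parallel.
        loss = F.cross_entropy(logits, target)
        loss.backward()
        optimizer.step()
        return loss.item()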

ironflood commented 6 years ago

Thank you @vincentherrmann, your explanation is much appreciated!