vincentherrmann / pytorch-wavenet

An implementation of WaveNet with fast generation
MIT License

Puzzling temperature accuracy average gap #18

Open ironflood opened 6 years ago

ironflood commented 6 years ago

Hi again,

I've been playing with your code for quite some time now, and I've found a behavior I can't explain, so I was wondering if you have any idea:

Basically, I train a model for a few epochs and then use it to generate a signal. I modified generate_fast to keep the same format as the target for comparison (ints from 0 to 255).

I test two temperature sets (experiments on the same data):

temperatures_experiment_1 = [0.625, 0.65, 0.675, 0.680, 0.685, 0.690, 0.695, 0.7, 0.705, 0.710, 0.715, 0.720, 0.725, 0.75]
temperatures_experiment_2 = [0.680, 0.690]
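For context, by sampling with temperature I mean ordinary temperature-scaled softmax sampling over the 256 classes. A minimal NumPy sketch of the idea (not the repo's actual generate_fast code):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=None):
    """Sample a class index (0-255) from logits scaled by temperature.

    Lower temperature sharpens the distribution (closer to argmax),
    higher temperature flattens it (more random).
    """
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # numerically stable softmax
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```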

For both sets and for each timestep (running for thousands of cycles, each with target length = 16) I compute the R², measuring how well the prediction fits the target int sequence.
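Concretely, per window I compute the standard coefficient of determination (a self-contained sketch; sklearn's r2_score gives the same result):

```python
import numpy as np

def r2_score(target, prediction):
    """R^2 between the target int sequence (0-255) and the generated one:
    1.0 is a perfect fit, 0.0 is no better than predicting the target mean."""
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    ss_res = np.sum((target - prediction) ** 2)      # residual sum of squares
    ss_tot = np.sum((target - target.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```

The per-window scores are then averaged over the thousands of cycles for each temperature.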

Since the prediction is randomized (via temperature), I of course expected the results for temperatures 0.680 and 0.690, which are by far the best performers in experiment 1, to vary slightly from one run to another (not exactly reproducible), and I assumed the averaged R² (and other metrics such as accuracy) would be roughly the same for each temperature after thousands of iterations.

However, this isn't the case: the average metrics I track for those two temperatures are totally different in the two experiments. Temperature 0.680 remains the best temperature after thousands of iterations in experiment 1 and fits the target data best, but becomes the worst in experiment 2, on the same data.
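One thing I still want to rule out is that all temperatures in a run draw from a single shared random stream, in which case the draws a given temperature sees would depend on which other temperatures are in the list. A sketch of how I'd decouple them (hypothetical, not the repo's code):

```python
import numpy as np

def seeded_generators(temperatures, seed_base=1234):
    """One independently seeded RNG per temperature, so removing
    temperatures from the list leaves the others' draws unchanged."""
    return {t: np.random.default_rng((seed_base, i))
            for i, t in enumerate(temperatures)}

gens_full = seeded_generators([0.680, 0.685, 0.690])
gens_sub = seeded_generators([0.680, 0.685])
```

With this, the 0.680 stream is identical in both experiments regardless of which other temperatures are tested alongside it.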

Do you by chance have any idea why the long-term behavior (here, averaged over thousands of iterations) differs so much? Any pointer would be greatly appreciated :)