hangzhaomit / Sound-of-Pixels

Codebase for ECCV18 "The Sound of Pixels"
http://sound-of-pixels.csail.mit.edu
MIT License

Why does the model not train? #10

Open avis-ma opened 5 years ago

avis-ma commented 5 years ago

Hello, I am a Chinese student. I have pre-processed the dataset and used train_MUSIC.sh to train the default model, but the result is not what I expected: the metrics are all 0. Even when I run eval_MUSIC.sh directly with the downloaded pretrained model, I still get 0 for the metrics (SDR, SIR, etc.). I have not changed any of the code from this GitHub repo. How can I find out what the problem is?

ngmq commented 4 years ago

I am getting the same result: all metrics are 0. @avis-ma Did you solve this?

EDIT: This bug (if it is indeed a bug; it would be great if the authors could confirm) seems to come from line 129 of dataset/base.py:

audio_raw *= (2.0**-31)

According to the authors' comment, this line is supposed to normalize the output of torchaudio.load() to the range [-1, 1]. However, torchaudio.load() already performs this normalization itself (see https://pytorch.org/audio/#torchaudio.load). As a result, line 129 effectively turns everything in audio_raw into zeros.
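A quick way to see this (a minimal sketch; the file path is only a placeholder for any clip from the dataset):

# torchaudio.load() already returns float audio in [-1, 1], so the extra
# 2**-31 scaling collapses it to (numerically) silence.
import torchaudio

audio_raw, rate = torchaudio.load('./data/audio/sample.mp3')  # placeholder path
print(audio_raw.abs().max())     # typically on the order of 0.1 to 1.0
audio_scaled = audio_raw * (2.0 ** -31)
print(audio_scaled.abs().max())  # ~1e-10, i.e. effectively all zeros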

This all-zero audio_raw makes all the metrics zero as well: calc_metrics() in main.py (line 150) checks whether the ground-truth audio is all zeros, and if it is, no metric calculation is carried out. Since the audio loaded from the dataset is always all zeros, this check is always true and the metrics are never computed.
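For reference, the guard looks roughly like this (a sketch of the logic described above, not the exact code in main.py; the variable names and the mir_eval call are assumptions):

# Sketch of the all-zero check in calc_metrics(): if the ground-truth audio is
# silent, the BSS metrics are never computed and the summary stays at 0.
import numpy as np
from mir_eval.separation import bss_eval_sources

def calc_metrics_sketch(gt_wavs, pred_wavs):
    sdr_list, sir_list, sar_list = [], [], []
    gts, preds = np.asarray(gt_wavs), np.asarray(pred_wavs)
    if np.abs(gts).sum() > 1e-5 and np.abs(preds).sum() > 1e-5:
        sdr, sir, sar, _ = bss_eval_sources(gts, preds, compute_permutation=False)
        sdr_list += sdr.tolist()
        sir_list += sir.tolist()
        sar_list += sar.tolist()
    return sdr_list, sir_list, sar_list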

The fix is simple: comment out line 129 in file dataset/base.py. After doing so I got something like this:

[Eval Summary] Epoch: 0, Loss: 0.2974, SDR_mixture: 1.4887, SDR: 3.9951, SIR: 9.2085, SAR: 10.6352
Plotting html for visualization...
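If you prefer not to delete the line outright, a more defensive variant is to rescale only when the loader actually returns integer PCM (a sketch under that assumption, not the repo's actual code):

# Only rescale if the audio is still 32-bit integer PCM; recent torchaudio
# already returns floats in [-1, 1], in which case nothing is changed.
import torch

def normalize_audio(audio_raw: torch.Tensor) -> torch.Tensor:
    if not audio_raw.is_floating_point():
        audio_raw = audio_raw.float() * (2.0 ** -31)
    return audio_raw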
zjsong commented 4 years ago


Thanks @ngmq, the data scale really matters. However, training still does not converge for me: after 2 epochs the loss hovers around some value (about 0.20), and the two predicted masks look nearly identical. Did you encounter this problem?

ngmq commented 4 years ago

@zjsong That did not happen to me; my training went fine. Maybe checking the input data would help?

zjsong commented 4 years ago

@ngmq Thanks for your reply. I just found that if training runs long enough (e.g., more than 25 epochs), it does show promising results.