pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Lazy initialization of MelScale.fb throws when loading module #245

Closed · zh217 closed this 5 years ago

zh217 commented 5 years ago

Lazy initialization refers to the declaration here and the initialization here. When loading a saved model (saved with torch.save()), this causes a problem:

size mismatch for reduction.mel_scale.fb: copying a param with shape torch.Size([41, 16]) from checkpoint, the shape in current model is torch.Size([0]).

The problem can be sidestepped by running the model at least once before loading weights. But I'd rather not have to rely on such tricks. Concretely, the workaround looks something like this (the checkpoint path is a placeholder):
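import torch
from torchaudio.transforms import MelSpectrogram

model = MelSpectrogram()
# Dummy forward pass: forces the lazy fb buffer to take its real shape.
# fb's shape comes from the transform's own parameters (n_fft, n_mels),
# so any reasonably sized dummy input works here.
model(torch.randn(1, 1000))
# Now the buffer shapes agree and loading succeeds ("checkpoint.pt" is a placeholder).
model.load_state_dict(torch.load("checkpoint.pt"))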

Why is this being lazily initialized anyway?

jamarshon commented 5 years ago

@zh217 can we have a small code snippet of the load/save?

zh217 commented 5 years ago

Sure.

import torch
from torchaudio.transforms import MelSpectrogram

ms = MelSpectrogram()
ms(torch.randn(1, 1000))

ms2 = MelSpectrogram()
ms2.load_state_dict(ms.state_dict())

jamarshon commented 5 years ago

Thanks for the snippet 😄! Yes, I am not sure why it is that way: https://github.com/pytorch/audio/pull/72/files#diff-a3428ec41790ed12faea36fcc6dd8f06R228 The PR says they did not want to reconstruct the fb matrix every time __call__ is used, so they initialize it either in __init__ or on the first input.

We can remove lazy initialization by making n_stft required, so that the fb matrix is always created in __init__ (the user must know the dimensions of their matrix). Then __call__ just checks that the input dimensions match expectations and does a matrix multiplication. A rough sketch of what that could look like (a hypothetical EagerMelScale; the filterbank math is hand-rolled here only for illustration, where the real code would reuse torchaudio's existing helper):
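import math

import torch
from torch import nn


def _hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * math.log10(1.0 + f / 700.0)


class EagerMelScale(nn.Module):
    """Hypothetical MelScale variant: n_stft is required, so fb is built
    eagerly in __init__ and its state_dict shape is fixed from the start."""

    def __init__(self, n_mels=128, sample_rate=16000, f_min=0.0, f_max=None, *, n_stft):
        super().__init__()
        f_max = f_max if f_max is not None else sample_rate / 2
        # Filter edge frequencies, evenly spaced on the mel scale.
        m_pts = torch.linspace(_hz_to_mel(f_min), _hz_to_mel(f_max), n_mels + 2)
        f_pts = 700.0 * (10.0 ** (m_pts / 2595.0) - 1.0)  # mel -> Hz
        freqs = torch.linspace(0, sample_rate / 2, n_stft)
        # Triangular filters: rising and falling slopes around each center.
        slopes = f_pts.unsqueeze(0) - freqs.unsqueeze(1)        # (n_stft, n_mels + 2)
        down = -slopes[:, :-2] / (f_pts[1:-1] - f_pts[:-2])     # rising edges
        up = slopes[:, 2:] / (f_pts[2:] - f_pts[1:-1])          # falling edges
        fb = torch.clamp(torch.min(down, up), min=0.0)          # (n_stft, n_mels)
        self.register_buffer("fb", fb)

    def forward(self, specgram):
        # specgram: (channel, n_stft, time) -> (channel, n_mels, time)
        if specgram.size(-2) != self.fb.size(0):
            raise ValueError(f"expected {self.fb.size(0)} frequency bins, got {specgram.size(-2)}")
        return torch.matmul(specgram.transpose(-1, -2), self.fb).transpose(-1, -2)

Because fb has its full (n_stft, n_mels) shape right after construction, load_state_dict works without a warm-up forward pass.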

Thoughts? @zh217 @dhpollack @vincentqb @cpuhrsch

zh217 commented 5 years ago

Making n_fft required sounds reasonable to me. I can't think of a use case where I need n_fft to be dynamically determined at runtime. Even if there were such a use case, the current implementation doesn't work with changing n_fft anyway.

jamarshon commented 5 years ago

MelSpectrogram could derive n_stft from n_fft (https://github.com/pytorch/audio/blob/master/torchaudio/transforms.py#L202):

self.mel_scale = MelScale(self.n_mels, self.sample_rate, self.f_min,  
                          self.f_max, n_stft=self.n_fft // 2 + 1)

which would make your code snippet run correctly:

import torch
from torchaudio.transforms import MelSpectrogram

ms = MelSpectrogram()
ms(torch.randn(1, 1000))

ms2 = MelSpectrogram()
ms2.load_state_dict(ms.state_dict())
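With n_stft derived in __init__, fb already has its final shape before any forward pass; a quick check, assuming torchaudio's defaults of n_fft=400 and n_mels=128:

from torchaudio.transforms import MelSpectrogram

ms = MelSpectrogram()    # n_stft = 400 // 2 + 1 = 201 with the default n_fft
ms2 = MelSpectrogram()

# Both filterbanks are created at construction time with matching shapes.
assert ms.mel_scale.fb.shape == (201, 128)
assert ms2.mel_scale.fb.shape == (201, 128)

ms2.load_state_dict(ms.state_dict())  # no size mismatch, no warm-up call needed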

We could also keep n_stft optional for loading/saving by specifying the shape of the destination module (ms2) for MelScale:

import torch
from torchaudio.transforms import MelScale

ms = MelScale()
ms(torch.randn(1, 1000, 100))

ms2 = MelScale(n_stft=1000)
ms2.load_state_dict(ms.state_dict())

vincentqb commented 5 years ago

The lazy evaluation means that we do not have to specify any dimensions when defining MelScale, and they are then inferred at evaluation time from the input, is that right? That seems like a nice feature to have for MelScale.

However, for MelSpectrogram we already know the size of the matrix at initialization, so we could infer the size at that time. Is that correct? If so, we could make that the default behavior, since waiting does not give us more information.

jamarshon commented 5 years ago

Going to close this: as of #246, we can now load from a state_dict for MelSpectrogram with the code snippet in https://github.com/pytorch/audio/issues/245#issuecomment-522602035

Also, we can load from a state_dict for MelScale (see test_melscale_load_save and test_melspectrogram_load_save).

We concluded that inferring the matrix size at runtime for MelScale (i.e. lazy initialization) is a nice property to have as long as it does not impede other use cases. Feel free to reopen if we did not address anything.