pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

torchaudio.compliance.kaldi.fbank dither is different from kaldi #2634

Closed. yzwu2017 closed this issue 2 years ago.

yzwu2017 commented 2 years ago

🐛 Describe the bug

for the function "torchaudio.compliance.kaldi.fbank", there is an option "dither". The function calls _get_window() function, where dither leads to adding random number in strided_input: image Since "x.log()" and "x" here are based on the same random number, it makes "rand_gauss" not gaussian distributed: use_same_randuniform

However, in the kaldi source code, the random number is gaussian distributed. image
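For reference, here is a minimal sketch of the pattern described above (the exact torchaudio internals may differ slightly; epsilon and the shapes are placeholders):

import math
import torch

def dither_noise_buggy(shape, epsilon=1e-10):
    # A single uniform tensor x is drawn and reused for both Box-Muller
    # factors. A correct Box-Muller transform needs two independent
    # uniforms, so this product is not standard-normal distributed.
    x = torch.max(torch.tensor(epsilon), torch.rand(shape))
    return torch.sqrt(-2.0 * x.log()) * torch.cos(2 * math.pi * x)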

Versions

v0.12.1 (stable release)

ravicodelabs commented 2 years ago

I can try to submit a PR for this, if it's still open.

Also, @yzwu2017, I'm assuming RandGauss here is a standard normal? A link to the Kaldi source file referenced here would be helpful as well.

ravicodelabs commented 2 years ago

Update: It looks like RandGauss here corresponds to the following, so I believe it should be a standard normal.

// Box-Muller transform: two independent uniform draws produce one
// sample from a standard normal distribution.
inline float RandGauss(struct RandomState* state = NULL) {
  return static_cast<float>(sqrtf(-2 * Log(RandUniform(state)))
                            * cosf(2 * M_PI * RandUniform(state)));
}

source: https://github.com/kaldi-asr/kaldi/blob/cbed4ff688a172a7f765493d24771c1bd57dcd20/src/base/kaldi-math.h

Also, for context, it looks like the previously referenced Dither code is from here:

mthrok commented 2 years ago

It looks like the distribution is skewed due to torch.max(epsilon, ...). How about using torch.randn directly?
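A minimal sketch of that suggestion (names are placeholders, not the exact torchaudio source):

import torch

def dither_noise(shape, dither, device=None):
    # Draw standard-normal noise directly and scale by the dither amount;
    # no Box-Muller transform (and hence no epsilon clamp) is needed.
    return torch.randn(shape, device=device) * dither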

ravicodelabs commented 2 years ago

I see the bug now, nice catch @yzwu2017!

I agree that using torch.randn would be a good solution: it should give results matching Kaldi and is also easier to understand. I can submit a PR for this.

As a side note: I don't think the distribution would be skewed, though. I believe the max(epsilon, ...) was put there to avoid generating a uniform variate that is exactly 0, which would cause a problem when the logarithm is taken. Of course, that concern goes away if we use torch.randn instead of the Box-Muller transform, which is another point in favor of that solution.
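For comparison, a correct Box-Muller sample would use two independent uniforms and clamp only the one passed to the logarithm (a sketch, not the proposed patch):

import math
import torch

def box_muller(shape, epsilon=1e-10):
    # u1 drives the radius term, u2 the angle term; they must be independent.
    u1 = torch.rand(shape).clamp(min=epsilon)  # avoid log(0)
    u2 = torch.rand(shape)
    return torch.sqrt(-2.0 * u1.log()) * torch.cos(2 * math.pi * u2)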