pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Some issues with Kaldi MFCCs features #263

Open · speechbrain opened this issue 5 years ago

speechbrain commented 5 years ago

Hi, I'm trying to do some experiments with the Kaldi-compliant MFCCs, but I ran into some possible issues:

1- When I run the following code

 import soundfile as sf
 import torch
 from torchaudio.compliance.kaldi import mfcc  # assuming the Kaldi-compliant mfcc is the one used here

 file = '/home/mirco/datasets/TIMIT/test/dr5/fnlp0/si1308.wav'
 signal, fs = sf.read(file)
 signal = torch.from_numpy(signal).unsqueeze(0).float()
 fea = mfcc(signal)
 print(fea)

The MFCCs are different every time I run the script:

run 1

tensor([[ 29.2496, -32.6150,  -7.1791,  ...,  -6.2034,  -5.8100,   3.5894],
       [ 28.1680, -35.9921,  -8.5621,  ..., -13.5980,  -4.2804,  -8.8075],
       [ 29.2831, -31.8580,  -8.8565,  ...,  -5.8166,  -4.2538,   6.4913],
       ...,
       [ 27.5078, -36.1139, -12.1319,  ..., -11.6493,   0.2557,  -4.9566],
       [ 28.9667, -33.5803,  -6.6644,  ...,  -6.1208,   2.7111,   2.7867],
       [ 28.6988, -33.6590, -12.0312,  ...,  -3.0909,  -0.0643,  -4.1769]])

run 2

tensor([[ 27.8255, -33.2356,  -8.8006,  ..., -13.2640,   1.0311,   4.8004],
       [ 29.5605, -34.0147, -10.3465,  ...,  -4.0096,  -1.5156,  -3.2499],
       [ 29.3978, -31.3415,  -6.4141,  ...,  10.6100,   2.3651,   6.1324],
       ...,
       [ 29.4321, -33.0013, -11.9812,  ...,  -2.9076,   6.3498,   1.8854],
       [ 28.2726, -34.0620,  -9.5291,  ...,  -5.4033,   6.0385,  -0.1867],
       [ 29.5408, -33.7757,  -9.1063,  ...,   5.8796,  -6.1365,  -3.2730]])

This is due to dithering, which adds a small amount of random noise. The problem can easily be solved by setting a manual seed before executing the code. To avoid surprises, users should be aware of this; it would be great to provide an example in the documentation.
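For example, something like the following should give reproducible features (a sketch, assuming the mfcc above is torchaudio.compliance.kaldi.mfcc):

import torch
import torchaudio.compliance.kaldi as kaldi

waveform = torch.randn(1, 16000)  # placeholder waveform; use the real signal here

# Fixing the RNG makes the dithering noise identical on every run
# (alternatively, pass dither=0.0 to disable dithering entirely).
torch.manual_seed(0)
fea = kaldi.mfcc(waveform)
print(fea)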

2- Even if I remove the dithering in both the Kaldi and torchaudio MFCCs, the two outputs are still very different (the options in the two cases are exactly the same):

torchaudio

> tensor([[-64.7435, -23.0893,   1.5796,  ...,  -4.9001,  -1.5039,  -2.7683],
>        [-61.5527, -17.7455,  -5.9670,  ...,   4.4663,   2.5523,  -0.9595],
>        [-58.9998, -21.4523, -10.7197,  ...,  10.2993,   9.5475,  -0.3667],
>        ...,
>        [-65.0258, -23.9535,   3.2329,  ...,   4.1740,   9.9711,  -1.2087],
>        [-65.4491, -23.4586,   3.0314,  ...,   3.4530,  -0.2666,  -3.1916],
>        [-65.9383, -23.1859,   3.6318,  ...,   2.9440,   3.5066,  -2.4232]])

kaldi

fnlp0_si1308  [
 33.93769 -26.93453 -4.314013 -9.108547 -2.538414 -7.403401 -7.393436 -19.1162 2.36114 -3.599539 -8.258158 -3.048464 -2.534939
 37.16539 -20.76378 -10.65134 -14.69143 -5.084549 -13.17811 -19.8767 -11.37231 0.9925694 3.125628 1.008414 0.7657758 -1.482104
....

3- When I put the tensor on the GPU (with .to('cuda')), I get the following error:

 log_energy = torch.max(strided_input.pow(2).sum(1), epsilon).log()  # size (m)
 RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2 'other'

How can I compute the MFCCs using cuda?

4- A good thing is that the execution time (even on a single CPU) is comparable to Kaldi's (around 15 seconds for the entire TIMIT dataset). I would expect a further speed-up on the GPU.

5- I tried feeding the MFCC coefficients into a standard speech recognizer, and the performance with the original Kaldi coefficients is still much better. Do you have the same experience?

Thank you and thanks for developing this very useful toolkit!

vincentqb commented 5 years ago

Let's start with 1. I don't see the fluctuations that you mention. I'm using this file.

import soundfile as sf
import torch
import torchaudio

filename = "sample.wav"
waveform, sample_rate = sf.read(filename)
waveform = torch.from_numpy(waveform).mean(-1).unsqueeze(0).float()

mfcc = torchaudio.transforms.MFCC()

mfcc(waveform)
mfcc(waveform)
mfcc(waveform)
mfcc(waveform)
mfcc(waveform)

I get the same output after each call to mfcc(waveform).

tensor([[[-7.3096e+02, -7.3096e+02, -7.1735e+02,  ..., -5.1697e+02,
          -5.3175e+02, -5.0290e+02],
         [ 5.2458e-06,  5.2458e-06,  1.3515e+01,  ...,  1.3600e+02,
           1.2202e+02,  1.3729e+02],
         [-7.4861e-05, -7.4861e-05,  1.3204e+01,  ...,  9.8696e+00,
           4.6198e+00,  7.2932e+00],
         ...,
         [ 8.0109e-05,  8.0109e-05, -3.1892e-01,  ..., -9.0881e+00,
          -1.0949e+01, -1.1669e+01],
         [-2.0412e-04, -2.0412e-04, -2.4082e+00,  ..., -9.9953e+00,
          -1.0371e+01, -7.6354e+00],
         [-1.8072e-04, -1.8072e-04,  1.3875e+00,  ..., -1.0247e+01,
          -1.3864e+01, -5.6211e+00]]], grad_fn=<DifferentiableGraphBackward>)
vincentqb commented 5 years ago

For 3, can you give a more complete code snippet? It looks like a variable (e.g. epsilon or strided_input) has not been moved to the GPU; see for example here and here:

strided_input = strided_input.to("cuda")
epsilon = epsilon.to("cuda")
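At the user level, a sketch of a possible workaround (assuming torchaudio.transforms.MFCC, which is an nn.Module and can be moved to the GPU together with the input, is acceptable in place of the Kaldi-compliant function):

import torch
import torchaudio

waveform = torch.randn(1, 16000)  # placeholder waveform

# Move both the transform (with its internal buffers) and the waveform to the GPU.
mfcc_transform = torchaudio.transforms.MFCC().to("cuda")
features = mfcc_transform(waveform.to("cuda"))
print(features.device)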
vincentqb commented 5 years ago

For 5, I'm assuming you mean performance = runtime. Can you provide a comparison/measurement? Maybe even a code snippet for this?
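Something along these lines would already be a useful measurement (a sketch; "sample.wav" is a placeholder file):

import time

import torchaudio

waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder file
mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate)

start = time.perf_counter()
features = mfcc(waveform)
elapsed = time.perf_counter() - start
print(f"MFCC extraction took {elapsed:.3f} s; output shape {tuple(features.shape)}")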

HsunGong commented 4 years ago

> Let's start with 1. I don't see the fluctuations that you mention. I'm using this file. [...]
>
> I get the same output after each call to mfcc(waveform).

The results from torchaudio.transforms remain the same across runs; however, the results from torchaudio.compliance.kaldi change between executions. The problem is due to the dithering described in https://github.com/pytorch/audio/issues/263#issue-488758214
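A quick way to see this (a sketch, with a placeholder 16 kHz mono waveform):

import torch
import torchaudio
import torchaudio.compliance.kaldi as kaldi

waveform = torch.randn(1, 16000)  # placeholder waveform

# torchaudio.transforms.MFCC is deterministic across calls...
transform = torchaudio.transforms.MFCC()
print(torch.allclose(transform(waveform), transform(waveform)))  # True

# ...while the Kaldi-compliant MFCCs differ whenever dithering is enabled.
print(torch.allclose(kaldi.mfcc(waveform, dither=1.0), kaldi.mfcc(waveform, dither=1.0)))  # False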

HsunGong commented 4 years ago

https://github.com/pytorch/audio/pull/228#issue-306506253 may have some problems.

HsunGong commented 4 years ago
tensor([[ 18.3451, -14.9228, -18.3650,  ...,  -6.3634,   1.8486,  -8.7933],
        [ 20.3241, -11.0113, -16.5513,  ..., -10.2232,  -2.2399, -13.0235],
        [ 22.7283,   7.1456, -32.8585,  ..., -14.6573, -22.6597,  -8.8397],
        ...,
        [ 15.3196, -18.4401,   3.7009,  ..., -20.2965,   7.2492,  -6.4407],
        [ 15.3898, -19.9645,  -5.6626,  ..., -12.0694,   5.0258, -15.5496],
        [ 14.6091, -23.0545,  -7.0441,  ..., -31.6500,  -7.9601,  -8.3047]]) torch.Size([7249, 13])
torch.Size([1, 580080]) 8000
tensor([[-18.8921, -14.9228, -18.3650,  ...,  -6.3634,   1.8486,  -8.7933],
        [ -8.8658, -11.0113, -16.5513,  ..., -10.2232,  -2.2399, -13.0235],
        [ -2.7176,   7.1456, -32.8585,  ..., -14.6574, -22.6597,  -8.8396],
        ...,
        [-33.5852, -18.4401,   3.7009,  ..., -20.2965,   7.2492,  -6.4408],
        [-33.6183, -19.9645,  -5.6626,  ..., -12.0694,   5.0258, -15.5496],
        [-36.0048, -23.0545,  -7.0441,  ..., -31.6500,  -7.9601,  -8.3048]]) torch.Size([7249, 13])
Finding: with dither == 0., the results are fixed across runs.
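For example, something like this should be deterministic (a sketch with a placeholder waveform, assuming torchaudio.compliance.kaldi.mfcc):

import torch
import torchaudio.compliance.kaldi as kaldi

waveform = torch.randn(1, 16000)  # placeholder waveform

# With dithering disabled, repeated calls give identical features.
fea1 = kaldi.mfcc(waveform, dither=0.0)
fea2 = kaldi.mfcc(waveform, dither=0.0)
print(torch.allclose(fea1, fea2))  # True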
