pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Problems with Kaldi MFCCs #328

Open mravanelli opened 4 years ago

mravanelli commented 4 years ago

Hi, thank you very much for this very useful project.

I started doing some speech recognition experiments with the MFCC features implemented in torchaudio. In particular, I tried the librosa-style ones implemented in torchaudio/transforms.py and the Kaldi-compliant ones implemented in torchaudio/compliance/kaldi.py.

mfcc_original
array([35.84189 , 39.748493, 35.40782 , 33.237488, 34.53969 , 35.40782 ,
       34.973755, 35.40782 , 35.40782 , 35.84189 ], dtype=float32)

mfcc_torch
tensor([29.3794, 29.1657, 28.7020, 27.4892, 29.1944, 27.8915, 29.3321, 28.8958, 28.4197,
29.0967])
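
For reference, a comparison of the librosa MFCCs against the ones from torchaudio/transforms.py might be set up roughly like the sketch below; 'speech.wav' and the framing parameters are only illustrative, not the exact setup behind the numbers above.

import librosa
import torchaudio

# Placeholder for the recording used above.
wav, rate = torchaudio.load('speech.wav')

# Use the same framing parameters on both sides so that only the MFCC
# definitions themselves differ.
n_fft, hop_length = 2048, 512

mfcc_librosa = librosa.feature.mfcc(y=wav[0].numpy(), sr=rate, n_mfcc=13,
                                    n_fft=n_fft, hop_length=hop_length)

mfcc_transform = torchaudio.transforms.MFCC(
    sample_rate=rate, n_mfcc=13,
    melkwargs={'n_fft': n_fft, 'hop_length': hop_length})
mfcc_torch = mfcc_transform(wav)

# Compare the first frame of each; any remaining gap presumably comes from
# the mel filterbank and log/dB conventions rather than from the framing.
print(mfcc_librosa[:, 0])
print(mfcc_torch[0, :, 0])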

The other issue is that the current version doesn't support CUDA and can only process up to two channels at a time. Also, the current version is significantly slower than the librosa implementation (there could be a bottleneck somewhere).

Any ideas? I hope this feedback is helpful.

Thank you

Mirco

vincentqb commented 4 years ago

If I understand correctly, you are comparing four versions of MFCC:

  1. mfcc from torchaudio/transforms.py
  2. mfcc from librosa
  3. mfcc from torchaudio/compliance/kaldi.py
  4. mfcc from kaldi

You are saying that these give different results?

Do you have a minimal code example I could take a look at?

HsunGong commented 4 years ago

Linking to a similar question: https://github.com/pytorch/audio/issues/263#issue-488758214

HsunGong commented 4 years ago

Do you have a minimal code example I could take a look at?

Here is my example of 3 and 4:

import torch, numpy, random
import torchaudio

# Make the comparison reproducible.
random.seed(0)
numpy.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)

# Kaldi features, computed with:
# compute-mfcc-feats --verbose=2 --sample-frequency=8000  scp:data/wav.scp ark:- | copy-feats ark:- ark,scp:data/feats.ark,data/feats.scp
d = {utt: feats for utt, feats in torchaudio.kaldi_io.read_mat_scp('data/feats.scp')}
kaldi_feats = d['iaaa']
print(kaldi_feats, kaldi_feats.shape)

wav, rate = torchaudio.load('data/wav/iaaa.wav')
print(wav.shape, rate)

# torchaudio's Kaldi-compliant MFCCs, computed twice on the same waveform.
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
tensor([[ 18.3451, -14.9193, -18.3694,  ...,  -6.3691,   1.8752,  -8.8333],
        [ 20.3241, -11.0107, -16.5517,  ..., -10.2303,  -2.2465, -13.0228],
        [ 22.7282,   7.1452, -32.8558,  ..., -14.6897, -22.6369,  -8.8484],
        ...,
        [ 15.3191, -18.4647,   3.6274,  ..., -20.1052,   7.0780,  -6.4834],
        [ 15.3900, -19.9616,  -5.4611,  ..., -12.0642,   4.8870, -15.5243],
        [ 14.6114, -23.1458,  -7.1615,  ..., -31.9867,  -8.1553,  -8.3250]]) torch.Size([7249, 13])
torch.Size([1, 580080]) 8000
tensor([[ 25.4531, -28.9004,  -9.2195,  ...,   9.3991,   6.6678,  -0.3100],
        [ 23.8064, -26.9535,  -8.3300,  ...,  -8.8944,  -4.4637,   8.1744],
        [ 25.2671, -25.2465,  -9.5173,  ...,   1.7179,   5.4729,  -7.5934],
        ...,
        [ 23.7336, -30.1332,  -8.1190,  ..., -13.0223,  -1.7747,   5.4382],
        [ 24.7677, -29.2519,  -9.6620,  ...,  -0.6424,  -4.6334,  -8.3185],
        [ 23.9241, -31.0664,  -8.8748,  ...,  -3.3450,   2.4832,   3.8635]]) torch.Size([7249, 13])
tensor([[ 24.8688, -29.7277, -10.0829,  ...,   0.2335,  -5.1891, -10.5182],
        [ 23.7165, -29.3053, -11.0154,  ...,   8.8459,   4.9695,   1.7033],
        [ 24.3918, -30.2426, -17.2043,  ...,  -1.0753,  -3.7638,   1.8900],
        ...,
        [ 23.8795, -28.2105, -13.3643,  ...,  -0.4222,  -6.8063,   3.2779],
        [ 25.2789, -27.4087,  -4.5631,  ...,  -3.4745,   8.7959,   4.0152],
        [ 25.1426, -30.7162, -10.8394,  ..., -19.6604,  -1.2420,   2.3714]]) torch.Size([7249, 13])

Kaldi gives tensor 1, torchaudio.compliance.kaldi gives tensor 2, and torchaudio.compliance.kaldi called again gives tensor 3.

All three are different.

vincentqb commented 4 years ago
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)

Have you tried setting dither=0.0 in the mfcc call? See #371.
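
For instance, with dithering disabled the two calls above should return identical tensors. A minimal sketch, reusing the paths from the snippet above:

import torch
import torchaudio

wav, rate = torchaudio.load('data/wav/iaaa.wav')

# With dither=0.0 no random noise is added, so repeated calls on the same
# waveform should be deterministic.
feats_a = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate, dither=0.0)
feats_b = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate, dither=0.0)
print(torch.equal(feats_a, feats_b))  # expected: True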

pablomainar commented 4 years ago

Hi, I also have problems with the MFCCs. When I compare the MFCCs generated by Kaldi with the ones generated by PyTorch, I get very similar results for all the coefficients except the first one. For some reason, there is a difference of about 100 between the Kaldi and PyTorch first coefficient. Still, the pattern is the same, so I suspect that there is some kind of energy normalization going on.

For reproducibility, in Kaldi I use the s5 recipe of the librispeech example and the MFCCs generated at stage 6. The mfcc.conf file only has the use-energy flag set to false; all the other parameters of compute-mfcc-feats are default. The audio in the plots below is 1089-134686-0000.flac from the test_clean set.

To compare the features, I use kaldiio to convert the Kaldi features into numpy:

import kaldiio
import numpy as np

path_feat = 'ark:raw_mfcc_test_clean.1.ark'
with kaldiio.ReadHelper(path_feat) as reader:
    # Take only the first utterance in the archive.
    for key, kaldi_feats in reader:
        break
    kaldi_feats = np.transpose(kaldi_feats)

For pytorch's features:

import torchaudio

path_audio = 'LibriSpeech/test-clean/1089/134686/1089-134686-0000.flac'
audio_tensor, sr = torchaudio.load(path_audio)
# Dithering disabled so the features are deterministic (see #157).
torch_feats = torchaudio.compliance.kaldi.mfcc(waveform=audio_tensor, dither=0)
torch_feats = np.transpose(torch_feats.numpy())

I set dither to 0 according to #157. If dither is a low number (0.1), the features are still very similar, but if dither is 1 they are completely different; that is a separate problem, though.

(Plots comparing kaldi_feats and the PyTorch features.)

Why do I get this difference in the first coefficient? Is there some normalization step done inside PyTorch's compliance library that is not done in Kaldi?
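
For what it's worth, one quick way to check that only the first coefficient is affected is to look at the per-coefficient difference directly. A sketch using the kaldi_feats and torch_feats arrays from the snippets above (assuming the frame counts line up):

import numpy as np

# Both arrays are (num_ceps, num_frames) after the transposes above.
diff = kaldi_feats - torch_feats
print(np.abs(diff).mean(axis=1))      # mean absolute difference per coefficient
print(diff[0].mean(), diff[0].std())  # a small std suggests a constant offset on C0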

pablomainar commented 4 years ago

Update: I have gone back to the spectrogram level trying to find the bug. If I set the flag subtract_mean to true in both Kaldi and PyTorch, the resulting spectrograms are (almost) the same. But if I set it to false (which is the default), the results are different: they have the same pattern but the mean is different.

Kaldi code to generate spectrograms with mean subtraction: ~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=true --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark

Pytorch code to generate spectrograms with mean subtraction: torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor,dither=0,energy_floor=1.0,subtract_mean=True)

Result: (plots of kaldi_feats and torch_feats)

Kaldi code to generate spectrograms without mean subtraction: ~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=false --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark

Pytorch code to generate spectrograms without mean subtraction: torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor,dither=0,energy_floor=1.0,subtract_mean=False)

Result: (plots of kaldi_feats_nonsub and torch_feats_nonsub)

I suspect that there is something in the FFT computation that is normalized in one implementation but not in the other. Any thoughts?
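
If so, the gap between the two non-mean-subtracted spectrograms should be (nearly) constant over time in every bin, which would also explain why subtract_mean=True hides it. A rough check, with kaldi_spec and torch_spec as hypothetical names for the two feature matrices (num_frames x num_bins) loaded the same way as the MFCCs above:

import numpy as np

# kaldi_spec, torch_spec: the Kaldi and torchaudio spectrograms as numpy arrays.
diff = kaldi_spec - torch_spec
print(diff.mean(axis=0))  # per-bin offset between the two implementations
print(diff.std(axis=0))   # near-zero values => an offset that mean subtraction removes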

nmfisher commented 3 years ago

VAD_demo.zip

Also, torchaudio.compliance.kaldi.mfcc doesn't produce exactly the same output as compute-mfcc-feats when htk_compat is False.

When true:

import torchaudio
import torch
import numpy as np
import librosa

wave_file = 'VAD_demo.wav'

audio, sample_rate = librosa.load(wave_file, sr=16000)

torchaudio.compliance.kaldi.mfcc(
    waveform=torch.Tensor(audio).unsqueeze(0),
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=True,
    snip_edges=False)

tensor([[ 1.5239e+01,  4.0710e+00, -9.1076e+00,  ..., -1.3214e+00,
         -3.4604e+00, -7.5218e+01],
        [ 1.5456e+01,  3.7524e+00, -9.5476e+00,  ..., -7.0259e-01,
         -2.7650e+00, -7.4910e+01],
        [ 1.5704e+01,  3.8294e+00, -9.3671e+00,  ..., -1.4487e-01,
         -1.7066e+00, -7.4704e+01],
        ...,
        [ 8.0927e+00, -7.5448e+00, -8.8725e+00,  ...,  7.1561e-01,
         -1.3807e+00, -6.7251e+01],
        [ 8.0175e+00, -8.2607e+00, -1.1727e+01,  ...,  2.2736e-02,
         -2.0510e+00, -6.9318e+01],
        [ 7.5882e+00, -8.9392e+00, -1.3061e+01,  ..., -9.1568e-01,
         -2.5486e+00, -7.0623e+01]])

and

compute-mfcc-feats --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64 --snip-edges=false --htk-compat=true --dither=0 --energy-floor=1 scp:wav.scp ark,t:-
WARNING (compute-mfcc-feats[5.5.689~5-2f3d]:Read():wave-reader.cc:260) Expected 95516 bytes in RIFF chunk, but after first data block there will be 36 + 95280 bytes (we do not support reading multiple data chunks).
wav  [
  15.23852 4.070992 -9.107697 9.433752 -9.323326 15.62674 -9.407998 0.7214716 0.9599227 -5.44089 -11.26766 -6.221908 1.175821 -0.5180629 -2.197447 -8.726407 -1.848569 3.453872 4.1832 -2.637658 -0.2607065 -0.1641882 -0.0722182 -1.466432 -0.9918076 -0.1467309 4.546407 4.537716 -1.037911 6.192455 3.814458 3.485815 13.67851 7.93261 0.2159225 -2.56819 -12.5645 -7.210853 1.150025 0.793225 -0.577466 -1.334647 -0.0593026 0.081699 -0.9379031 3.676176 1.271123 2.297236 5.066513 0.2240368 -0.7410111 -2.629601 3.790086 9.951916 4.806147 3.494368 1.054438 6.064842 1.139308 -4.126108 0.1679086 -1.321496 -3.460518 17.38356
  15.45577 3.752458 -9.547602 10.18087 -9.88015 15.14798 -10.30561 0.4286905 -1.160449 -5.356637 -13.52929 -6.287968 -0.5804175 0.5506579 -1.780558 -5.887392 -1.011132 3.633945 3.231638 -3.511573 -0.1174743 -0.03293979 -0.09964671 -1.642807 -0.7558402 -0.1005801 3.603968 4.117401 -1.777455 4.002991 2.848278 4.0697 13.47057 5.628294 -0.7281701 -2.973053 -10.5222 -5.910959 1.049334 -0.4542561 -1.303142 -0.8375537 -0.03831673 0.02188638 -0.7784259 2.942248 0.9490441 2.425136 4.28632 -0.1520254 -0.2199809 -2.220483 1.599042 6.40525 2.825276 4.396042 3.431517 6.499466 1.625714 -3.128053 1.206802 -0.7026786 -2.765082 17.25393
...

they match. But with htk_compat set to false:

torchaudio.compliance.kaldi.mfcc(
    waveform=torch.Tensor(audio).unsqueeze(0),
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=False,
    dither=0,
    energy_floor=1,
    snip_edges=False)
tensor([[-5.3187e+01,  1.5239e+01,  4.0710e+00,  ...,  1.6768e-01,
         -1.3214e+00, -3.4604e+00],
        [-5.2969e+01,  1.5456e+01,  3.7524e+00,  ...,  1.2066e+00,
         -7.0259e-01, -2.7650e+00],
        [-5.2823e+01,  1.5704e+01,  3.8294e+00,  ...,  2.3663e+00,
         -1.4487e-01, -1.7066e+00],
        ...,
        [-4.7554e+01,  8.0927e+00, -7.5448e+00,  ..., -6.1891e+00,
          7.1561e-01, -1.3807e+00],
        [-4.9015e+01,  8.0175e+00, -8.2607e+00,  ..., -6.8149e+00,
          2.2736e-02, -2.0510e+00],
        [-4.9938e+01,  7.5882e+00, -8.9392e+00,  ..., -7.7555e+00,
         -9.1568e-01, -2.5486e+00]])
compute-mfcc-feats --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64 --snip-edges=false --htk-compat=false --dither=0 --energy-floor=1 scp:wav.scp ark,t:-
WARNING (compute-mfcc-feats[5.5.689~5-2f3d]:Read():wave-reader.cc:260) Expected 95516 bytes in RIFF chunk, but after first data block there will be 36 + 95280 bytes (we do not support reading multiple data chunks).
wav  [
  17.38356 15.23852 4.070992 -9.107697 9.433752 -9.323326 15.62674 -9.407998 0.7214716 0.9599227 -5.44089 -11.26766 -6.221908 1.175821 -0.5180629 -2.197447 -8.726407 -1.848569 3.453872 4.1832 -2.637658 -0.2607065 -0.1641882 -0.0722182 -1.466432 -0.9918076 -0.1467309 4.546407 4.537716 -1.037911 6.192455 3.814458 3.485815 13.67851 7.93261 0.2159225 -2.56819 -12.5645 -7.210853 1.150025 0.793225 -0.577466 -1.334647 -0.0593026 0.081699 -0.9379031 3.676176 1.271123 2.297236 5.066513 0.2240368 -0.7410111 -2.629601 3.790086 9.951916 4.806147 3.494368 1.054438 6.064842 1.139308 -4.126108 0.1679086 -1.321496 -3.460518
  17.25393 15.45577 3.752458 -9.547602 10.18087 -9.88015 15.14798 -10.30561 0.4286905 -1.160449 -5.356637 -13.52929 -6.287968 -0.5804175 0.5506579 -1.780558 -5.887392 -1.011132 3.633945 3.231638 -3.511573 -0.1174743 -0.03293979 -0.09964671 -1.642807 -0.7558402 -0.1005801 3.603968 4.117401 -1.777455 4.002991 2.848278 4.0697 13.47057 5.628294 -0.7281701 -2.973053 -10.5222 -5.910959 1.049334 -0.4542561 -1.303142 -0.8375537 -0.03831673 0.02188638 -0.7784259 2.942248 0.9490441 2.425136 4.28632 -0.1520254 -0.2199809 -2.220483 1.599042 6.40525 2.825276 4.396042 3.431517 6.499466 1.625714 -3.128053 1.206802 -0.7026786 -2.765082
...

Also, I assume the default dither and energy_floor parameters don't match either; if they're not explicitly set to 0 and 1 respectively, the results also differ.
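
To rule that out, the torchaudio call can pin those parameters to the same values as the compute-mfcc-feats flags above. A sketch reusing the VAD_demo.wav input:

import librosa
import torch
import torchaudio

audio, sample_rate = librosa.load('VAD_demo.wav', sr=16000)

# dither and energy_floor set explicitly to mirror --dither=0 --energy-floor=1,
# so the torchaudio defaults cannot silently differ from the Kaldi command.
feats = torchaudio.compliance.kaldi.mfcc(
    waveform=torch.Tensor(audio).unsqueeze(0),
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=True,
    snip_edges=False,
    dither=0.0,
    energy_floor=1.0)
print(feats.shape)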