mravanelli opened this issue 5 years ago
If I understand correctly, you are comparing four versions of MFCC, among them:
torchaudio/transforms.py
torchaudio/compliance/kaldi.py
You are saying that …
Do you have minimal code I could take a look at? A similar question was raised in https://github.com/pytorch/audio/issues/263#issue-488758214.
Here is my example of 3 and 4:
import torchaudio
import torch, numpy, random
random.seed(0)
numpy.random.seed(0)
torch.manual_seed(0)
torch.cuda.manual_seed(0)
torch.cuda.manual_seed_all(0)
# compute-mfcc-feats --verbose=2 --sample-frequency=8000 scp:data/wav.scp ark:- | copy-feats ark:- ark,scp:data/feats.ark,data/feats.scp
d = { u:d for u,d in torchaudio.kaldi_io.read_mat_scp('data/feats.scp') }
kaldi_feats = d['iaaa']
print(kaldi_feats, kaldi_feats.shape)
wav, rate = torchaudio.load('data/wav/iaaa.wav')
print(wav.shape, rate)
# Compute the torchaudio features twice to check whether the result is deterministic.
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
tensor([[ 18.3451, -14.9193, -18.3694, ..., -6.3691, 1.8752, -8.8333],
[ 20.3241, -11.0107, -16.5517, ..., -10.2303, -2.2465, -13.0228],
[ 22.7282, 7.1452, -32.8558, ..., -14.6897, -22.6369, -8.8484],
...,
[ 15.3191, -18.4647, 3.6274, ..., -20.1052, 7.0780, -6.4834],
[ 15.3900, -19.9616, -5.4611, ..., -12.0642, 4.8870, -15.5243],
[ 14.6114, -23.1458, -7.1615, ..., -31.9867, -8.1553, -8.3250]]) torch.Size([7249, 13])
torch.Size([1, 580080]) 8000
tensor([[ 25.4531, -28.9004, -9.2195, ..., 9.3991, 6.6678, -0.3100],
[ 23.8064, -26.9535, -8.3300, ..., -8.8944, -4.4637, 8.1744],
[ 25.2671, -25.2465, -9.5173, ..., 1.7179, 5.4729, -7.5934],
...,
[ 23.7336, -30.1332, -8.1190, ..., -13.0223, -1.7747, 5.4382],
[ 24.7677, -29.2519, -9.6620, ..., -0.6424, -4.6334, -8.3185],
[ 23.9241, -31.0664, -8.8748, ..., -3.3450, 2.4832, 3.8635]]) torch.Size([7249, 13])
tensor([[ 24.8688, -29.7277, -10.0829, ..., 0.2335, -5.1891, -10.5182],
[ 23.7165, -29.3053, -11.0154, ..., 8.8459, 4.9695, 1.7033],
[ 24.3918, -30.2426, -17.2043, ..., -1.0753, -3.7638, 1.8900],
...,
[ 23.8795, -28.2105, -13.3643, ..., -0.4222, -6.8063, 3.2779],
[ 25.2789, -27.4087, -4.5631, ..., -3.4745, 8.7959, 4.0152],
[ 25.1426, -30.7162, -10.8394, ..., -19.6604, -1.2420, 2.3714]]) torch.Size([7249, 13])
The Kaldi features are tensor 1, the torchaudio.compliance.kaldi features are tensor 2, and the second torchaudio.compliance.kaldi call gives tensor 3. All three are different.
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
torch_feats = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate)
print(torch_feats, torch_feats.shape)
Have you tried setting dither=0.0 in mfcc's call? See #371.
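For what it's worth, a minimal sketch of the same double call with dithering disabled (this just combines the snippet above with the dither suggestion; the file path and sample rate are taken from the earlier example):

import torch
import torchaudio

wav, rate = torchaudio.load('data/wav/iaaa.wav')

# Dithering adds random noise to each frame, so disabling it makes the
# computation deterministic across calls.
feats_a = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate, dither=0.0)
feats_b = torchaudio.compliance.kaldi.mfcc(wav, sample_frequency=rate, dither=0.0)

# The two torchaudio results should now be identical, so any remaining
# difference is against the Kaldi features themselves.
print(torch.allclose(feats_a, feats_b))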
Hi, I also have problems with the MFCCs. When I compare the MFCCs generated by Kaldi with the ones generated by PyTorch, I get very similar results for all the coefficients except for the first one. For some reason, there is a difference of about 100 between the Kaldi and PyTorch first coefficients. Still, the pattern is the same, so I suspect that some kind of energy normalization is going on.
For reproducibility, in Kaldi I use the s5 recipe of the librispeech example and take the MFCCs generated at stage 6. The mfcc.conf file only has the use-energy flag set to false; all the other parameters of compute-mfcc-feats are left at their defaults. The audio in the images below is 1089-134686-0000.flac from the test-clean set.
To compare the features I use kaldiio to convert kaldi features into numpy:
import kaldiio
import numpy as np

path_feat = 'ark:raw_mfcc_test_clean.1.ark'
with kaldiio.ReadHelper(path_feat) as reader:
    for key, kaldi_feats in reader:
        break
kaldi_feats = np.transpose(kaldi_feats)
For pytorch's features:
import torchaudio

path_audio = 'LibriSpeech/test-clean/1089/134686/1089-134686-0000.flac'
audio_tensor, sr = torchaudio.load(path_audio)
torch_feats = torchaudio.compliance.kaldi.mfcc(waveform=audio_tensor, dither=0)
torch_feats = np.transpose(torch_feats.numpy())
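To put a number on how close the coefficients are, a small comparison along these lines might help (a sketch only; kaldi_feats and torch_feats are the arrays computed above, cropped in case the frame counts differ slightly):

# Both matrices are (num_ceps, num_frames) after the transposes above.
n = min(kaldi_feats.shape[1], torch_feats.shape[1])
diff = kaldi_feats[:, :n] - torch_feats[:, :n]

# Per-coefficient error statistics: a large, nearly constant offset in the
# first row with small errors elsewhere would point at an energy/scaling
# difference rather than a problem in the mel filterbank or DCT.
print(np.abs(diff).mean(axis=1))
print(np.abs(diff).max(axis=1))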
I set dither to 0 according to #157. If dither is a low number (e.g. 0.1), the features are still very similar, but if dither is 1 they are completely different; that, however, is a separate problem.
Why do I get this difference in the first coefficients? Is there some normalization step done inside pytorch's compliance library that is not done in kaldi?
Update: I have gone back to the spectrogram level to try to find the bug. If I set the subtract_mean flag to true in both Kaldi and PyTorch, the resulting spectrograms are (almost) the same. But if I set it to false (which is the default), the results are different: they have the same pattern but a different mean.
Kaldi code to generate spectrograms with mean subtraction:
~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=true --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark
Pytorch code to generate spectrograms with mean subtraction:
torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor, dither=0, energy_floor=1.0, subtract_mean=True)
Result: (spectrogram comparison figure; the two outputs are nearly identical.)
Kaldi code to generate spectrograms without mean subtraction:
~/kaldi/src/featbin/compute-spectrogram-feats --subtract-mean=false --dither=0.0 --energy-floor=1.0 scp,p:wav.scp ark:generated_feats/spec.ark
Pytorch code to generate spectrograms without mean subtraction:
torch_feats = torchaudio_local.spectrogram(waveform=audio_tensor, dither=0, energy_floor=1.0, subtract_mean=False)
Result: (spectrogram comparison figure; same pattern but a different mean.)
I suspect that there is something in the FFT computation that normalizes in one case but not in the other. Any thoughts?
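One hypothesis worth ruling out (an illustration, not something confirmed from the torchaudio code): a constant gain on the waveform adds the same constant to every bin of a log power spectrogram, which subtract_mean then removes, and that matches the symptom described above. For example, Kaldi's wav reader works with int16-range sample values, while torchaudio.load (with its default normalization) returns samples in [-1, 1], which acts as exactly such a gain. A small sketch of the identity using the compliance spectrogram:

import math
import torch
import torchaudio

wav, sr = torchaudio.load('LibriSpeech/test-clean/1089/134686/1089-134686-0000.flac')

# Same spectrogram, once on the [-1, 1] waveform and once scaled back up to
# the int16 full-scale range (2**15).
spec = torchaudio.compliance.kaldi.spectrogram(wav, dither=0.0, energy_floor=1.0)
spec_scaled = torchaudio.compliance.kaldi.spectrogram(wav * (1 << 15), dither=0.0, energy_floor=1.0)

# Most log power bins differ by (roughly) the constant 2 * ln(2**15), so a
# per-utterance mean subtraction hides the gain difference completely.
print((spec_scaled - spec).mean().item(), 2 * math.log(1 << 15))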
Also, torchaudio.compliance.kaldi.mfcc doesn't produce exactly the same output as compute-mfcc-feats when htk_compat is False.
When it is true:
import torchaudio
import torch
import numpy as np
import librosa

wave_file = 'VAD_demo.wav'
audio, sample_rate = librosa.load(wave_file, sr=16000)

torchaudio.compliance.kaldi.mfcc(
    waveform=torch.Tensor(audio).unsqueeze(0),
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=True,
    snip_edges=False)
tensor([[ 1.5239e+01, 4.0710e+00, -9.1076e+00, ..., -1.3214e+00,
-3.4604e+00, -7.5218e+01],
[ 1.5456e+01, 3.7524e+00, -9.5476e+00, ..., -7.0259e-01,
-2.7650e+00, -7.4910e+01],
[ 1.5704e+01, 3.8294e+00, -9.3671e+00, ..., -1.4487e-01,
-1.7066e+00, -7.4704e+01],
...,
[ 8.0927e+00, -7.5448e+00, -8.8725e+00, ..., 7.1561e-01,
-1.3807e+00, -6.7251e+01],
[ 8.0175e+00, -8.2607e+00, -1.1727e+01, ..., 2.2736e-02,
-2.0510e+00, -6.9318e+01],
[ 7.5882e+00, -8.9392e+00, -1.3061e+01, ..., -9.1568e-01,
-2.5486e+00, -7.0623e+01]])
and
compute-mfcc-feats --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64 --snip-edges=false --htk-compat=true --dither=0 --energy-floor=1 scp:wav.scp ark,t:-
WARNING (compute-mfcc-feats[5.5.689~5-2f3d]:Read():wave-reader.cc:260) Expected 95516 bytes in RIFF chunk, but after first data block there will be 36 + 95280 bytes (we do not support reading multiple data chunks).
wav [
15.23852 4.070992 -9.107697 9.433752 -9.323326 15.62674 -9.407998 0.7214716 0.9599227 -5.44089 -11.26766 -6.221908 1.175821 -0.5180629 -2.197447 -8.726407 -1.848569 3.453872 4.1832 -2.637658 -0.2607065 -0.1641882 -0.0722182 -1.466432 -0.9918076 -0.1467309 4.546407 4.537716 -1.037911 6.192455 3.814458 3.485815 13.67851 7.93261 0.2159225 -2.56819 -12.5645 -7.210853 1.150025 0.793225 -0.577466 -1.334647 -0.0593026 0.081699 -0.9379031 3.676176 1.271123 2.297236 5.066513 0.2240368 -0.7410111 -2.629601 3.790086 9.951916 4.806147 3.494368 1.054438 6.064842 1.139308 -4.126108 0.1679086 -1.321496 -3.460518 17.38356
15.45577 3.752458 -9.547602 10.18087 -9.88015 15.14798 -10.30561 0.4286905 -1.160449 -5.356637 -13.52929 -6.287968 -0.5804175 0.5506579 -1.780558 -5.887392 -1.011132 3.633945 3.231638 -3.511573 -0.1174743 -0.03293979 -0.09964671 -1.642807 -0.7558402 -0.1005801 3.603968 4.117401 -1.777455 4.002991 2.848278 4.0697 13.47057 5.628294 -0.7281701 -2.973053 -10.5222 -5.910959 1.049334 -0.4542561 -1.303142 -0.8375537 -0.03831673 0.02188638 -0.7784259 2.942248 0.9490441 2.425136 4.28632 -0.1520254 -0.2199809 -2.220483 1.599042 6.40525 2.825276 4.396042 3.431517 6.499466 1.625714 -3.128053 1.206802 -0.7026786 -2.765082 17.25393
...
They match. But with htk_compat set to false:
torchaudio.compliance.kaldi.mfcc(
    waveform=torch.Tensor(audio).unsqueeze(0),
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=False,
    dither=0,
    energy_floor=1,
    snip_edges=False)
tensor([[-5.3187e+01, 1.5239e+01, 4.0710e+00, ..., 1.6768e-01,
-1.3214e+00, -3.4604e+00],
[-5.2969e+01, 1.5456e+01, 3.7524e+00, ..., 1.2066e+00,
-7.0259e-01, -2.7650e+00],
[-5.2823e+01, 1.5704e+01, 3.8294e+00, ..., 2.3663e+00,
-1.4487e-01, -1.7066e+00],
...,
[-4.7554e+01, 8.0927e+00, -7.5448e+00, ..., -6.1891e+00,
7.1561e-01, -1.3807e+00],
[-4.9015e+01, 8.0175e+00, -8.2607e+00, ..., -6.8149e+00,
2.2736e-02, -2.0510e+00],
[-4.9938e+01, 7.5882e+00, -8.9392e+00, ..., -7.7555e+00,
-9.1568e-01, -2.5486e+00]])
compute-mfcc-feats --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64 --snip-edges=false --htk-compat=false --dither=0 --energy-floor=1 scp:wav.scp ark,t:-
WARNING (compute-mfcc-feats[5.5.689~5-2f3d]:Read():wave-reader.cc:260) Expected 95516 bytes in RIFF chunk, but after first data block there will be 36 + 95280 bytes (we do not support reading multiple data chunks).
wav [
17.38356 15.23852 4.070992 -9.107697 9.433752 -9.323326 15.62674 -9.407998 0.7214716 0.9599227 -5.44089 -11.26766 -6.221908 1.175821 -0.5180629 -2.197447 -8.726407 -1.848569 3.453872 4.1832 -2.637658 -0.2607065 -0.1641882 -0.0722182 -1.466432 -0.9918076 -0.1467309 4.546407 4.537716 -1.037911 6.192455 3.814458 3.485815 13.67851 7.93261 0.2159225 -2.56819 -12.5645 -7.210853 1.150025 0.793225 -0.577466 -1.334647 -0.0593026 0.081699 -0.9379031 3.676176 1.271123 2.297236 5.066513 0.2240368 -0.7410111 -2.629601 3.790086 9.951916 4.806147 3.494368 1.054438 6.064842 1.139308 -4.126108 0.1679086 -1.321496 -3.460518
17.25393 15.45577 3.752458 -9.547602 10.18087 -9.88015 15.14798 -10.30561 0.4286905 -1.160449 -5.356637 -13.52929 -6.287968 -0.5804175 0.5506579 -1.780558 -5.887392 -1.011132 3.633945 3.231638 -3.511573 -0.1174743 -0.03293979 -0.09964671 -1.642807 -0.7558402 -0.1005801 3.603968 4.117401 -1.777455 4.002991 2.848278 4.0697 13.47057 5.628294 -0.7281701 -2.973053 -10.5222 -5.910959 1.049334 -0.4542561 -1.303142 -0.8375537 -0.03831673 0.02188638 -0.7784259 2.942248 0.9490441 2.425136 4.28632 -0.1520254 -0.2199809 -2.220483 1.599042 6.40525 2.825276 4.396042 3.431517 6.499466 1.625714 -3.128053 1.206802 -0.7026786 -2.765082
...
Also, I assume the default dither and energy_floor parameters don't match either; if they're not explicitly set to 0 and 1 respectively, the results also differ.
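One way to sidestep the default mismatch is to pass those options explicitly on the torchaudio side so they mirror the compute-mfcc-feats flags, rather than relying on either tool's defaults (a sketch only; the waveform here is a stand-in, and default values may differ between torchaudio versions):

import torch
import torchaudio

# Stand-in 16 kHz mono waveform of shape (1, num_samples); in practice this
# would be the audio loaded above.
waveform = torch.randn(1, 16000)

# Mirror --frame-length=100 --frame-shift=10 --num-ceps=64 --num-mel-bins=64
# --snip-edges=false --htk-compat=true --dither=0 --energy-floor=1 explicitly.
feats = torchaudio.compliance.kaldi.mfcc(
    waveform=waveform,
    frame_length=100,
    frame_shift=10,
    num_ceps=64,
    num_mel_bins=64,
    htk_compat=True,
    snip_edges=False,
    dither=0.0,
    energy_floor=1.0)
print(feats.shape)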
Hi, thank you very much for this very useful project.
I started doing some speech recognition experiments with the MFCC features implemented in torchaudio. In particular, I tried the librosa-style ones implemented in torchaudio/transforms.py and the Kaldi-style ones implemented in torchaudio/compliance/kaldi.py.
The librosa-style features are computed very efficiently, and I can achieve results similar to those of the original Kaldi features when changing some hyperparameters (i.e., n_mfcc=13, hop_length=160, n_mels=23, f_min=80, f_max=7900).
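For concreteness, a minimal sketch of that configuration through torchaudio.transforms.MFCC (assuming the melkwargs dictionary, which the transform forwards to MelSpectrogram; the waveform below is a stand-in):

import torch
import torchaudio

# Stand-in 16 kHz mono waveform of shape (1, num_samples).
waveform = torch.randn(1, 16000)

# The hyperparameters mentioned above: 13 cepstra, a 10 ms hop, 23 mel bins,
# and an 80-7900 Hz analysis band.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000,
    n_mfcc=13,
    melkwargs={'hop_length': 160, 'n_mels': 23, 'f_min': 80.0, 'f_max': 7900.0})

feats = mfcc(waveform)  # shape: (channels, n_mfcc, time)
print(feats.shape)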
The other issue is that the current compliance/kaldi version doesn't support CUDA and can only process up to two channels at a time. Also, it is significantly slower than the librosa implementation (there could be a bottleneck somewhere).
Any ideas? I hope my feedback is helpful.
Thank you
Mirco