Open speechbrain opened 5 years ago
Let's start with 1. I don't see the fluctuations that you mention. I'm using this file.
import soundfile as sf
import torch
import torchaudio

filename = "sample.wav"
waveform, sample_rate = sf.read(filename)
# soundfile returns (frames, channels); average the channels to mono and
# add the batch dimension expected by torchaudio transforms
waveform = torch.from_numpy(waveform).mean(-1).unsqueeze(0).float()
mfcc = torchaudio.transforms.MFCC()
mfcc(waveform)
mfcc(waveform)
mfcc(waveform)
mfcc(waveform)
mfcc(waveform)
I get the same output after each call to mfcc(waveform):
tensor([[[-7.3096e+02, -7.3096e+02, -7.1735e+02, ..., -5.1697e+02,
-5.3175e+02, -5.0290e+02],
[ 5.2458e-06, 5.2458e-06, 1.3515e+01, ..., 1.3600e+02,
1.2202e+02, 1.3729e+02],
[-7.4861e-05, -7.4861e-05, 1.3204e+01, ..., 9.8696e+00,
4.6198e+00, 7.2932e+00],
...,
[ 8.0109e-05, 8.0109e-05, -3.1892e-01, ..., -9.0881e+00,
-1.0949e+01, -1.1669e+01],
[-2.0412e-04, -2.0412e-04, -2.4082e+00, ..., -9.9953e+00,
-1.0371e+01, -7.6354e+00],
[-1.8072e-04, -1.8072e-04, 1.3875e+00, ..., -1.0247e+01,
-1.3864e+01, -5.6211e+00]]], grad_fn=<DifferentiableGraphBackward>)
For 5, I'm assuming you mean performance = runtime. Can you provide a comparison/measurement? Maybe even a code snippet for this?
The results from torchaudio.transforms remain the same; however, the results from torchaudio.compliance.kaldi change across executions. The problem is due to https://github.com/pytorch/audio/issues/263#issue-488758214; https://github.com/pytorch/audio/pull/228#issue-306506253 may have some problems.
tensor([[ 18.3451, -14.9228, -18.3650, ..., -6.3634, 1.8486, -8.7933],
[ 20.3241, -11.0113, -16.5513, ..., -10.2232, -2.2399, -13.0235],
[ 22.7283, 7.1456, -32.8585, ..., -14.6573, -22.6597, -8.8397],
...,
[ 15.3196, -18.4401, 3.7009, ..., -20.2965, 7.2492, -6.4407],
[ 15.3898, -19.9645, -5.6626, ..., -12.0694, 5.0258, -15.5496],
[ 14.6091, -23.0545, -7.0441, ..., -31.6500, -7.9601, -8.3047]]) torch.Size([7249, 13])
torch.Size([1, 580080]) 8000
tensor([[-18.8921, -14.9228, -18.3650, ..., -6.3634, 1.8486, -8.7933],
[ -8.8658, -11.0113, -16.5513, ..., -10.2232, -2.2399, -13.0235],
[ -2.7176, 7.1456, -32.8585, ..., -14.6574, -22.6597, -8.8396],
...,
[-33.5852, -18.4401, 3.7009, ..., -20.2965, 7.2492, -6.4408],
[-33.6183, -19.9645, -5.6626, ..., -12.0694, 5.0258, -15.5496],
[-36.0048, -23.0545, -7.0441, ..., -31.6500, -7.9601, -8.3048]]) torch.Size([7249, 13])
Found out: if dither is set to 0, the results are fixed (deterministic).
Hi, I'm trying to do some experiments with the kaldi-compliant MFCCs, but I ran into some possible issues:
1- When I run the following code, the MFCCs are different every time I run the script:
run 1
run 2
This is due to dithering, which is a type of noise. The problem can easily be solved by setting a manual seed (torch.manual_seed) before executing the code. To avoid surprises, users should be aware of this; it might be good to provide an example in the documentation.
2- Even if I remove the dithering in both the kaldi and torchaudio MFCCs, the two outputs are very different (the options in the two cases are exactly the same):
torchaudio
kaldi
3- When I put the tensor into the GPU (with .to('cuda')), I have the following error:
log_energy = torch.max(strided_input.pow(2).sum(1), epsilon).log()  # size (m)
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #2 'other'
How can I compute the MFCCs using cuda?
4- A good thing is that the execution time (even on a single CPU) is comparable with the kaldi one (around 15 seconds for the entire TIMIT dataset). I would expect a further speed-up on the GPU.
5- I tried to feed the MFCC coefficients into a standard speech recognizer, and the performance with the original kaldi coefficients is still much better. Do you have the same experience?
Thank you and thanks for developing this very useful toolkit!