Open jooan84 opened 4 years ago
dither=1.0
which adds dither. I looked into this, and it took a while to figure out why.
When you use the fbank function, you need to normalize the audio, and for that you need to use the torchaudio.load_wav function instead of torchaudio.load.
See my test or existing test.
This is extremely subtle.
@mthrok - should we add documentation about this or otherwise try to prevent this issue coming up again in the future? I'm surprised we have a need for a separate load_wav to begin with.
I second @cpuhrsch: I'm also surprised that torchaudio.load does not work here.
I don't believe we should rely on load_wav to fix this issue.
Edit: After some testing, it seems that to get the closest match one has to skip normalisation and instead multiply by 2**15?
@mthrok normalising the audio does not help for me. Code:
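# Imports inferred from the aliases used below (sf, to, fbank)
import kaldi_io
import soundfile as sf
import torch as to
from torchaudio.compliance.kaldi import fbank

data, fs = sf.read('/idiap/resource/database/LibriSpeech/train-clean-360/100/121669/100-121669-0000.flac')
data = to.from_numpy(data).float()
data /= data.max()  # peak-normalise the waveform
f = fbank(data.unsqueeze(0), num_mel_bins=40, low_freq=40, high_freq=7600)
kaldi_feats = None
for uttid, m in kaldi_io.read_mat_scp('scp:feats.scp'):
    kaldi_feats = m
    print(uttid)
print(kaldi_feats[:2])
print(f[:2])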
output
100-121669-0000-1
[[ 8.129056 7.732553 7.6204824 6.776312 7.437045 8.823427
8.736998 8.304144 8.411314 8.19662 6.130655 8.646175
9.119083 9.085771 8.314858 9.277414 9.7172785 9.830122
9.228786 9.078177 9.063866 9.667826 8.975353 9.46149
9.655378 9.932469 9.935007 10.056624 9.357061 10.264997
10.36901 10.563572 10.689384 11.149243 11.518983 10.866757
10.359279 10.542366 11.021458 10.561819 ]
[ 8.081877 7.8777122 6.87261 8.406 9.237014 8.542725
7.0748315 7.555811 8.742043 9.1879 7.651375 7.56339
8.07299 9.343008 9.155113 9.235215 9.285145 9.729772
9.2692585 9.870285 10.123455 9.58822 9.321457 9.46149
9.285657 9.631441 11.042232 10.012186 9.731838 9.504875
10.895826 10.652676 10.899666 10.996901 10.666897 11.006931
10.998066 11.225334 11.071218 10.741457 ]]
tensor([[-10.5861, -10.9795, -11.1278, -11.9309, -11.2997, -9.8805, -9.8985,
-10.3205, -10.3428, -10.5305, -12.9941, -10.0486, -9.5567, -9.6991,
-10.3325, -9.3442, -8.9814, -8.8237, -9.3472, -9.6113, -9.7424,
-8.9508, -9.7846, -9.3923, -8.8430, -8.8997, -8.7163, -8.5314,
-9.2710, -8.6714, -8.3952, -8.3978, -8.0870, -7.5590, -7.4100,
-7.9227, -8.4362, -8.7195, -8.0624, -8.5884],
[-10.5894, -10.7786, -11.8293, -10.2971, -9.4618, -10.1934, -11.7973,
-11.3098, -10.0636, -9.5083, -10.8814, -11.2168, -10.6213, -9.4451,
-9.5788, -9.5073, -9.5189, -8.9797, -9.5143, -8.6416, -8.4359,
-9.1466, -9.2892, -9.3173, -9.4014, -9.2642, -7.6490, -8.6838,
-9.0432, -9.5034, -7.9339, -7.9784, -7.9248, -7.8987, -8.2526,
-8.2896, -8.0052, -7.9586, -8.1519, -8.1042]])
Also in my opinion if this is an important requirement then the function should check that the max is equal to 1 and warn otherwise.
Btw I don't think it's good to make the assumption of normalising audio as you can't do this in a realtime setting.
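A guard of the kind suggested here could look roughly like the sketch below. It follows the later replies in this thread, which found that fbank expects values in the 16-bit range, so it warns when the input looks like a [-1, 1] waveform rather than checking for a peak of exactly 1. The helper name warn_if_probably_normalized and its threshold are hypothetical, not part of torchaudio, and the check is only a heuristic: a very quiet 16-bit recording would also trigger it.

```python
import warnings


def warn_if_probably_normalized(samples, threshold=1.0):
    """Warn when the peak amplitude suggests a [-1, 1] float waveform.

    Hypothetical helper for illustration; not a torchaudio function.
    Returns True if a warning was issued, False otherwise.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak <= threshold:
        warnings.warn(
            f"peak amplitude is {peak:.4f}; the input looks normalized to "
            "[-1, 1], but Kaldi-style features expect the 16-bit range",
            stacklevel=2,
        )
        return True
    return False
```

Because it only inspects a peak value and never rescales, a check like this would also work on streaming input, where normalising by the global maximum is not possible.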
Hi @RuABraun
As you figured out, normalization here means dtype conversion, that is, from float (with value range [-1, 1]) to int16 (with value range [-32768, 32767]).
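As a minimal sketch of that conversion, the following plain-Python function maps float samples in [-1, 1] onto the 16-bit value range by multiplying by 2**15. The function name is made up for illustration, and the clamping of +1.0 to 32767 is an assumption of this sketch, not something confirmed about torchaudio or Kaldi.

```python
INT16_SCALE = 1 << 15  # 32768


def float_to_int16_range(samples):
    """Scale float samples in [-1, 1] to floats in [-32768, 32767].

    Hypothetical helper for illustration only.
    """
    scaled = []
    for s in samples:
        v = s * INT16_SCALE
        # +1.0 would map to 32768, one past the int16 maximum, so clamp
        # (an assumption of this sketch).
        scaled.append(max(-32768.0, min(32767.0, v)))
    return scaled
```

The result stays a float; only the value range changes, which is the form of input reported to work later in this thread.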
According to my recent talk with @cpuhrsch, this fbank feature is not intended to match Kaldi's implementation precisely.
I found that our test suite for this function, which I thought covered this, was not sufficient, and the function does not match Kaldi's result.
I personally think that it is more confusing to have a module named compliance which is implicitly not meant to match. Also, we are getting rid of the load_wav function, so we do need to change things around the compliance.kaldi module.
To lower the maintenance cost, I am in favor of building Kaldi and binding to it, which guarantees that all the Kaldi-related features match Kaldi's results exactly, but that opinion is not getting support from anyone.
A similar issue is raised in #328.
Thank you for the explanation! :)
We also had the same problem two days ago with the setting subtract_mean = False.
We compared the results of torchaudio's fbank and Kaldi's compute-fbank-feats line by line. The differences came from the input values.
It is really confusing that the input to torchaudio's fbank should be floating-point numbers in the range [-32768., 32767.] (not float [-1., 1.] or int16 [-32768, 32767]).
We fixed the problem by loading a 16-bit .wav file with dtype='int16' and converting the signal values to float directly, without any normalization. For example, we converted the value -3 to -3.0.
After fixing, the result shows that:
err between kaldi and torchaudio res (1.2798111e-07, 1.177518e-05, 7.390976e-05)
kaldi res: tensor([[ 7.7390, 6.6414, 5.9847, ..., 11.2153, 10.8115, 10.6624],
[ 8.3844, 7.3069, 5.6935, ..., 11.3059, 11.9750, 11.1324],
[ 5.9230, 4.5791, 6.4441, ..., 11.5842, 12.4497, 11.9442],
...,
[ 8.3075, 7.4419, 6.3531, ..., 8.7440, 9.0616, 8.9001],
[ 7.9940, 7.1240, 4.0873, ..., 8.4048, 8.6729, 9.1240],
[ 8.7946, 6.5140, 6.0803, ..., 8.8812, 8.5578, 8.0560]])
torchaudio res tensor([[ 7.7390, 6.6414, 5.9847, ..., 11.2153, 10.8115, 10.6624],
[ 8.3844, 7.3069, 5.6935, ..., 11.3059, 11.9750, 11.1324],
[ 5.9230, 4.5791, 6.4441, ..., 11.5842, 12.4497, 11.9442],
...,
[ 8.3075, 7.4419, 6.3531, ..., 8.7440, 9.0616, 8.9001],
[ 7.9940, 7.1240, 4.0873, ..., 8.4048, 8.6729, 9.1240],
[ 8.7946, 6.5140, 6.0803, ..., 8.8812, 8.5578, 8.0560]])
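The loading step described above (read 16-bit samples as integers, then cast to float with no scaling) can be sketched with the Python standard library alone. This is not the commenter's code, which was not posted; the helper names read_pcm16_as_float and write_pcm16 are hypothetical, and the sketch assumes a mono 16-bit PCM WAV file.

```python
import array
import os
import sys
import tempfile
import wave


def read_pcm16_as_float(path):
    """Read a mono 16-bit PCM WAV file and return (samples, sample_rate).

    Samples are floats holding the raw integer values (e.g. -3 -> -3.0);
    no division by 32768 is applied.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected mono 16-bit PCM")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    ints = array.array("h", raw)
    if sys.byteorder == "big":
        ints.byteswap()  # WAV sample data is little-endian
    return [float(v) for v in ints], sample_rate


def write_pcm16(path, ints, sample_rate=16000):
    """Write integer samples as a mono 16-bit PCM WAV file (demo helper)."""
    data = array.array("h", ints)
    if sys.byteorder == "big":
        data.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(data.tobytes())


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    write_pcm16(demo_path, [-3, 0, 32767, -32768])
    print(read_pcm16_as_float(demo_path))
```

The returned list could then be wrapped in a tensor of shape (1, num_samples) before being passed to fbank.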
Can you please share your code? It would be very useful!
@BattashB
Something like this
import torchaudio

waveform, sample_rate = torchaudio.load(<file>)  # waveform is float32, value range [-1, 1]
waveform = waveform * (1 << 15)  # 2**15 = 32768: convert the value range to [-32768., 32767.]
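To see why the input scale matters so much, note that multiplying a waveform by a constant c multiplies its power by c**2, so every log-power value shifts by 2*ln(c). The sketch below checks this on a toy frame, using a plain sum of squares as a stand-in for one filterbank bin. It assumes log-of-power features and ignores dither and energy flooring. The function names are illustrative only.

```python
import math


def log_energy(samples):
    """Natural log of the sum of squared samples (stand-in for one log-power bin)."""
    return math.log(sum(s * s for s in samples))


def log_offset_from_scaling(samples, scale):
    """Shift in log energy caused by multiplying every sample by `scale`."""
    scaled = [s * scale for s in samples]
    return log_energy(scaled) - log_energy(samples)


frame = [0.01, -0.02, 0.03, -0.015]  # toy frame in [-1, 1]
scale = 1 << 15                      # 32768, the 16-bit full scale
offset = log_offset_from_scaling(frame, scale)
expected = 2 * math.log(scale)       # about 20.79
print(offset, expected)
```

This is consistent with the first comparison in this thread, where the Kaldi and torchaudio values differ by a roughly constant gap of about 18.7 across bins, which is what a pure scale mismatch would produce. The gap is smaller than 20.79 there, plausibly because that waveform was divided by its own peak rather than by the full-scale value, though that is an inference from the printed numbers.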
🐛 Bug
The output of the fbank feature calculations differs from that of kaldi.
To Reproduce
Steps to reproduce the behavior:
Using the following parameters, or even the defaults:
produce this output:
with compute_fbank_feats of Kaldi