pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License

Fbank features are different from Kaldi Fbank #400

Open jooan84 opened 4 years ago

jooan84 commented 4 years ago

🐛 Bug

The output of the fbank feature calculations differs from that of kaldi.

To Reproduce

Steps to reproduce the behavior:

Using the following, or even the default, parameters:

    torchaudio.compliance.kaldi.fbank(
        waveform, blackman_coeff=0.42, channel=-1, dither=1.0, energy_floor=0.0,
        frame_length=25.0, frame_shift=10.0, high_freq=0.0, htk_compat=True,
        low_freq=20.0, min_duration=0.0, num_mel_bins=40,
        preemphasis_coefficient=0.97, raw_energy=True, remove_dc_offset=True,
        round_to_power_of_two=True, sample_frequency=16000.0, snip_edges=True,
        subtract_mean=False, use_energy=False, use_log_fbank=True, use_power=True,
        vtln_high=-500.0, vtln_low=100.0, vtln_warp=1.0, window_type='hamming')[0]

produces this output:

tensor([-0.7616, -0.4791,  0.2155,  0.7661,  2.0723,  1.4565,  2.9888,  3.2548,
         1.8460,  3.5807,  3.8290,  4.1785,  4.6776,  4.5801,  5.3610,  4.4910,
         5.1519,  5.3534,  5.2783,  5.6159,  6.0689,  5.5961,  5.8068,  5.0957,
         6.5200,  6.9314,  6.1741,  7.0430,  7.9394,  8.2380,  8.7115,  8.4105,
         8.3154,  8.2186,  7.9444,  8.4468,  8.4293,  8.9476,  9.1008,  9.2495])

whereas Kaldi's compute-fbank-feats produces:

tensor([12.9911, 12.9795, 12.9127, 13.6171, 13.7416, 15.1579, 15.1996, 14.9468,
        14.1368, 14.8717, 14.8265, 13.8715, 15.2716, 15.0743, 15.2439, 15.3904,
        13.9460, 13.5932, 14.0038, 14.8721, 13.9944, 15.8337, 14.8682, 13.8247,
        15.0769, 15.1141, 15.1482, 14.7864, 13.6259, 14.4092, 14.1771, 13.6139,
        13.8014, 12.5796,  9.1051,  8.3382,  8.3738,  8.7829,  9.2973,  9.4913])

vincentqb commented 4 years ago

mthrok commented 4 years ago

I looked into this, and it took a while to figure out why.

When you use the fbank function, you need to normalize the audio, and for that you need to use the torchaudio.load_wav function instead of torchaudio.load.

See my test or existing test.

This is extremely subtle.

cpuhrsch commented 4 years ago

@mthrok - should we add documentation about this or otherwise try to prevent this issue coming up again in the future? I'm surprised we have a need for a separate load_wav to begin with.

vincentqb commented 4 years ago

I second @cpuhrsch: I'm also surprised that torchaudio.load does not work here.

vincentqb commented 4 years ago

I don't believe we should rely on load_wav to fix this issue.

RuABraun commented 3 years ago

Edit: after some testing, it seems that to get the closest match one has to skip normalisation and instead multiply the samples by 2**15?

@mthrok normalising the audio does not help for me. Code:

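    import kaldi_io
    import soundfile as sf  # assumed: `sf` is the soundfile package
    import torch as to  # assumed: `to` is torch, aliased as in the original snippet
    from torchaudio.compliance.kaldi import fbank

    # Load the FLAC file as floats and scale so that the peak sample is 1.
    data, fs = sf.read('/idiap/resource/database/LibriSpeech/train-clean-360/100/121669/100-121669-0000.flac')
    data = to.from_numpy(data).float()
    data /= data.max()
    f = fbank(data.unsqueeze(0), num_mel_bins=40, low_freq=40, high_freq=7600)

    # Keep the last matrix listed in the Kaldi feature script file.
    kaldi_feats = None
    for uttid, m in kaldi_io.read_mat_scp('scp:feats.scp'):
        kaldi_feats = m
    print(uttid)
    print(kaldi_feats[:2])
    print(f[:2])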

output

    100-121669-0000-1
    [[ 8.129056   7.732553   7.6204824  6.776312   7.437045   8.823427
   8.736998   8.304144   8.411314   8.19662    6.130655   8.646175
   9.119083   9.085771   8.314858   9.277414   9.7172785  9.830122
   9.228786   9.078177   9.063866   9.667826   8.975353   9.46149
   9.655378   9.932469   9.935007  10.056624   9.357061  10.264997
  10.36901   10.563572  10.689384  11.149243  11.518983  10.866757
  10.359279  10.542366  11.021458  10.561819 ]
 [ 8.081877   7.8777122  6.87261    8.406      9.237014   8.542725
   7.0748315  7.555811   8.742043   9.1879     7.651375   7.56339
   8.07299    9.343008   9.155113   9.235215   9.285145   9.729772
   9.2692585  9.870285  10.123455   9.58822    9.321457   9.46149
   9.285657   9.631441  11.042232  10.012186   9.731838   9.504875
  10.895826  10.652676  10.899666  10.996901  10.666897  11.006931
  10.998066  11.225334  11.071218  10.741457 ]]
tensor([[-10.5861, -10.9795, -11.1278, -11.9309, -11.2997,  -9.8805,  -9.8985,
         -10.3205, -10.3428, -10.5305, -12.9941, -10.0486,  -9.5567,  -9.6991,
         -10.3325,  -9.3442,  -8.9814,  -8.8237,  -9.3472,  -9.6113,  -9.7424,
          -8.9508,  -9.7846,  -9.3923,  -8.8430,  -8.8997,  -8.7163,  -8.5314,
          -9.2710,  -8.6714,  -8.3952,  -8.3978,  -8.0870,  -7.5590,  -7.4100,
          -7.9227,  -8.4362,  -8.7195,  -8.0624,  -8.5884],
        [-10.5894, -10.7786, -11.8293, -10.2971,  -9.4618, -10.1934, -11.7973,
         -11.3098, -10.0636,  -9.5083, -10.8814, -11.2168, -10.6213,  -9.4451,
          -9.5788,  -9.5073,  -9.5189,  -8.9797,  -9.5143,  -8.6416,  -8.4359,
          -9.1466,  -9.2892,  -9.3173,  -9.4014,  -9.2642,  -7.6490,  -8.6838,
          -9.0432,  -9.5034,  -7.9339,  -7.9784,  -7.9248,  -7.8987,  -8.2526,
          -8.2896,  -8.0052,  -7.9586,  -8.1519,  -8.1042]])

Also in my opinion if this is an important requirement then the function should check that the max is equal to 1 and warn otherwise.

Btw I don't think it's good to make the assumption of normalising audio as you can't do this in a realtime setting.
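
A check of the kind suggested above could look roughly like the following. This is only a sketch in plain Python: warn_if_unit_scaled is a hypothetical helper, not part of torchaudio, and in line with the edit about 2**15 it warns when the input looks like it is in [-1, 1] rather than when the peak differs from 1.

```python
import warnings

def warn_if_unit_scaled(samples, threshold=1.0):
    """Hypothetical helper, not part of torchaudio.

    Warns and returns True when every sample lies in [-threshold, threshold],
    which suggests float-normalized audio rather than int16-scale values.
    """
    peak = max((abs(x) for x in samples), default=0.0)
    if peak <= threshold:
        warnings.warn(
            "input peak is %.4g; Kaldi-style features expect int16-scale values" % peak
        )
        return True
    return False

# Capture warnings so the example runs quietly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged_float = warn_if_unit_scaled([0.25, -0.9, 0.5])        # normalized floats
    flagged_int16 = warn_if_unit_scaled([8192.0, -29491.0, 3.0])  # int16-scale floats
```

The check is a heuristic: a very quiet recording on the int16 scale could stay within [-1, 1] and trigger a false warning, which is one reason it only warns instead of raising.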

mthrok commented 3 years ago

Hi @RuABraun

As you figured out, normalization here means dtype conversion, that is, from float (with value range [-1, 1]) to int16 (with value range [-32768, 32767]).
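
To illustrate why the value range matters, here is a minimal sketch in plain Python. It is not torchaudio's or Kaldi's code; it assumes the log filterbank output behaves like the log of a sum of squared samples (as with use_power=True), and it ignores dither and flooring. Under that assumption, multiplying the waveform by 2**15 adds a constant 2 * ln(2**15), about 20.79, to every log energy.

```python
import math
import random

def log_power(frame):
    """Natural log of the frame's total energy (sum of squared samples)."""
    return math.log(sum(x * x for x in frame))

random.seed(0)
# Hypothetical frame of float samples in [-1, 1), as a normalizing loader returns.
frame_float = [random.uniform(-1.0, 1.0) for _ in range(400)]
# The same frame on the int16 scale (values remain floating point).
frame_int16_scale = [x * 2 ** 15 for x in frame_float]

# Scaling the samples by s scales the power by s**2, so the log shifts by 2 * ln(s).
shift = log_power(frame_int16_scale) - log_power(frame_float)
expected_shift = 2 * 15 * math.log(2)  # 2 * ln(2**15)
```

Because windowing, pre-emphasis and the mel filters are linear in the samples, the same constant offset would be expected per mel bin; additive dither of a fixed amplitude would break the exact equality.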

According to my recent talk with @cpuhrsch, this fbank feature is not intended to match Kaldi's implementation precisely.

I found that our test suite for this function, which I thought covered this case, was not sufficient, and that the function does not match Kaldi's result.

I personally think that it is more confusing to have a module named compliance that is implicitly not meant to match. Also, we are getting rid of the load_wav function, so we do need to change things around the compliance.kaldi module.

To lower the maintenance cost, I am in favor of building Kaldi and binding to it, which would guarantee that all the Kaldi-related features match Kaldi's results exactly, but that opinion is not getting support from anyone.

A similar issue is raised in #328.

RuABraun commented 3 years ago

Thank you for the explanation! :)

njusq commented 2 years ago

We ran into the same problem two days ago with subtract_mean=False. We compared the results of torchaudio's fbank and Kaldi's compute-fbank-feats line by line, and the differences came from the input values. It is really confusing that the input to torchaudio's fbank should be floating-point numbers in the range [-32768.0, 32767.0] (not float in [-1.0, 1.0] or int16 in [-32768, 32767]). We fixed the problem by loading a 16-bit .wav file with dtype='int16' and converting the sample values to float directly, without any normalization; for example, the value -3 becomes -3.0. After the fix, the results are:

err between kaldi and torchaudio res (1.2798111e-07, 1.177518e-05, 7.390976e-05)
kaldi res: tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])

torchaudio res tensor([[ 7.7390,  6.6414,  5.9847,  ..., 11.2153, 10.8115, 10.6624],
        [ 8.3844,  7.3069,  5.6935,  ..., 11.3059, 11.9750, 11.1324],
        [ 5.9230,  4.5791,  6.4441,  ..., 11.5842, 12.4497, 11.9442],
        ...,
        [ 8.3075,  7.4419,  6.3531,  ...,  8.7440,  9.0616,  8.9001],
        [ 7.9940,  7.1240,  4.0873,  ...,  8.4048,  8.6729,  9.1240],
        [ 8.7946,  6.5140,  6.0803,  ...,  8.8812,  8.5578,  8.0560]])
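
The loading step described in this comment can be sketched with the Python standard library alone. This is an illustration, not the commenter's code: read_int16_wav_as_float is a hypothetical helper, and the WAV data is generated in memory so the sketch runs without an audio file.

```python
import array
import io
import sys
import wave

def read_int16_wav_as_float(file_obj):
    """Read a 16-bit PCM WAV and return (samples, sample_rate).

    Samples are Python floats on the int16 scale, e.g. -3 becomes -3.0;
    they are not divided by 32768. Multi-channel data stays interleaved.
    """
    with wave.open(file_obj, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    pcm = array.array("h", raw)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [float(s) for s in pcm], sample_rate

# Build a tiny mono WAV in memory for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    data = array.array("h", [-3, 0, 7, -32768, 32767])
    if sys.byteorder == "big":
        data.byteswap()
    wav.writeframes(data.tobytes())
buffer.seek(0)

samples, sample_rate = read_int16_wav_as_float(buffer)
```

To use the result with torchaudio's fbank, the samples would presumably be wrapped in a float tensor of shape (1, num_samples) first; that step is left out here so the sketch has no dependencies.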

Wonder1905 commented 2 years ago

Can you please share your code? it will be very useful!

mthrok commented 2 years ago

Can you please share your code? it will be very useful!

@BattashB

Something like this

waveform, sample_rate = torchaudio.load(<file>)  # waveform is float32, value range [-1, 1]
waveform = waveform * (1 << 15)  # scale by 2**15 = 32768 to the int16 value range [-32768., 32767.]