pytorch / audio

Data manipulation and transformation for audio signal processing, powered by PyTorch
https://pytorch.org/audio
BSD 2-Clause "Simplified" License
2.48k stars 640 forks

Revise parameters for Kaldi mfcc compatibility test #689

Open mthrok opened 4 years ago

mthrok commented 4 years ago

Similar to #679

We should also revise the parameters for the mfcc test.

See also #681

engineerchuan commented 4 years ago

I would like to work on this.

mthrok commented 4 years ago

> I would like to work on this.

Hi @engineerchuan

Thanks. Do you know what good parameters for mfcc would be? I am not an expert, but we can consult with our collaborators.

engineerchuan commented 4 years ago

Not off the top of my head. Let me study it first for a day and come up with a proposal.

engineerchuan commented 4 years ago

Hi @mthrok,

I would like to propose the following approach, along with some questions:

  1. For testing compute-fbank-feats and compute-mfcc-feats, first extract the default argument values.
  2. As suggested in https://github.com/pytorch/audio/issues/679, use example datasets to augment more examples of "valid" kaldi parameters both for fbank and mfcc.

Question 1: How should we store and keep the default values for fbank and mfcc up to date?

Recommendation - cache the default fbank values and the override values in JSON. In the future, revise them manually if the Kaldi or example dataset default argument values change.
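As a sketch of what such a cache could look like (the file layout, option names, and values below are illustrative assumptions, not an agreed format or an authoritative Kaldi dump):

```python
import json

# Hypothetical pinned snapshot of default argument values; the option names
# and values here are illustrative, not an authoritative Kaldi dump.
PINNED_DEFAULTS = {
    "mfcc": {"num-ceps": 13, "num-mel-bins": 23, "use-energy": True},
    "fbank": {"num-mel-bins": 23, "use-energy": False},
}

def save_defaults(path):
    # Persist the pinned defaults as JSON for later comparison.
    with open(path, "w") as f:
        json.dump(PINNED_DEFAULTS, f, indent=2, sort_keys=True)

def detect_drift(path, current):
    # Return the feature types whose current defaults no longer match the cache.
    with open(path) as f:
        pinned = json.load(f)
    return {name for name, args in current.items() if pinned.get(name) != args}
```

A test run could then fail (or warn) whenever detect_drift reports a non-empty set, prompting a manual revision of the cached values.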

Question 2: How should we handle datasets that don't have an fbank config or don't have an mfcc config?

Recommendation - we should only use configs from datasets for testing fbank or mfcc if they have the respective config.

Example: Switchboard has both fbank and mfcc config, thus we will use both for testing.

Example: librispeech only stores an mfcc config, thus we will not use librispeech for testing fbank.
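The selection rule in the two examples above could be sketched as follows (the config inventory is illustrative; real Kaldi recipes keep such files under each example's conf directory):

```python
# Illustrative inventory of which Kaldi example configs each dataset ships with.
# The entries mirror the examples above; the file names are assumptions.
DATASET_CONFIGS = {
    "switchboard": {"fbank": "fbank.conf", "mfcc": "mfcc.conf"},
    "librispeech": {"mfcc": "mfcc.conf"},
}

def datasets_for(feature):
    # Only use a dataset for testing `feature` if it ships that config.
    return sorted(name for name, confs in DATASET_CONFIGS.items() if feature in confs)
```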

Question 3: What should we do with generate_fbank_data.py?

Currently, generate_fbank_data.py generates random parameters, which may be invalid. We could have it make network wget calls to the relevant repositories, if possible, to retrieve and parse the values. It could also inspect the Kaldi source directory, or run the executable with --help and parse out the default values. This sounds hacky, though, so maybe we should skip it for now.
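For what it's worth, parsing defaults out of a --help dump could look roughly like the sketch below. The sample help text is made up, and the parser only assumes that help lines end in a "(type, default = value)" suffix, which should be verified against actual Kaldi output:

```python
import re

# Made-up excerpt in a "(type, default = value)" style; this is NOT
# verbatim compute-mfcc-feats output.
SAMPLE_HELP = """
--use-energy : Use energy (not C0) in MFCC computation (bool, default = true)
--num-ceps : Number of cepstra in MFCC computation (int, default = 13)
--dither : Dithering constant (float, default = 1)
"""

def parse_defaults(help_text):
    # Extract {option: default} pairs from help lines of the assumed shape.
    pattern = re.compile(r"--([\w-]+)\s*:.*\(\w+,\s*default\s*=\s*([^)]*)\)")
    return {m.group(1): m.group(2).strip() for m in pattern.finditer(help_text)}
```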

mthrok commented 4 years ago

> Question 1: How should we store and keep the default values for fbank and mfcc up to date?
>
> Recommendation - cache the default fbank values and the override values in JSON. In the future, revise them manually if the Kaldi or example dataset default argument values change.

I am not quite sure what you mean by "cache", but in terms of the JSON data, I think providing empty arguments, {}, would result in the default parameters in both the Kaldi CLI and torchaudio's implementation. That way, if Kaldi changes its default values, we can notice. Then we can add arguments with the current default values, {"allow_downsample": false, "allow_upsample": false, ...}. I think the latter is what you mean by caching.

BTW: currently, the Kaldi used in the test CI is updated manually; I do it from time to time by building a new Docker image and pushing it. Although we plan to update it automatically, we do not know when that will happen.

Also, note that there are some parameter discrepancies due to inconsistent design. Kaldi expects a full-range waveform, whereas a typical torchaudio functional expects a normalized waveform, yet the torchaudio.compliance.kaldi module expects full-range values, which confuses users. See https://github.com/pytorch/audio/issues/371#issuecomment-625613872 and https://github.com/pytorch/audio/issues/328. I think for this test case we can use load_wav with normalize=False, but you might hit something. We have an idea of making the kaldi module consistent with the rest of the code base, but we have not planned the work items yet.
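To make the full-range vs. normalized discrepancy concrete, here is a tiny pure-Python illustration; dividing by 2**15 is the usual int16 scaling, though torchaudio's exact behavior should be checked against its docs:

```python
# int16 PCM samples as a Kaldi-style "full range" waveform.
full_range = [0, 16384, -32768, 32767]

# Normalized waveform in [-1.0, 1.0), the convention most torchaudio
# functionals expect; dividing by 2**15 is the usual int16 scaling.
normalized = [s / 32768.0 for s in full_range]

def denormalize(samples):
    # Scale a normalized waveform back to full int16 range for Kaldi-style code.
    return [int(s * 32768.0) for s in samples]
```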

> Question 2: How should we handle datasets that don't have an fbank config or don't have an mfcc config?
>
> Recommendation - we should only use configs from datasets for testing fbank or mfcc if they have the respective config.
>
> Example: Switchboard has both fbank and mfcc config, thus we will use both for testing.
>
> Example: librispeech only stores an mfcc config, thus we will not use librispeech for testing fbank.

Yes, that makes sense.

> Question 3: What should we do with generate_fbank_data.py?
>
> Currently, generate_fbank_data.py generates random parameters, which may be invalid. We could have it make network wget calls to the relevant repositories, if possible, to retrieve and parse the values. It could also inspect the Kaldi source directory, or run the executable with --help and parse out the default values. This sounds hacky, though, so maybe we should skip it for now.

generate_fbank_data.py is obsolete and provides no value, so we can simply delete it. It would be nice if our tests could incorporate the latest changes on the Kaldi side automatically, but at the moment the priority is to get good coverage of valid use cases. That by itself is a great improvement.

Also, making tests depend on external resources (networking, files stored elsewhere) increases maintenance cost, so we would like to refrain from doing that. Parsing the help message of the executables is plausible, because they are available, but let's defer that one. We can discuss the extra value of doing it once we have a good set of values to test.

kiranzo commented 1 day ago

Hi @engineerchuan, I'm working on refactoring legacy code in our project: we have a 40 MB Kaldi MFCC binary which we would like to replace with torchaudio.compliance.kaldi.mfcc.

I managed to get nearly identical results between calling the Kaldi binary and torchaudio.compliance.kaldi.mfcc. I don't know what my Kaldi binary's version is, but it and the torchaudio implementation have several different default params; see the table below:

| parameter | torch_value | kaldi_value |
| --- | --- | --- |
| blackman_coeff | 0.42 | 0.42 |
| cepstral_lifter | 22.0 | 22 |
| channel | -1 | -1 |
| dither | 0.0 | 1 |
| energy_floor | 1.0 | 0 |
| frame_length | 25.0 | 25 |
| frame_shift | 10.0 | 10 |
| high_freq | 0.0 | 0 |
| htk_compat | False | False |
| low_freq | 20.0 | 20 |
| num_ceps | 13 | 13 |
| min_duration | 0.0 | 0 |
| num_mel_bins | 23 | 23 |
| preemphasis_coefficient | 0.97 | 0.97 |
| raw_energy | True | True |
| remove_dc_offset | True | True |
| round_to_power_of_two | True | True |
| sample_frequency | 16000.0 | 16000 |
| snip_edges | True | True |
| subtract_mean | False | False |
| use_energy | False | True |
| vtln_high | -500.0 | -500 |
| vtln_low | 100.0 | 100 |
| vtln_warp | 1.0 | 1 |
| window_type | povey | povey |
| allow_downsample | | False |
| allow_upsample | | False |
| debug_mel | | False |
| max_feature_vectors | | -1 |
| output_format | | kaldi |
| utt2spk | | "" |
| vtln_map | | "" |

As you can see, aside from several params missing on the torchaudio side, dither, energy_floor, and use_energy are set to opposite values. (Also, Kaldi applies gigantic dithering by default, so I spent a good portion of today trying to understand why the two sets of results didn't match.)
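Based on the table, one way to make torchaudio.compliance.kaldi.mfcc mimic this particular binary's defaults is to override the three differing parameters. This is only a sketch based on the values reported above, so verify it against your own Kaldi build:

```python
# Overrides for the three defaults that differ in the table above; the values
# mirror the kaldi_value column reported for this particular binary.
KALDI_LIKE_OVERRIDES = {
    "dither": 1.0,        # torchaudio default: 0.0
    "energy_floor": 0.0,  # torchaudio default: 1.0
    "use_energy": True,   # torchaudio default: False
}

# Usage sketch (commented out because it requires torchaudio and a wav file):
# import torchaudio
# waveform, sr = torchaudio.load("utt.wav", normalize=False)  # full-range samples
# feats = torchaudio.compliance.kaldi.mfcc(waveform.float(), **KALDI_LIKE_OVERRIDES)
```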