Generate utt2num_frames with features

kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.

From kaldi-help, "Purpose and use of utt2dur":

@kkm000:

I always carry [the utt2dur] file, compute it carefully to the fifth decimal or have a script compute it, but I just realized I do not understand how it is used or why it is needed.

@danpovey:

Hm. It's needed by certain scripts. Sometimes it's used as a relatively fast way to generate utt2num_frames, which is used by the nnet training scripts for purposes like deciding how many egs files to create. However it's recommended to generate utt2num_frames directly when you dump the features, e.g. MFCC or PLP, as that code already knows and doesn't have to do big disk I/O to find out the info. We should probably change the default in the make_*_feats.sh scripts to generate those files by default, if it's not done already. I would merge the PR. There may possibly be other things that utt2dur is used for, e.g. maybe in diarization.
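To illustrate why the feature-dumping step "already knows" the frame count: it follows from the sample count and the windowing parameters alone, so no second pass over the audio is needed. The sketch below is not Kaldi code; the function name and defaults (16 kHz, 25 ms window, 10 ms shift) are assumptions chosen for illustration, and the formulas follow the usual snip-edges convention for framing.

```python
def num_frames(num_samples, sample_rate=16000,
               frame_length_ms=25.0, frame_shift_ms=10.0,
               snip_edges=True):
    """Number of feature frames for a waveform of num_samples samples.

    Hypothetical helper: the parameter names and defaults are
    assumptions, not read from any Kaldi config.
    """
    window = int(sample_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)    # samples per shift
    if snip_edges:
        # Only frames that fit entirely inside the waveform are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Without edge snipping, the count depends only on the shift.
    return (num_samples + shift // 2) // shift


def utt2num_frames_lines(utt2samples, **kwargs):
    """Format 'utt-id num-frames' lines, sorted by utterance id."""
    return [f"{utt} {num_frames(utt2samples[utt], **kwargs)}"
            for utt in sorted(utt2samples)]


if __name__ == "__main__":
    # Made-up utterance ids and sample counts.
    for line in utt2num_frames_lines({"utt2": 48000, "utt1": 16000}):
        print(line)
```

Because the extractor has the sample count in hand while it is framing the signal, emitting these lines alongside feats.scp costs essentially nothing compared with re-reading the audio later.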

@kkm000:

D-oh. Many recipes generate it, sometimes at a huge expense of running the full pipeline (e. g. LibriSpeech which uses flac apparently pumps all 1k hours twice; make_utt2dur.sh only supports wav and sphere directly). I am preparing a dataset right at the moment; I'll run through with utt2num_frames and without utt2dur and fix what breaks, if anything.

@danpovey, quick questions if/when you have time.

1. make_utt2dur.sh converts utt2num_frames only as a last resort, i.e. only when wav.scp is not present; otherwise it may run the whole pipeline. Is it sensible to prefer creating utt2dur from utt2num_frames when it is present? The only downside I see is the lower precision, down to the frame shift, i.e. typically 10 ms. Should be acceptable, I think?
2. Related to feature extraction in general. When I create data directories, I always place a copy of the feature config file that was used to extract the features alongside feats.scp, under a name matching the feature type (e.g. mfcc.conf, even if conf/mfcc_hires.conf was used). It helps me a ton to avoid mixing up variously extracted features. Do you think we should just always do that?
3. utils/data/get_frame_shift.sh first blindly creates utt2dur (expensive!), and only then checks whether the file frame_shift exists. I'll rearrange the order, indeed, but a better solution seems to be to always create the file frame_shift along with the features?
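The first and third questions can be sketched together in Python. This is only an illustration of the proposed ordering, not the actual shell scripts: `get_frame_shift`, `utt2dur_from_num_frames`, and the `expensive_fallback` callback are hypothetical names, and the duration is approximated as frames times frame shift, which ignores the window-length overhead and is the source of the roughly 10 ms precision loss mentioned above.

```python
import os


def get_frame_shift(data_dir, expensive_fallback):
    """Return the frame shift in seconds for a data directory.

    Checks the cheap cached 'frame_shift' file first and calls
    expensive_fallback(data_dir) only when that file is missing.
    expensive_fallback is a hypothetical stand-in for whatever
    costly computation would otherwise run.
    """
    path = os.path.join(data_dir, "frame_shift")
    if os.path.isfile(path):
        with open(path) as f:
            return float(f.read().strip())
    return expensive_fallback(data_dir)


def utt2dur_from_num_frames(lines, frame_shift):
    """Convert 'utt-id num-frames' lines to 'utt-id duration' lines.

    Approximation: duration = num_frames * frame_shift, so the result
    is only precise to about one frame shift.
    """
    out = []
    for line in lines:
        utt, n = line.split()
        out.append(f"{utt} {int(n) * frame_shift:.2f}")
    return out


if __name__ == "__main__":
    # Made-up utterances; 0.01 s is the typical 10 ms frame shift.
    for line in utt2dur_from_num_frames(["utt1 98", "utt2 1234"], 0.01):
        print(line)
```

Writing the frame_shift file at feature-extraction time, as the third question suggests, would mean the fallback branch is rarely taken at all.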

Generate utt2num_frames with features #3303