kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Bug in egs/chime5/s5b/run.sh stage 4 #3448

Closed RedmondY closed 5 years ago

RedmondY commented 5 years ago

When I run stage 4, I hit the following problem:

local/prepare_data.sh: Converting transcription to text
local/prepare_data.sh: Creating datadir data/dev_beamformit_dereverb_ref for type="ref"
utils/validate_data_dir.sh: Error: in data/dev_beamformit_dereverb_ref, recording-ids extracted from segments and wav.scp
utils/validate_data_dir.sh: differ, partial diff is:
--- /tmp/kaldi.0Ptj/recordings  2019-07-06 14:11:47.115478422 +0100
+++ /tmp/kaldi.0Ptj/recordings.wav      2019-07-06 14:11:47.118478434 +0100
@@ -2,3 +2,2 @@
 S02_U03.ENH
-S02_U05.ENH
[Lengths are /tmp/kaldi.0Ptj/recordings=6 versus /tmp/kaldi.0Ptj/recordings.wav=5]

Perhaps the deletion of U05 caused the problem? (issue)
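For readers hitting the same error: the check that fails here compares the recording-ids named in `segments` with those defined in `wav.scp`. The following is a minimal Python sketch of that comparison, not the logic of `validate_data_dir.sh` itself. It assumes the usual Kaldi layouts (`segments`: utterance-id, recording-id, start, end; `wav.scp`: recording-id, then a path or command). The function names and the sample path are hypothetical.

```python
def column_ids(lines, column=0):
    """Collect the whitespace-separated field at `column` from non-empty lines."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) > column:
            ids.add(fields[column])
    return ids


def recording_id_diff(segments_lines, wav_scp_lines):
    """Return (only_in_segments, only_in_wav_scp) as sorted lists."""
    # Assumed segments layout: <utt-id> <recording-id> <start> <end>
    from_segments = column_ids(segments_lines, column=1)
    # Assumed wav.scp layout: <recording-id> <path-or-command>
    from_wav_scp = column_ids(wav_scp_lines, column=0)
    return (sorted(from_segments - from_wav_scp),
            sorted(from_wav_scp - from_segments))


# Toy input mirroring the diff above; the wav path is made up for illustration.
segments = [
    "utt1 S02_U03.ENH 0.00 1.50",
    "utt2 S02_U05.ENH 2.00 3.25",
]
wav_scp = ["S02_U03.ENH /hypothetical/path/S02_U03.wav"]

missing_audio, unused_audio = recording_id_diff(segments, wav_scp)
print("in segments but not in wav.scp:", missing_audio)
print("in wav.scp but not in segments:", unused_audio)
```

A recording-id that appears only on the `segments` side, like `S02_U05.ENH` in the log, means some utterances point at audio that `wav.scp` does not define.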

danpovey commented 5 years ago

@vimal @sw005320 @siddalmia any idea how to fix this?

RedmondY commented 5 years ago

@danpovey I found that U05 is not the cause. After I added a call to 'utils/data/fix_data_dir.sh $data' in egs/chime5/s5b/utils/validate_data_dir.sh, the issue no longer occurs.

```sh
# Added line: repair the data dir before the sort checks run.
utils/data/fix_data_dir.sh $data
check_sorted_and_uniq $data/utt2spk

if ! $no_spk_sort; then
  ! cat $data/utt2spk | sort -k2 | cmp -s - $data/utt2spk && \
    echo "$0: utt2spk is not in sorted order when sorted first on speaker-id " && \
    echo "(fix this by making speaker-ids prefixes of utt-ids)" && exit 1;
fi
```

danpovey commented 5 years ago

May not have been a bug / not enough details to tell, anyway.

shf2020 commented 3 years ago

I have a similar bug. Can you help me?

local/timit_data_prep.sh: TIMIT data preparation succeeded
steps/make_mfcc.sh --cmd run.pl --nj 8 data/train exp/make_mfcc/train mfcc
steps/make_mfcc.sh: moving data/train/feats.scp to data/train/.backup
fix_data_dir.sh: no utterances remained: not proceeding further.
utils/validate_data_dir.sh: Error: in data/train, utterance-ids extracted from utt2spk and utt2dur file
utils/validate_data_dir.sh: differ, partial diff is:
--- /tmp/kaldi.ljtb/utts 2021-10-25 09:36:31.713458090 +0800
+++ /tmp/kaldi.ljtb/utts.utt2dur 2021-10-25 09:36:31.845456523 +0800
@@ -1,4620 +1,5607 @@
-SP0001W00
-SP0001W01
-SP0001W02
...
+SP0462W04-0000-0246
+SP0462W05-0000-0256
+SP0462W06-0000-0391
+SP0462W07-0000-0374
+SP0462W08-0013-0234
+SP0462W09-0000-0314
[Lengths are /tmp/kaldi.ljtb/utts=4620 versus /tmp/kaldi.ljtb/utts.utt2dur=5607]

ssccutyy commented 2 years ago

I have a similar bug. The main problem is that utt2spk and utt2dur do not match: utts=4620 versus utt2dur=5607. Check these files.
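One way to check these files is to compare the first column (the utterance-id) of `utt2spk` and `utt2dur` directly. The sketch below is a rough Python illustration, assuming both files are whitespace-separated text with the utterance-id first. `read_ids` and `compare_utt_ids` are hypothetical helper names, not Kaldi utilities.

```python
from pathlib import Path


def read_ids(path):
    """Return the set of first-column IDs in a Kaldi-style text file."""
    ids = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if fields:
                ids.add(fields[0])
    return ids


def compare_utt_ids(data_dir, reference="utt2spk", other="utt2dur"):
    """Summarize how the utterance-ids of two files in data_dir differ."""
    ref_ids = read_ids(Path(data_dir) / reference)
    other_ids = read_ids(Path(data_dir) / other)
    return {
        "reference_count": len(ref_ids),
        "other_count": len(other_ids),
        "only_in_reference": sorted(ref_ids - other_ids),
        "only_in_other": sorted(other_ids - ref_ids),
    }
```

Calling `compare_utt_ids("data/train")` would return the two counts plus the IDs unique to each file. If `only_in_other` holds IDs in a different naming scheme, as in the log above, that points to a `utt2dur` left over from an earlier preparation run.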

mukeshbadgujar commented 2 years ago

@shf2020 Did you solve your problem? Please let me know what solution you used, if any. I am also seeing a similar error.

desh2608 commented 2 years ago

These look like issues arising from incorrect data preparation (i.e., not bugs). It is likely that you ran some stages multiple times. Try starting from a clean slate and making sure the data is prepared correctly before you move to the feature generation stage. Remove all data dirs prepared previously, run the timit_data_prep.sh script, and then run utils/data/validate_data_dir.sh to make sure the data dir is correct. If it is not, please show the command line output here.
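Before deleting anything, it can help to see which generated files are already sitting in a data dir. The sketch below is a dry-run Python helper that only lists them. The names in `DERIVED_NAMES` are an assumed, non-exhaustive set of entries that Kaldi feature scripts commonly write, and `list_derived_files` is a hypothetical helper, not part of Kaldi.

```python
from pathlib import Path

# Assumption: a non-exhaustive set of entries written by feature extraction
# and related scripts rather than by the initial data preparation step.
DERIVED_NAMES = ("feats.scp", "cmvn.scp", "utt2dur", "utt2num_frames", ".backup")


def list_derived_files(data_dir, names=DERIVED_NAMES):
    """Return existing derived entries in data_dir, without deleting anything."""
    root = Path(data_dir)
    return [root / name for name in names if (root / name).exists()]
```

For example, `list_derived_files("data/train")` returns the paths that exist. Anything it reports before data preparation has been rerun is a candidate for leftover state from an earlier run.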

mukeshbadgujar commented 2 years ago


Thanks for the reply. I checked and found that I was using an old data/train_sp folder containing stale files, so I deleted it; it is recreated when the scripts run again.

Now I am stuck on a new problem. I am trying to solve it; if I can't, I will let you know 🙂.

Thanks for helping.

ssccutyy commented 2 years ago

Maybe you should check your utt2spk file and dataset directory. It seems that the file names in your utt2spk do not match the file names in your dataset directory.
