phonexiaresearch / VBx-training-recipe

Blocking on "Allocating training subset examples" step for too long #7

Closed lhanzl closed 2 years ago

lhanzl commented 3 years ago

I am trying to reproduce the results of the experiment. The main script runs up to stage 4:

if [ ${stage} -le 4 ]; then
  echo "$0: Getting neural network training egs";
  local/nnet3/xvector/get_egs_but.sh --cmd "$train_cmd" \
    --nj 16 \
    --stage 0 \
    --frames-per-chunk 400 \
    --not-used-frames-percentage 40 \
    --num-archives 1000 \
    --num-diagnostic-archives 1 \
    --num-repeats 10 \
    data/${name}_with_aug_no_sil exp/egs
fi  

and then hangs for a long time at the following step in get_egs_but.sh:

  echo "$0: Allocating training subset examples"
  ${cmd} ${dir}/log/allocate_examples_train_subset.log \
    sid/nnet3/xvector/allocate_egs_but.py \
      --prefix train_subset \
      --num-repeats=3 \
      --frames-per-chunk=${frames_per_chunk} \
      --num-pdfs=${num_pdfs} --num-jobs=1 \
      --num-archives=${num_diagnostic_archives} \
      --utt2len-filename=${dir}/temp/utt2num_frames.train_subset \
      --utt2int-filename=${dir}/temp/utt2int.train_subset --egs-dir=${dir}  || exit 1

Can anyone tell me what the script is doing and how I can solve this problem? Thank you very much.

Jamiroquai88 commented 3 years ago

Hi, sorry for the slow response. Does it hang on the training examples or on the validation examples?

Jamiroquai88 commented 2 years ago

Closed due to inactivity.

lawlict commented 1 year ago

Hi @Jamiroquai88, I have run into the same problem. It has been hanging on the training examples for a whole day, and the log file exp/egs/log/allocate_examples_train_subset.log looks like this:

sid/nnet3/xvector/allocate_egs_but.py --prefix train_subset --num-repeats=3 --frames-per-chunk=400 --num-pdfs=7323 --num-jobs=1 --num-archives=1 --utt2len-filename=exp/egs/temp/utt2num_frames.train_subset --utt2int-filename=exp/egs/temp/utt2int.train_subset --egs-dir=exp/egs
Starting get_utt2len
Starting get_labels
Processing archive 1

Looking forward to your help. Thanks!

Jamiroquai88 commented 1 year ago

Hi, if this only hangs on the train_subset and not on the main training part, I would skip this step. It has been some time since this was implemented and I am not sure what is causing this issue.
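
Since the cause is not known, one cheap check before skipping the step is to look at the utterance lengths that go into it. The sketch below counts how many utterances in a Kaldi-style "utt-id num-frames" file have at least a given number of frames. That the hang is related to utterance length versus --frames-per-chunk is only a guess and is not confirmed anywhere in this thread; the function name count_long_utts is made up for illustration.

```shell
#!/usr/bin/env bash
# Diagnostic sketch only. The function name is hypothetical, and the idea that
# short utterances are related to the hang is an unverified assumption.

# count_long_utts FILE MIN_FRAMES
# Reads Kaldi-style "utt-id num-frames" lines from FILE and prints
# "<usable> <total>", where <usable> is the number of utterances with at
# least MIN_FRAMES frames and <total> is the number of utterances read.
count_long_utts() {
  local file="$1" min_frames="$2"
  awk -v min="${min_frames}" '
    NF >= 2 { total++; if ($2 + 0 >= min) usable++ }
    END { printf "%d %d\n", usable, total }
  ' "${file}"
}
```

With the paths and chunk size from the log above, it could be called as: count_long_utts exp/egs/temp/utt2num_frames.train_subset 400. A very small first number would at least be worth reporting back here.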

Jamiroquai88 commented 1 year ago

More detailed steps from @MichalKlco: the training script doesn't use train_subset. In the local/nnet3/xvector/get_egs_but.sh script, you have to comment out all the parts related to train_subset in each stage; otherwise it will fail later on (stages 2-5). Stage 5 is a little bit tricky: lines 217-222 seem to be clearing some garbage (they should be commented out if you skip valid/train_subset), and lines 224-227 should be commented out, too.
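
As an alternative to commenting lines out by hand, the train_subset parts could be wrapped in a guard so they are easy to switch back on. This is a minimal sketch of that pattern, not code from the recipe: the variable skip_train_subset and the functions run_unless_skipped and allocate_train_subset are hypothetical names, and get_egs_but.sh has no such option as far as this thread shows.

```shell
#!/usr/bin/env bash
# Sketch of a skip guard. All names below are hypothetical and do not
# exist in get_egs_but.sh.

skip_train_subset=true

# Stand-in for any step that only concerns train_subset.
allocate_train_subset() {
  echo "allocating train_subset examples"
}

# Runs the given command unless skip_train_subset is set to "true",
# in which case it only reports what was skipped.
run_unless_skipped() {
  if [ "${skip_train_subset}" = true ]; then
    echo "skipping: $*"
  else
    "$@"
  fi
}

run_unless_skipped allocate_train_subset
```

In the real script, each train_subset-related command in stages 2-5 would be the thing passed to the guard; which lines those are has to be checked against your copy of the script.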

lawlict commented 1 year ago

I get it. Thanks for your response!