srvk / eesen

The official repository of the Eesen project
http://arxiv.org/abs/1507.08240
Apache License 2.0

eesen runs on gridengine, qstat shows Eqw #78

Closed zhangjiulong closed 8 years ago

zhangjiulong commented 8 years ago

Hi, I have a gridengine cluster with two nodes (node1, node2), each with 3 GPUs. I also have another node (node3) that serves as the NFS server, and all the wav data is on node3. I mount the data dir on node1 and node2 (node1 and node2 can only read the mounted dir, not write to it).

I built eesen separately on node1 and node2, and modified cmd.sh to the queue format as follows:

export train_cmd="queue.pl -q all.q -l arch=*64"
export decode_cmd="queue.pl -q all.q -l arch=*64,mem_free=2G,ram_free=2G"
export mkgraph_cmd="queue.pl -q all.q -l arch=*64,ram_free=4G,mem_free=4G"
export big_memory_cmd="queue.pl -q all.q -l arch=*64,ram_free=8G,mem_free=8G"
export cuda_cmd="queue.pl -q all.q -l gpu=1"

But when I ran ./run_ctc_phn.sh on node1, the output stopped at the make fbank step as follows:

steps/make_fbank.sh --cmd queue.pl -q all.q -l arch=*64 --nj 20 data/train exp/make_fbank/train fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_fbank.sh [info]: segments file exists: using that.

But no GPU or CPU is busy, and when I ran qstat it showed that some error had happened, like this:

28 0.50000 make_fbank kaldi        Eqw   07/29/2016 15:57:54                                    1 1-5:1,7-19:2

But no error log is found. Please give me some suggestions, thanks very much.

yajiemiao commented 8 years ago

Please check if you have log files exp/make_fbank/train/*.log, and if you see error messages in them.

zhangjiulong commented 8 years ago

Hi @yajiemiao, there is no error log in the dir, but after I ran qhost -q I found that the all.q status is like this:

all.q                BIP   0/0/48

Does this mean that there are no resources to run the submitted jobs? Thanks.

naxingyu commented 8 years ago

The jobs are in error state, not assigned to any node yet.

Check qstat -j 28 to see the detailed error.
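As a rough sketch of that workflow, the helper below picks out the IDs of jobs whose state contains `E` (such as `Eqw`) from `qstat`-style output. The function name `list_error_jobs` is made up for illustration, and it assumes the state is the fifth whitespace-separated column, as in the `qstat` line quoted earlier in this thread.

```shell
# Hypothetical helper: print the IDs of jobs whose state column contains "E"
# (for example "Eqw"). Reads qstat-style text on stdin and assumes the state
# is the 5th whitespace-separated column, as in the line quoted in this thread.
list_error_jobs() {
  awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ { print $1 }' | sort -un
}

# On a live cluster this could be used roughly as follows (left as comments
# because these commands need a running gridengine installation):
#   qstat | list_error_jobs    # find job IDs in an error state
#   qstat -j 28                # read the "error reason" lines for job 28
#   qmod -cj 28                # clear the error state once the cause is fixed
```

Clearing the error state only makes sense after the underlying cause has been fixed; otherwise the tasks are likely to fall back into `Eqw`.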

zhangjiulong commented 8 years ago

Hi @naxingyu, I found something useful. The error log is like this:

job-array tasks:            1-20:1
error reason    1:          07/29/2016 15:58:05 [1003:17512]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    2:          07/29/2016 15:58:05 [1003:17515]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    3:          07/29/2016 15:58:05 [1003:17528]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    4:          07/29/2016 15:58:05 [1003:17530]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    5:          07/29/2016 15:58:05 [1003:17519]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    7:          07/29/2016 15:58:05 [1003:17526]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    9:          07/29/2016 15:58:05 [1003:17521]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   11:          07/29/2016 15:58:05 [1003:17532]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   13:          07/29/2016 15:58:05 [1003:17531]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   15:          07/29/2016 15:58:05 [1003:17522]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   17:          07/29/2016 15:58:05 [1003:17529]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   19:          07/29/2016 15:58:05 [1003:17527]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit

There is a quote before the path, but I don't know where this quote comes from. My cmd.sh is like this:

export train_cmd="queue.pl -cwd -q all.q -l arch=*64"
export decode_cmd="queue.pl -cwd -q all.q -l arch=*64,mem_free=2G,ram_free=2G"
export mkgraph_cmd="queue.pl -cwd -q all.q -l arch=*64,ram_free=4G,mem_free=4G"
export big_memory_cmd="queue.pl -cwd -q all.q -l arch=*64,ram_free=8G,mem_free=8G"
export cuda_cmd="queue.pl -cwd -q all.q -l gpu=1"

When I changed the cmd to run.pl, the program ran OK. Thanks.

naxingyu commented 8 years ago

Check your submitted shell script in the "q" dir.

zhangjiulong commented 8 years ago

Hi @naxingyu, the submitted shell script is like this. Please help me check it, thanks.

#!/bin/bash
cd /home/kaldi/git/eesen/asr_egs/timit/vc1
. ./path.sh
( echo '#' Running on `hostname`
  echo '#' Started at `date`
  echo -n '# '; cat <<EOF
extract-segments scp,p:data/train/wav.scp exp/make_fbank/train/segments.${SGE_TASK_ID} ark:- | compute-fbank-feats --verbose=2 --config=conf/fbank.conf ark:- ark:- | copy-feats --compress=true ark:- ark,scp:/home/kaldi/git/eesen/asr_egs/timit/vc1/fbank/raw_fbank_train.${SGE_TASK_ID}.ark,/home/kaldi/git/eesen/asr_egs/timit/vc1/fbank/raw_fbank_train.${SGE_TASK_ID}.scp
EOF
) >exp/make_fbank/train/make_fbank_train.$SGE_TASK_ID.log
time1=`date +"%s"`
 ( extract-segments scp,p:data/train/wav.scp exp/make_fbank/train/segments.${SGE_TASK_ID} ark:- | compute-fbank-feats --verbose=2 --config=conf/fbank.conf ark:- ark:- | copy-feats --compress=true ark:- ark,scp:/home/kaldi/git/eesen/asr_egs/timit/vc1/fbank/raw_fbank_train.${SGE_TASK_ID}.ark,/home/kaldi/git/eesen/asr_egs/timit/vc1/fbank/raw_fbank_train.${SGE_TASK_ID}.scp  ) 2>>exp/make_fbank/train/make_fbank_train.$SGE_TASK_ID.log >>exp/make_fbank/train/make_fbank_train.$SGE_TASK_ID.log
ret=$?
time2=`date +"%s"`
echo '#' Accounting: time=$(($time2-$time1)) threads=1 >>exp/make_fbank/train/make_fbank_train.$SGE_TASK_ID.log
echo '#' Finished at `date` with status $ret >>exp/make_fbank/train/make_fbank_train.$SGE_TASK_ID.log
[ $ret -eq 137 ] && exit 100;
touch exp/make_fbank/train/q/done.30991.$SGE_TASK_ID
exit $[$ret ? 1 : 0]
## submitted with:
# qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* -o exp/make_fbank/train/q/make_fbank_train.log -q all.q -l arch=*64   -t 1:20 /home/kaldi/git/eesen/asr_egs/timit/vc1/exp/make_fbank/train/q/make_fbank_train.sh >>exp/make_fbank/train/q/make_fbank_train.log 2>&1

naxingyu commented 8 years ago

This seems to be an NFS problem. Is the master on node1 or node2? The master and execute nodes should have a shared file system.

zhangjiulong commented 8 years ago

node1 is both the master and an execute node, and node2 is an execute node. The NFS server is another node, called node3, and all my data is on node3. I think that if it were an NFS problem, run.pl should fail too. I'm not sure whether that is right or not, but run.pl ran OK.

zhangjiulong commented 8 years ago

All my data on node3 is mounted on node1 and node2; eesen is built on node1 and node2 separately.

naxingyu commented 8 years ago

NFS is not just for accessing the raw data. It's set up for sharing the experiment files among execute nodes. Different nodes should have the same write permission to a shared exp dir. And that's not true in your cluster. Have a look at http://kaldi-asr.org/doc/queue.html. run.pl is for a local setup.
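One way to check this condition is to probe, on each execute node, whether the experiment directory exists and accepts new files. The sketch below is only an illustration: the function name `can_write_dir` is made up, and the commented loop assumes passwordless ssh to hosts named `node1` and `node2` and the working directory shown in the submitted script above.

```shell
# Hypothetical helper: succeed only if the directory exists and a file can
# be created in it by the current user. The probe file is removed afterwards.
can_write_dir() {
  dir=$1
  [ -d "$dir" ] || return 1
  probe="$dir/.write_probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}

# Possible use across the cluster (assumes passwordless ssh; hostnames and
# path are taken from this thread and may differ on other setups):
#   for host in node1 node2; do
#     ssh "$host" "touch /home/kaldi/git/eesen/asr_egs/timit/vc1/exp/.probe.$host" \
#       && echo "$host: writable" || echo "$host: NOT writable"
#   done
#   ls -a /home/kaldi/git/eesen/asr_egs/timit/vc1/exp   # both probe files should be visible from every node
```

If a probe file written from one node does not show up on the other, the directory is local to each machine rather than shared, which matches the situation described in this thread.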

zhangjiulong commented 8 years ago

Hi @naxingyu, do you mean that the exp dir, fbank dir and data dir created by eesen should be shared with the same permissions, that is, the three dirs should be on the NFS?

zhangjiulong commented 8 years ago

Thanks @naxingyu. As you said, it is an NFS problem.

raghav-menon commented 6 years ago

Hi, I am facing the same problem while setting up the cluster. Could you please guide me on how you solved the problem? This was the error I received:

error reason 2: 04/26/2018 11:19:28 [1002:22648]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 4: 04/26/2018 11:19:28 [1002:22651]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 6: 04/26/2018 11:19:28 [1002:22652]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 8: 04/26/2018 11:19:28 [1002:22653]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 9: 04/26/2018 11:19:43 [1002:22674]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 10: 04/26/2018 11:19:43 [1002:22677]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 11: 04/26/2018 11:19:43 [1002:22678]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 12: 04/26/2018 11:19:43 [1002:22679]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 14: 04/26/2018 11:19:58 [1002:22701]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 16: 04/26/2018 11:19:58 [1002:22702]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 18: 04/26/2018 11:19:58 [1002:22703]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5
error reason 20: 04/26/2018 11:19:58 [1002:22704]: error: can't chdir to /home/astik/kaldi-trunk/kaldi/egs/swahili/s5

Thanks.

Regards, Raghav

naxingyu commented 6 years ago

The log says the execute node cannot access the folder. It looks like a permission issue.

Best, Xingyu Na

xiaosdawn commented 5 years ago

Hi, I am facing the same problem while setting up the cluster. Checking the error with "qstat -j 29" shows the following:

script_file: /home/gpu/software/kaldi-20181120/egs/thchs30/s5/exp/make_mfcc/train/q/make_mfcc_train.sh
job-array tasks: 1-8:1
error reason 4: 01/07/2019 15:19:42 [0:14111]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 5: 01/07/2019 15:19:42 [0:14112]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 6: 01/07/2019 15:19:42 [0:14113]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 7: 01/07/2019 15:19:42 [0:14114]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 8: 01/07/2019 15:19:42 [0:14115]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
scheduling info: Job is in error state

I checked the output file and it exists. It looks like a permission issue, as @naxingyu said, but I don't know how to fix it. Could anyone please guide me on how to solve the problem? Thanks.