srvk / eesen

The official repository of the Eesen project
Apache License 2.0

Cluster runs OK on one node but errors on two or more nodes. #79

Closed zhangjiulong closed 8 years ago

zhangjiulong commented 8 years ago

Hi, I have a Grid Engine cluster with three nodes (node1, node2, node3), each with 3 GPUs. node1 is the master node, a submit node, and an executor node; node2 is a submit node and an executor node; node3 is a submit node only. node3 also runs an NFS service, and all the wav and txt data is on node3. node1 and node2 mount node3's data read-only: they can read the mounted data but cannot write to it.

node1, node2, and node3 all have the same user name, kaldi, with the same password.

Then I built eesen separately on each of node1, node2, and node3, in the directory /home/kaldi/git/eesen. I created a script containing uname -a and submitted it from node3 (the submit-only node) several times; the jobs were distributed to node1 and node2 and ran OK.

Then I created a script containing the command 'touch ok.fst' and submitted it from node3 several times; the jobs were again distributed to node1 and node2 and ran OK.

Then I went to the timit directory and changed the [...] to [...] on all three nodes. On node3 I ran [...], then ran qstat -j 54 and got the following result:

job_number:                 54
exec_file:                  job_scripts/54
submission_time:            Wed Aug  3 10:04:05 2016
owner:                      kaldi
uid:                        1006
group:                      kaldi
gid:                        1006
sge_o_home:                 /home/kaldi
sge_o_log_name:             kaldi
sge_o_path:                 /home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/kaldi/git/eesen/asr_egs/timit/vc1
sge_o_host:                 cluster000
account:                    sge
cwd:                        /home/kaldi/git/eesen/asr_egs/timit/vc1
merge:                      y
hard resource_list:         arch=*64
mail_list:                  kaldi@cluster000.kaldi.pingan
notify:                     FALSE
stdout_path_list:           NONE:NONE:exp/make_fbank/train/q/make_fbank_train.log
jobshare:                   0
hard_queue_list:            all.q
shell_list:                 NONE:/bin/bash
env_list:                   PATH=/home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/home/kaldi/git/eesen/asr_egs/timit/vc1/utils/:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/netbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/featbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/decoderbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../src/fstbin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/openfst/bin:/home/kaldi/git/eesen/asr_egs/timit/vc1/../../../tools/irstlm/bin/:/home/kaldi/git/eesen/asr_egs/timit/vc1:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
script_file:                /home/kaldi/git/eesen/asr_egs/timit/vc1/exp/make_fbank/train/q/
job-array tasks:            1-20:1
error reason    1:          08/03/2016 10:04:20 [1003:10662]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    2:          08/03/2016 10:04:20 [1003:10663]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    3:          08/03/2016 10:04:20 [1003:10674]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    4:          08/03/2016 10:04:20 [1003:10666]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    5:          08/03/2016 10:04:20 [1003:10665]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    6:          08/03/2016 10:04:20 [1003:10669]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    7:          08/03/2016 10:04:20 [1003:10671]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    8:          08/03/2016 10:04:20 [1003:10664]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    9:          08/03/2016 10:04:20 [1003:10667]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   10:          08/03/2016 10:04:20 [1003:10670]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   11:          08/03/2016 10:04:23 [1004:29627]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   12:          08/03/2016 10:04:20 [1003:10679]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   13:          08/03/2016 10:04:23 [1004:29629]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   14:          08/03/2016 10:04:20 [1003:10678]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   15:          08/03/2016 10:04:23 [1004:29630]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   16:          08/03/2016 10:04:20 [1003:10676]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   17:          08/03/2016 10:04:23 [1004:29631]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   18:          08/03/2016 10:04:20 [1003:10680]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   19:          08/03/2016 10:04:23 [1004:29628]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   20:          08/03/2016 10:04:20 [1003:10675]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
scheduling info:            queue instance "all.q@cluster002.kaldi.pingan" dropped because it is disabled
                            Job is in error state

Then I went to node1, ran it again, and got a similar result, except that half of the tasks ran OK:

error reason    1:          08/03/2016 10:04:20 [1003:10662]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    3:          08/03/2016 10:04:20 [1003:10674]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    5:          08/03/2016 10:04:20 [1003:10665]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    7:          08/03/2016 10:04:20 [1003:10671]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason    9:          08/03/2016 10:04:20 [1003:10667]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   11:          08/03/2016 10:04:23 [1004:29627]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   13:          08/03/2016 10:04:23 [1004:29629]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   15:          08/03/2016 10:04:23 [1004:29630]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   17:          08/03/2016 10:04:23 [1004:29631]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
error reason   19:          08/03/2016 10:04:23 [1004:29628]: error: can't open output file "/home/kaldi/git/eesen/asr_egs/timit
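
One way to read an error list like the one above is to group the "error reason" lines by the first number inside the square brackets, which differs between tasks (1003 and 1004 in this output). The helper below is a minimal sketch: the function name count_error_prefixes is hypothetical, it assumes the qstat -j output is supplied on standard input, and it makes no claim about what the bracketed number means.

```shell
# Hypothetical helper: count "error reason" lines per first bracketed number.
# Reads `qstat -j <job>` style text on stdin and prints "<number> <count>" lines.
count_error_prefixes() {
  awk '
    /^error reason/ {
      # Find the first "[digits:digits]" group on the line.
      if (match($0, /\[[0-9]+:[0-9]+\]/)) {
        id = substr($0, RSTART + 1, RLENGTH - 2)   # strip the brackets
        split(id, parts, ":")
        count[parts[1]]++
      }
    }
    END { for (k in count) print k, count[k] }
  ' | sort
}
```

It could be used as `qstat -j 54 | count_error_prefixes` to see whether the failures cluster into distinct groups.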

Then I disabled node2 and ran it on node1 (so the cluster had only one executor node, node1), and eesen ran OK.

What is the problem? Please help me find out what causes it. Thanks.
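
Since the reported error is "can't open output file", one possible check is whether the job's output directory is writable from each executor node. The following is a minimal sketch, not something taken from eesen: the function name probe_writable and the probe file name are hypothetical.

```shell
# Hypothetical probe: try to create and remove a small file in a directory,
# and report the host name and numeric uid the attempt ran under.
probe_writable() {
  dir=$1
  probe="$dir/.write_probe.$$"
  # The subshell keeps a failed redirection from printing to the terminal.
  if ( : > "$probe" ) 2>/dev/null; then
    rm -f "$probe"
    echo "$(uname -n): can write to $dir as uid $(id -u)"
  else
    echo "$(uname -n): CANNOT write to $dir as uid $(id -u)"
    return 1
  fi
}
```

Running `probe_writable exp/make_fbank/train/q` on each node (or from a job scheduled on each node) would show whether the directory named in stdout_path_list is reachable and writable there.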

zhangjiulong commented 8 years ago

Solved as

xiaosdawn commented 5 years ago

Hi, I am facing the same problem while setting up my cluster. Checking with "qstat -j 29" shows the following errors:

script_file: /home/gpu/software/kaldi-20181120/egs/thchs30/s5/exp/make_mfcc/train/q/
job-array tasks: 1-8:1
error reason 4: 01/07/2019 15:19:42 [0:14111]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 5: 01/07/2019 15:19:42 [0:14112]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 6: 01/07/2019 15:19:42 [0:14113]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 7: 01/07/2019 15:19:42 [0:14114]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
error reason 8: 01/07/2019 15:19:42 [0:14115]: error: can't open output file "/home/gpu/software/kaldi-20181120/egs/
scheduling info: Job is in error state

I checked the output file and it exists. It looks like a permission issue, but after using "chown -R" to change the ownership I still get the same error. Could you please guide me on how to solve the problem? Thanks.
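
When chown does not seem to help on a shared directory, it can be worth comparing numeric IDs rather than user names, because many NFS setups decide access by numeric uid and gid. The sketch below is an assumption-laden illustration, not a confirmed diagnosis of this issue: the function name owner_uid_matches is hypothetical.

```shell
# Hypothetical check: does the numeric owner uid of a path equal the
# numeric uid of the current user on this host?
owner_uid_matches() {
  path=$1
  # ls -n prints numeric uid/gid; the owner uid is the third field.
  owner=$(ls -ldn "$path" 2>/dev/null | awk '{ print $3 }')
  me=$(id -u)
  echo "$(uname -n): $path owner uid=$owner, current uid=$me"
  [ "$owner" = "$me" ]
}
```

Running it against the output directory on every node shows whether the same user name maps to the same number everywhere; a non-zero exit status means the two uids differ on that host or the path could not be read.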

zhangjiulong commented 5 years ago

try root

xiaosdawn commented 5 years ago

I solved it by adding "no_root_squash" to the /etc/exports file: /home/gsadmin *(rw,sync,no_subtree_check,no_root_squash)

Thank you all.
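
For anyone checking their own server against the fix above, the following minimal sketch lists the lines of an exports-style file whose option list contains no_root_squash. The function name exports_with_no_root_squash is hypothetical, and the file path defaults to /etc/exports only as an assumption. Note that no_root_squash lets root on a client act as root on the exported files, so it is a trade-off rather than a general recommendation.

```shell
# Hypothetical helper: print non-comment lines of an exports-style file
# whose parenthesised option list includes no_root_squash.
exports_with_no_root_squash() {
  file=${1:-/etc/exports}
  grep -v '^[[:space:]]*#' "$file" |
    grep -E '\(([^)]*,)?no_root_squash(,[^)]*)?\)'
}
```

After editing /etc/exports on the NFS server, the change normally takes effect once the exports are reloaded, for example with `exportfs -ra`.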