Open MIAOKUI opened 7 years ago
In the end, the error turns out to be:

0 Engines running
Traceback (most recent call last):
  File "./run_qc.py", line 36, in
one of the bcbio-ipengine.err files is:

2017-01-31 16:41:14.120 [IPEngineApp] Using existing profile dir: u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea'
2017-01-31 16:41:14.284 [IPEngineApp] WARNING | url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json' not found
2017-01-31 16:41:14.284 [IPEngineApp] WARNING | Waiting up to 960.0 seconds for it to arrive.
2017-01-31 16:41:18.597 [IPEngineApp] Loading url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json'
2017-01-31 16:41:18.653 [IPEngineApp] Registering with controller at tcp://127.0.0.1:51280
slurmd[cn12346]: JOB 4463925 CANCELLED AT 2017-01-31T16:50:02
Then I submit SLURM_engine229c5bdb-ef4f-432c-84ce-659024d0624b. I got this:

[pp325@ln31%tianhe2-C pipeline56]$ cat bcbio-ipengine.err.%4463992
2017-01-31 16:55:50.670 [IPEngineApp] Using existing profile dir: u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea'
2017-01-31 16:55:50.684 [IPEngineApp] Loading url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json'
2017-01-31 16:55:50.721 [IPEngineApp] Registering with controller at tcp://127.0.0.1:60340
ERROR:tornado.general:Uncaught exception, closing connection.
Traceback (most recent call last):
File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 146, in
2017-01-31 16:55:50.671 [IPEngineApp] Using existing profile dir: u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea'
2017-01-31 16:55:50.684 [IPEngineApp] Loading url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json'
2017-01-31 16:55:50.722 [IPEngineApp] Registering with controller at tcp://127.0.0.1:60340
2017-01-31 17:11:50.826 [IPEngineApp] CRITICAL | Registration timed out after 960.0 seconds
I am having a similar issue on gridengine.
Thanks! Sorry for not responding to this, I totally missed this issue. Could you elaborate on the issue, @ro6ert? What are you seeing?
@MIAOKUI sorry for not seeing this, if you still need help I'm happy to help out.
When I run this command:
bcbio_nextgen.py ../config/sampleconfig.yaml -t ipython -n 5 -s sge -q linux01.q -r conmem=12
I get this error message:
The cluster startup timed out. This could be for a couple of reasons. The
most common reason is that the queue you are submitting jobs to is
oversubscribed. You can check if this is what is happening by trying again,
and watching to see if jobs are in a pending state or a running state when
the startup times out. If they are in the pending state, that means we just
need to wait longer for them to start, which you can specify by passing
the --timeout parameter, in minutes.
The second reason is that there is a problem with the controller and engine
jobs being submitted to the scheduler. In the directory you ran from,
you should see files that are named YourScheduler_enginesABunchOfNumbers and
YourScheduler_controllerABunchOfNumbers. If you submit one of those files
manually to your scheduler (for example, bsub < YourScheduler_controllerABunchOfNumbers),
you will get a more helpful error message that might help you figure out what
is going wrong.
The third reason is that you need to submit your bcbio_nextgen.py job itself as a job;
bcbio-nextgen needs to run on a compute node, not the login node. So the
command you use to run bcbio-nextgen should be submitted as a job to
the scheduler. You can diagnose this because the controller and engine
jobs will be in the running state, but the cluster will still timeout.
Finally, it may be an issue with how the cluster is configured: the controller
and engine jobs are unable to talk to each other. They need to be able to open
ports on the machines each of them is running on in order to work. You can
suspect this as the cause if you have submitted the bcbio-nextgen job to the
scheduler, the bcbio-nextgen main job and the controller and engine jobs are
all in a running state, and the cluster still times out. This is likely
something that you'll have to talk to the administrators of the cluster
you are using about.
If you need help debugging, please post an issue here and we'll try to help you
with the detective work:
https://github.com/roryk/ipython-cluster-helper/issues
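The last scenario above can be probed with a small sketch (the function name and approach here are mine, not part of bcbio-nextgen or ipython-cluster-helper; it only checks that a plain TCP connection can be opened, which is a prerequisite for the ZeroMQ traffic the controller and engines use):

```python
import socket
import threading

def can_open_and_connect(host="127.0.0.1"):
    """Listen on a free port (standing in for the controller) and try
    to connect to it (standing in for an engine). Here both halves run
    on one machine for illustration; on a real cluster you would run
    the listening half on one compute node and connect from another."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.settimeout(5.0)   # don't block forever if nothing connects
    srv.bind(("", 0))     # port 0 lets the OS pick a free port
    port = srv.getsockname()[1]
    srv.listen(1)
    # accept one connection in the background, then close it
    t = threading.Thread(target=lambda: srv.accept()[0].close())
    t.start()
    try:
        with socket.create_connection((host, port), timeout=5.0):
            return True
    except OSError:
        return False
    finally:
        t.join()
        srv.close()
```

If a check like this fails between two compute nodes, the startup timeout is very likely a firewall or network configuration issue to raise with the cluster administrators.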
This sample config file is for a mirnaseq pipeline run. It might be related to a cluster configuration issue; I'm getting the same message when I try to run the rnaseq pipeline.
It doesn't look like the cluster is returning the resources necessary to run the jobs for some reason.
toque03<16:57:10> more log/bcbio-nextgen-debug.log
[2018-03-08T20:12Z] toque03: System YAML configuration: /proj/sadevs/sadev00/exome_pipeline/bcbio/galaxy/bcbio_system.yaml
[2018-03-08T20:12Z] toque03: Resource requests: atropos, picard; memory: 3.50; cores: 1, 1
[2018-03-08T20:12Z] toque03: Configuring 5 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
[2018-03-08T21:12Z] toque03: System YAML configuration: /proj/sadevs/sadev00/exome_pipeline/bcbio/galaxy/bcbio_system.yaml
[2018-03-08T21:12Z] toque03: Resource requests: atropos, picard; memory: 3.50; cores: 1, 1
[2018-03-08T21:12Z] toque03: Configuring 5 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
Gotcha-- if you submit the bcbio job and watch the jobs, are the engine jobs actually running on the cluster or are they pending? If they are pending then adding --timeout 2000
to your bcbio-nextgen command will make it wait for 2000 minutes rather than the default 15.
It says 0 engines running so it looks like it's pending? I'll add a longer timeout with the --timeout flag.
If you look at the scheduler with qstat
it should show you what state all the jobs are in. There will be the main bcbio job, a controller job and a job array of engines. Usually the bcbio job and the controller job will be running and the engines waiting for resources.
qstat -u your-user-name
will show just your jobs
-bash-3.2$ qstat
job-ID  prior  name  user  state  submit/start at  queue  slots  ja-task-ID
You will need to submit your main bcbio job to the scheduler as well. So

bcbio_nextgen.py ../config/sampleconfig.yaml -t ipython -n 5 -s sge -q linux01.q -r conmem=12

should be in an SGE job submission file itself.
Hi roryk,
I am trying to use the scheduler to run our own pipeline. I made a view and a simple test function that runs a shell command via subprocess, like the following:

def run(cmd):
    args = shlex.split(cmd)
    p = subprocess.Popen(args)
    return p

with cluster_view(scheduler='slurm', queue='work', num_jobs=1) as view:
    for cmd in fq_filter:
        view.map(run, [cmd])
Then it turns out that IPClusterStart goes looking for ipcluster_config and stops there, never moving on. The logs are:

[pp325@ln31%tianhe2-C pipeline56]$ ./run_qc.py
[ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipython_config.py'
[ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipython_kernel_config.py'
[ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcontroller_config.py'
[ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipengine_config.py'
[ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcluster_config.py'
2017-01-31 15:45:33.931 [IPClusterStart] IPYTHONDIR set to: /HOME/pp325/.ipython
2017-01-31 15:45:33.933 [IPClusterStart] Using existing profile dir: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea'
2017-01-31 15:45:33.933 [IPClusterStart] Searching path [u'/HOME/pp325/pipeline56', u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea', '/usr/local/etc/ipython', '/etc/ipython'] for config files
2017-01-31 15:45:33.934 [IPClusterStart] Attempting to load config file: ipython_config.py
2017-01-31 15:45:33.934 [IPClusterStart] Looking for ipython_config in /etc/ipython
2017-01-31 15:45:33.934 [IPClusterStart] Looking for ipython_config in /usr/local/etc/ipython
2017-01-31 15:45:33.934 [IPClusterStart] Looking for ipython_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
2017-01-31 15:45:33.935 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipython_config.py
2017-01-31 15:45:33.936 [IPClusterStart] Looking for ipython_config in /HOME/pp325/pipeline56
2017-01-31 15:45:33.937 [IPClusterStart] Attempting to load config file: ipcluster_dd6192fb_9d8d_48a3_b0ae_0f5520750011_config.py
2017-01-31 15:45:33.937 [IPClusterStart] Looking for ipcontroller_config in /etc/ipython
2017-01-31 15:45:33.937 [IPClusterStart] Looking for ipcontroller_config in /usr/local/etc/ipython
2017-01-31 15:45:33.937 [IPClusterStart] Looking for ipcontroller_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
2017-01-31 15:45:33.938 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcontroller_config.py
2017-01-31 15:45:33.939 [IPClusterStart] Looking for ipcontroller_config in /HOME/pp325/pipeline56
2017-01-31 15:45:33.940 [IPClusterStart] Attempting to load config file: ipcluster_dd6192fb_9d8d_48a3_b0ae_0f5520750011_config.py
2017-01-31 15:45:33.940 [IPClusterStart] Looking for ipengine_config in /etc/ipython
2017-01-31 15:45:33.940 [IPClusterStart] Looking for ipengine_config in /usr/local/etc/ipython
2017-01-31 15:45:33.940 [IPClusterStart] Looking for ipengine_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
2017-01-31 15:45:33.941 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipengine_config.py
2017-01-31 15:45:33.941 [IPClusterStart] Looking for ipengine_config in /HOME/pp325/pipeline56
2017-01-31 15:45:33.943 [IPClusterStart] Attempting to load config file: ipcluster_dd6192fb_9d8d_48a3_b0ae_0f5520750011_config.py
2017-01-31 15:45:33.943 [IPClusterStart] Looking for ipcluster_config in /etc/ipython
2017-01-31 15:45:33.943 [IPClusterStart] Looking for ipcluster_config in /usr/local/etc/ipython
2017-01-31 15:45:33.943 [IPClusterStart] Looking for ipcluster_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
2017-01-31 15:45:33.944 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcluster_config.py
2017-01-31 15:45:33.944 [IPClusterStart] Looking for ipcluster_config in /HOME/pp325/pipeline56
Then it stops here and never continues... Can you help me with this?
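One side note on the run() helper in the message above (a hedged aside, not a diagnosis of the startup hang itself): returning the Popen object without waiting on it means each mapped call comes back immediately, and a failing command never raises an error anywhere visible. A minimal variant that blocks until the command finishes and raises on a non-zero exit:

```python
import shlex
import subprocess

def run(cmd):
    # split the command string into an argv list
    args = shlex.split(cmd)
    # check_call waits for the command to finish and raises
    # CalledProcessError on a non-zero exit status, so failures
    # show up in the engine logs instead of being dropped
    return subprocess.check_call(args)
```

subprocess.check_call exists on both Python 2.7 (the version in the logs here) and Python 3, so this drops into the mapped function unchanged.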