roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.
148 stars · 23 forks

"Looking for ipcluster_config" never continue #46

Open MIAOKUI opened 7 years ago

MIAOKUI commented 7 years ago

Hi roryk,

I'm trying to use the scheduler to run our own pipeline. I create a view and a simple test function that runs a shell command via subprocess, like the following:

    import shlex
    import subprocess
    from cluster_helper.cluster import cluster_view

    def run(cmd):
        args = shlex.split(cmd)
        p = subprocess.Popen(args)
        return p

    test_view = cluster_view(scheduler='slurm', queue='work', num_jobs=1)

    with test_view as view:
        for cmd in fq_filter:
            view.map(run, [cmd])
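For reference, the `shlex`/`subprocess` part of the pattern can be checked on its own, independent of the cluster (note the module name is `shlex`). A standalone sketch:

```python
import shlex
import subprocess

def run(cmd):
    # shlex.split tokenizes a command string the way a POSIX shell would
    args = shlex.split(cmd)
    # capture stdout so the result can be inspected
    p = subprocess.Popen(args, stdout=subprocess.PIPE)
    out, _ = p.communicate()
    return p.returncode, out

rc, out = run("echo hello world")
print(rc, out.decode().strip())  # -> 0 hello world
```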

It turns out that IPClusterStart gets stuck "Looking for ipcluster_config" and never moves on. The log follows:

    [pp325@ln31%tianhe2-C pipeline56]$ ./run_qc.py
    [ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipython_config.py'
    [ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipython_kernel_config.py'
    [ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcontroller_config.py'
    [ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipengine_config.py'
    [ProfileCreate] Generating default config file: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcluster_config.py'
    2017-01-31 15:45:33.931 [IPClusterStart] IPYTHONDIR set to: /HOME/pp325/.ipython
    2017-01-31 15:45:33.933 [IPClusterStart] Using existing profile dir: u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea'
    2017-01-31 15:45:33.933 [IPClusterStart] Searching path [u'/HOME/pp325/pipeline56', u'/HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea', '/usr/local/etc/ipython', '/etc/ipython'] for config files
    2017-01-31 15:45:33.934 [IPClusterStart] Attempting to load config file: ipython_config.py
    2017-01-31 15:45:33.934 [IPClusterStart] Looking for ipython_config in /etc/ipython
    2017-01-31 15:45:33.934 [IPClusterStart] Looking for ipython_config in /usr/local/etc/ipython
    2017-01-31 15:45:33.934 [IPClusterStart] Looking for ipython_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
    2017-01-31 15:45:33.935 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipython_config.py
    2017-01-31 15:45:33.936 [IPClusterStart] Looking for ipython_config in /HOME/pp325/pipeline56
    2017-01-31 15:45:33.937 [IPClusterStart] Attempting to load config file: ipcluster_dd6192fb_9d8d_48a3_b0ae_0f5520750011_config.py
    2017-01-31 15:45:33.937 [IPClusterStart] Looking for ipcontroller_config in /etc/ipython
    2017-01-31 15:45:33.937 [IPClusterStart] Looking for ipcontroller_config in /usr/local/etc/ipython
    2017-01-31 15:45:33.937 [IPClusterStart] Looking for ipcontroller_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
    2017-01-31 15:45:33.938 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcontroller_config.py
    2017-01-31 15:45:33.939 [IPClusterStart] Looking for ipcontroller_config in /HOME/pp325/pipeline56
    2017-01-31 15:45:33.940 [IPClusterStart] Attempting to load config file: ipcluster_dd6192fb_9d8d_48a3_b0ae_0f5520750011_config.py
    2017-01-31 15:45:33.940 [IPClusterStart] Looking for ipengine_config in /etc/ipython
    2017-01-31 15:45:33.940 [IPClusterStart] Looking for ipengine_config in /usr/local/etc/ipython
    2017-01-31 15:45:33.940 [IPClusterStart] Looking for ipengine_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
    2017-01-31 15:45:33.941 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipengine_config.py
    2017-01-31 15:45:33.941 [IPClusterStart] Looking for ipengine_config in /HOME/pp325/pipeline56
    2017-01-31 15:45:33.943 [IPClusterStart] Attempting to load config file: ipcluster_dd6192fb_9d8d_48a3_b0ae_0f5520750011_config.py
    2017-01-31 15:45:33.943 [IPClusterStart] Looking for ipcluster_config in /etc/ipython
    2017-01-31 15:45:33.943 [IPClusterStart] Looking for ipcluster_config in /usr/local/etc/ipython
    2017-01-31 15:45:33.943 [IPClusterStart] Looking for ipcluster_config in /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea
    2017-01-31 15:45:33.944 [IPClusterStart] Loaded config file: /HOME/pp325/.ipython/profile_3d12076e-e789-11e6-bc0c-74a4b50003ea/ipcluster_config.py
    2017-01-31 15:45:33.944 [IPClusterStart] Looking for ipcluster_config in /HOME/pp325/pipeline56

Then it stops here and never continues. Can you help me with this?

MIAOKUI commented 7 years ago

In the end, the error turns out to be:

    0 Engines running
    Traceback (most recent call last):
      File "./run_qc.py", line 36, in <module>
        for cmd in fq_filter:
      File "/HOME/pp325/software/anaconda2/lib/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/cluster_helper/cluster.py", line 1069, in cluster_view
        wait_for_all_engines=wait_for_all_engines)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/cluster_helper/cluster.py", line 1024, in __init__
        """)
    IOError:
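Incidentally, the shape of that traceback is standard contextlib behavior: cluster_view is a generator-based context manager, so an exception raised before its first yield (here, when no engines register in time) escapes from __enter__ and surfaces at the with statement. A minimal illustration (the function name and message below are made up, not cluster_helper's):

```python
from contextlib import contextmanager

@contextmanager
def fake_cluster_view():
    # Simulate a failure before the first yield: the exception escapes
    # from __enter__, so callers see it at the `with` line.
    raise IOError("0 engines running")
    yield  # never reached, but makes this function a generator

try:
    with fake_cluster_view():
        print("inside the view")  # never runs
except IOError as exc:
    print("startup failed:", exc)  # -> startup failed: 0 engines running
```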

MIAOKUI commented 7 years ago

One of the bcbio-ipengine.err files contains:

    2017-01-31 16:41:14.120 [IPEngineApp] Using existing profile dir: u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea'
    2017-01-31 16:41:14.284 [IPEngineApp] WARNING | url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json' not found
    2017-01-31 16:41:14.284 [IPEngineApp] WARNING | Waiting up to 960.0 seconds for it to arrive.
    2017-01-31 16:41:18.597 [IPEngineApp] Loading url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json'
    2017-01-31 16:41:18.653 [IPEngineApp] Registering with controller at tcp://127.0.0.1:51280
    slurmd[cn12346]: JOB 4463925 CANCELLED AT 2017-01-31T16:50:02

MIAOKUI commented 7 years ago

Then I submitted SLURM_engine229c5bdb-ef4f-432c-84ce-659024d0624b manually and got this:

    [pp325@ln31%tianhe2-C pipeline56]$ cat bcbio-ipengine.err.%4463992
    2017-01-31 16:55:50.670 [IPEngineApp] Using existing profile dir: u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea'
    2017-01-31 16:55:50.684 [IPEngineApp] Loading url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json'
    2017-01-31 16:55:50.721 [IPEngineApp] Registering with controller at tcp://127.0.0.1:60340
    ERROR:tornado.general:Uncaught exception, closing connection.
    Traceback (most recent call last):
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
        callback(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 146, in <lambda>
        self.registrar.on_recv(lambda msg: self.complete_registration(msg, connect, maybe_tunnel))
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 170, in complete_registration
        if content['status'] == 'ok':
    KeyError: 'status'
    ERROR:tornado.general:Uncaught exception, closing connection.
    Traceback (most recent call last):
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
        self._handle_recv()
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
        self._run_callback(callback, msg)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
        callback(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 146, in <lambda>
        self.registrar.on_recv(lambda msg: self.complete_registration(msg, connect, maybe_tunnel))
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 170, in complete_registration
        if content['status'] == 'ok':
    KeyError: 'status'
    ERROR:tornado.application:Exception in callback None
    Traceback (most recent call last):
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/tornado/ioloop.py", line 887, in start
        handler_func(fd_obj, events)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
        self._handle_recv()
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
        self._run_callback(callback, msg)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
        callback(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/tornado/stack_context.py", line 275, in null_wrapper
        return fn(*args, **kwargs)
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 146, in <lambda>
        self.registrar.on_recv(lambda msg: self.complete_registration(msg, connect, maybe_tunnel))
      File "/HOME/pp325/software/anaconda2/lib/python2.7/site-packages/ipyparallel/engine/engine.py", line 170, in complete_registration
        if content['status'] == 'ok':
    KeyError: 'status'

MIAOKUI commented 7 years ago

    2017-01-31 16:55:50.671 [IPEngineApp] Using existing profile dir: u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea'
    2017-01-31 16:55:50.684 [IPEngineApp] Loading url_file u'/HOME/pp325/.ipython/profile_928c8988-e790-11e6-8150-74a4b50003ea/security/ipcontroller-a3fcb3e3-d3ae-46fa-b766-c18504a04ba6-engine.json'
    2017-01-31 16:55:50.722 [IPEngineApp] Registering with controller at tcp://127.0.0.1:60340
    2017-01-31 17:11:50.826 [IPEngineApp] CRITICAL | Registration timed out after 960.0 seconds

ro6ert commented 6 years ago

I am having a similar issue on gridengine.

roryk commented 6 years ago

Thanks! Sorry for not responding to this, I totally missed this issue. Could you elaborate on the issue @ro6ert, what are you seeing?

@MIAOKUI sorry for not seeing this, if you still need help I'm happy to help out.

ro6ert commented 6 years ago

When I run this command:

bcbio_nextgen.py ../config/sampleconfig.yaml -t ipython -n 5 -s sge -q linux01.q -r conmem=12

I get this error message:

    The cluster startup timed out. This could be for a couple of reasons. The
    most common reason is that the queue you are submitting jobs to is
    oversubscribed. You can check if this is what is happening by trying again,
    and watching to see if jobs are in a pending state or a running state when
    the startup times out. If they are in the pending state, that means we just
    need to wait longer for them to start, which you can specify by passing
    the --timeout parameter, in minutes.

    The second reason is that there is a problem with the controller and engine
    jobs being submitted to the scheduler. In the directory you ran from,
    you should see files that are named YourScheduler_enginesABunchOfNumbers and
    YourScheduler_controllerABunchOfNumbers. If you submit one of those files
    manually to your scheduler (for example bsub < YourScheduler_controllerABunchOfNumbers)
    You will get a more helpful error message that might help you figure out what
    is going wrong.

    The third reason is that you need to submit your bcbio_nextgen.py job itself as a job;
    bcbio-nextgen needs to run on a compute node, not the login node. So the
    command you use to run bcbio-nextgen should be submitted as a job to
    the scheduler. You can diagnose this because the controller and engine
    jobs will be in the running state, but the cluster will still timeout.

    Finally, it may be an issue with how the cluster is configured-- the controller
    and engine jobs are unable to talk to each other. They need to be able to open
    ports on the machines each of them are running on in order to work. You
    can diagnose this as the possible issue by if you have submitted the bcbio-nextgen
    job to the scheduler, the bcbio-nextgen main job and the controller and
    engine jobs are all in a running state and the cluster still times out. This will
    likely to be something that you'll have to talk to the administrators of the cluster
    you are using about.

    If you need help debugging, please post an issue here and we'll try to help you
    with the detective work:

    https://github.com/roryk/ipython-cluster-helper/issues

This sample config file is for a mirnaseq pipeline run. I'm getting the same message when I try to run the rnaseq pipeline, so it might be related to a cluster configuration issue.

For some reason, it doesn't look like the cluster is allocating the resources needed to run the jobs.

    toque03<16:57:10> more log/bcbio-nextgen-debug.log
    [2018-03-08T20:12Z] toque03: System YAML configuration: /proj/sadevs/sadev00/exome_pipeline/bcbio/galaxy/bcbio_system.yaml
    [2018-03-08T20:12Z] toque03: Resource requests: atropos, picard; memory: 3.50; cores: 1, 1
    [2018-03-08T20:12Z] toque03: Configuring 5 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
    [2018-03-08T21:12Z] toque03: System YAML configuration: /proj/sadevs/sadev00/exome_pipeline/bcbio/galaxy/bcbio_system.yaml
    [2018-03-08T21:12Z] toque03: Resource requests: atropos, picard; memory: 3.50; cores: 1, 1
    [2018-03-08T21:12Z] toque03: Configuring 5 jobs to run, using 1 cores each with 3.50g of memory reserved for each job

roryk commented 6 years ago

Gotcha-- if you submit the bcbio job and watch the jobs, are the engine jobs actually running on the cluster or are they pending? If they are pending then adding --timeout 2000 to your bcbio-nextgen command will make it wait for 2000 minutes rather than the default 15.

ro6ert commented 6 years ago

It says 0 engines running so it looks like it's pending? I'll add a longer timeout with the --timeout flag.

roryk commented 6 years ago

If you look at the scheduler with qstat it should show you what state all the jobs are in. There will be the main bcbio job, a controller job and a job array of engines. Usually the bcbio job and the controller job will be running and the engines waiting for resources.

roryk commented 6 years ago

qstat -u your-user-name will show just your jobs

ro6ert commented 6 years ago

    -bash-3.2$ qstat
    job-ID  prior    name     user   state  submit/start at      queue  slots  ja-task-ID
       159  0.60000  bcbio-e  rerpc  dr     03/08/2018 16:29:20         5      1
       253  0.50000  bcbio-c  rerpc  r      03/08/2018 16:35:20         1      1
       255  0.00000  bcbio-e  rerpc  qw     03/08/2018 16:35:22         5      1

roryk commented 6 years ago

You will need to submit your main bcbio job to the scheduler as well. So the command bcbio_nextgen.py ../config/sampleconfig.yaml -t ipython -n 5 -s sge -q linux01.q -r conmem=12 should itself go in an SGE job submission file.
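For example, a minimal SGE submission script might look like this. This is only a sketch: the #$ directives, job name, and script filename are illustrative, while the queue and the bcbio command come from earlier in this thread; adjust for your site.

```shell
#!/bin/bash
#$ -S /bin/bash          # run the script with bash
#$ -cwd                  # start in the submission directory
#$ -q linux01.q          # same queue the engines are submitted to
#$ -N bcbio-main         # job name (illustrative)

bcbio_nextgen.py ../config/sampleconfig.yaml -t ipython -n 5 -s sge -q linux01.q -r conmem=12
```

Then submit it with something like qsub run_bcbio.sh (filename illustrative) and watch progress with qstat -u your-user-name as above.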