roryk / ipython-cluster-helper

Tool to easily start up an IPython cluster on different schedulers.

engines are killed but controller/wrapper script doesn't die #37

Closed esebesty closed 3 years ago

esebesty commented 8 years ago

I ran into an issue during the BAM sorting step with bcbio on the PBSPro cluster. Sometimes sorting hits the default memory limit set for the engines and the engines are killed by PBSPro. However, the controller and the script that started the bcbio_nextgen.py command keep going. There are no error messages or anything, and I have to kill them with qdel.

The ipcontroller log has things like

2016-06-13 18:02:12.149 [VMFixIPControllerApp] heartbeat::missed 884add43-8276-4a02-8df6-7506efdb4410 : 720
2016-06-13 18:02:17.145 [VMFixIPControllerApp] heartbeat::missed 884add43-8276-4a02-8df6-7506efdb4410 : 721
2016-06-13 18:02:17.146 [VMFixIPControllerApp] registration::unregister_engine(0)

but it seems to me that it just keeps on waiting. Is this the expected behavior?

roryk commented 8 years ago

Hi Endre,

Those controller log messages are expected behavior: the controller is detecting that an engine died and is no longer sending jobs to it. However, the controller and main job can sometimes get into a state where they should kill themselves because all of the engines have failed, but don't, and that might be happening here. We will probably have to detect that inside bcbio, so it might require some engineering to fix.

Regarding the out-of-memory issue with samtools: you can edit the memory options for samtools in the bcbio_system.yaml file in the path-to-your-bcbio/galaxy/ directory to give it more memory. There is some documentation about how to do that here: http://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#tuning-cores
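For example, the relevant part of bcbio_system.yaml might look roughly like this; the exact keys and defaults vary between bcbio versions, and the memory value is per core, so treat it as a sketch rather than a recommended setting:

```yaml
# Sketch of the samtools section of bcbio_system.yaml; values are examples only.
resources:
  samtools:
    cores: 16
    memory: 3G   # per-core memory for sorting; raise this if sorting hits the limit
```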

esebesty commented 8 years ago

Yes, I edited the memory option and that part now ran fine. However, I had the same issue with GATK and the Java memory size. I changed that too, and we will see if it runs OK this time.
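For reference, the GATK change was along these lines in bcbio_system.yaml (the exact section name and values may differ in other versions; the jvm_opts entry is what raises the Java heap):

```yaml
# Example only; -Xmx sets the maximum Java heap available to GATK.
resources:
  gatk:
    jvm_opts: ["-Xms500m", "-Xmx3500m"]
```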

esebesty commented 8 years ago

Sorry, a correction: the GATK BaseRecalibrator step died, but all engines and the controller kept going without any error. This was quite unfortunate, as I had set PBSPro to notify me by email if the submitted script died, so I was confident that everything was running fine until I checked the debug log.

Maybe I can report this in the bcbio repo.

chapmanb commented 8 years ago

Endre; The expected behavior of bcbio is to keep running after an error until all queued jobs finish, and then error out. So if some jobs fail due to memory issues, bcbio will keep running the remaining jobs rather than dying immediately. Is this the behavior you're seeing? The idea is to process whatever we can and then report the problematic issues at the end. The engine itself won't die, since it's only a tool error, so it continues processing work until the whole cluster is shut down on termination.

Hope this matches your experience. Practically, it sounds like you need to give GATK more memory in your bcbio_system.yaml to resolve the underlying issue and get your analysis finished.

esebesty commented 8 years ago

OK, so maybe there was still one engine running a command that I did not notice, and there is no issue at all. I'll check some details and reopen if I still see something strange.

esebesty commented 7 years ago

Hi,

I'm reopening this because I see some behavior that is strange (at least to me), and it seems I'm wasting CPU hours while bcbio is not doing anything but is still reserving resources.

So I have a startup script that I submit to the queue, which launches the bcbio controller and engines. The maximum walltime is 100h.
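For context, such a wrapper looks roughly like the sketch below; the queue name, resource request, and bcbio arguments are placeholders rather than my exact setup:

```bash
#!/bin/bash
#PBS -N run_bcbio
#PBS -l walltime=100:00:00
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -q workq

cd "$PBS_O_WORKDIR"
# bcbio itself submits the controller and engine jobs via ipython-cluster-helper
bcbio_nextgen.py ../config/project.yaml -t ipython -s pbspro -q workq -n 48
```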

run_bcbio.sh
|__ bcbio-controller
    |__ bcbio-engine{1,2,3}

run_bcbio.sh starts at time T

bcbio-controller waits in the queue and starts at T+5h

bcbio-engine{1,2,3} wait in the queue and start at T+50h

at this point the run_bcbio.sh script has already been running for 50h and the bcbio-controller for 45h.

at T+100h, run_bcbio.sh is killed by PBSPro, but bcbio-controller and bcbio-engine{1,2,3} keep going. However, the engines are not actually doing anything; I checked the debug and command logs. This wastes CPU hours.

Is it possible to correct this somehow in the ipython library, so that when the run_bcbio.sh script is killed the controller and engines also die? Right now we are trying to write a script that runs after the run_bcbio.sh script is killed, parses the ipython/log/ip{controller,cluster}-* files to get the qsub job IDs, and deletes them from the queue.
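Roughly, the cleanup we have in mind is the sketch below; whether the job IDs actually appear in those logs, and the job-ID pattern itself, are assumptions that will need adjusting for our setup:

```bash
#!/bin/bash
# Run after run_bcbio.sh has been killed by PBSPro.
# Assumes the ipcontroller/ipcluster logs mention the PBS job IDs of the
# controller and engine jobs (e.g. 123456.pbsserver); adjust the pattern
# to whatever the site's job IDs look like.
LOGDIR=ipython/log
for jobid in $(grep -ohE '[0-9]+\.[[:alnum:]._-]+' "$LOGDIR"/ip{controller,cluster}-* | sort -u); do
    qdel "$jobid" 2>/dev/null || true   # ignore jobs that have already finished
done
```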

Thanks for any advice,

Endre

chapmanb commented 7 years ago

Endre; Unfortunately it's not easy to fix this from bcbio, since the scheduler is killing the parent job (run_bcbio.sh) that controls everything. Essentially we've handed over control to the scheduler, and if it cuts off the parent process no more jobs will get dispatched. To fix this, we'd have to be able to intercept or identify the scheduler's kill command and then trigger removal of the jobs before dying. I'm not sure how to do this well on any single scheduler, much less reliably across multiple ones. This might be a case where it makes sense to set higher limits for your main job to account for the long wait times in your queue. Hope this helps some.
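For what it's worth, one scheduler-specific pattern (untested, and not something bcbio does today) is to trap the SIGTERM that PBSPro sends shortly before the hard kill at walltime and delete the child jobs from there. It assumes the site is configured with a grace period before SIGKILL, and that the controller/engine job IDs have been recorded somewhere, for example by parsing the ipython logs as described above; CHILD_JOB_IDS below is a hypothetical variable standing in for that:

```bash
#!/bin/bash
# Sketch of additions to a wrapper like run_bcbio.sh, not a bcbio feature.
cleanup() {
    # CHILD_JOB_IDS would have to be filled in, e.g. from the ipython/log/ files.
    for jobid in $CHILD_JOB_IDS; do
        qdel "$jobid" 2>/dev/null || true
    done
}
trap cleanup TERM

# Run bcbio in the background and wait on it, so the TERM trap can fire
# while the pipeline is still running.
bcbio_nextgen.py ../config/project.yaml -t ipython -s pbspro -q workq -n 48 &
wait $!
```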