cluster startup timeout on completely empty queue/nodes

esebesty commented 8 years ago

It seems I can't run my bcbio pipeline when I try to resume a job that failed previously. It fails with the "The cluster startup timed out. This could be for a couple of reasons. ..." message. The queue is empty, nobody but me is doing anything on the compute nodes, and all of this used to work. As soon as I start bcbio, the controller and engine jobs are submitted and start running, and then the startup times out. The load on the compute nodes also jumps to 40-50, but nothing is actually running.

debug log contains:

[2016-04-29T16:51Z] node7: Timing: structural variation initial
[2016-04-29T16:51Z] node7: Timing: hla typing
[2016-04-29T16:51Z] node7: Resource requests: gatk, mutect, picard, samtools, vardict, varscan; memory: 3.50, 2.50, 3.50, 2.00, 3.00, 2.00; cores: 1, 1, 1, 16, 1, 1
[2016-04-29T16:51Z] node7: Configuring 96 jobs to run, using 1 cores each with 3.50g of memory reserved for each job

bcbio engine script contains:

2016-04-29 19:46:17.239 [IPEngineApp] WARNING | No heartbeat in the last 5010 ms (7 time(s) in a row).
2016-04-29 19:46:17.239 [IPEngineApp] WARNING | No heartbeat in the last 5010 ms (7 time(s) in a row).
2016-04-29 19:46:17.239 [IPEngineApp] WARNING | No heartbeat in the last 5010 ms (7 time(s) in a row).
2016-04-29 19:46:17.240 [IPEngineApp] WARNING | No heartbeat in the last 5010 ms (7 time(s) in a row).

but I have no idea why. Any suggestions on where to look?
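
For anyone triaging a similar engine log, here is a rough Python sketch that counts the "No heartbeat" warnings per second and reports the longest run of misses. The regular expression is written against the four lines quoted above only, and the function name `summarize_heartbeat_warnings` is made up for illustration; other IPython versions may format the message differently.

```python
import re
from collections import Counter

# Pattern matched against the warning lines quoted in this thread.
# Assumption: other versions of the engine may word this differently.
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ "
    r"\[IPEngineApp\] WARNING \| "
    r"No heartbeat in the last (?P<ms>\d+) ms "
    r"\((?P<misses>\d+) time\(s\) in a row\)\."
)


def summarize_heartbeat_warnings(lines):
    """Return (warnings per second, highest 'in a row' count seen)."""
    per_second = Counter()
    max_misses = 0
    for line in lines:
        match = HEARTBEAT_RE.match(line.strip())
        if match is None:
            continue  # not a heartbeat warning, skip it
        per_second[match.group("ts")] += 1
        max_misses = max(max_misses, int(match.group("misses")))
    return per_second, max_misses
```

Feeding it the lines of an engine log (for example `summarize_heartbeat_warnings(open(path))`, with `path` pointing at your own log) shows whether the misses are isolated or keep growing. It only describes the symptom; it does not say why the controller is unreachable.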

roryk commented 8 years ago

Hi Endre,

Sorry for the troubles. It's hard to say without looking at the whole log file. Could you post the controller and ipengine log files? It sounds like the controller and engines can't talk to each other, but I'm not sure why that would happen if it was working before. Are you submitting the bcbio_nextgen.py command to a compute node, or are you running it on the head node?

esebesty commented 8 years ago

Hi! I'm submitting it on a compute node. Here are the two logfiles: controller and cluster.

lpantano commented 8 years ago

hi,

I don't know if this is related to what was happening to me, but the symptoms were similar: my runs were hanging every two days. At some point I removed my ~/.ipython folder, and this time the run worked until the end. I cannot reproduce the problem, and I don't know whether that was the fix. Just sharing my experience. I know it doesn't make a lot of sense, because all the ipython files are inside the working directory. But I had a bunch of stuff in there, so I just removed the folder, and when I started the job the folder was created again. It now contains only empty folders, so it is different from before.

have no other clue, sorry.
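
If someone wants to try the same ~/.ipython cleanup without losing the old contents, one option is to rename the folder instead of deleting it. A minimal Python sketch, assuming a timestamped `.bak-` suffix is acceptable; the helper name `set_aside` is hypothetical and not part of bcbio or ipython-cluster-helper:

```python
import os
import time


def set_aside(path):
    """Rename `path` to a timestamped backup; return the new name, or None if absent."""
    path = os.path.expanduser(path).rstrip(os.sep)
    if not os.path.exists(path):
        return None  # nothing to move
    backup = "{}.bak-{}".format(path, time.strftime("%Y%m%d-%H%M%S"))
    # os.rename stays on the same filesystem, so this is quick and reversible.
    os.rename(path, backup)
    return backup
```

Calling `set_aside("~/.ipython")` before starting the job leaves the old folder available for comparison, and renaming it back undoes the change.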

esebesty commented 8 years ago

Removing various temp files and cache dirs and rebooting the nodes did not help. However, I now think the problem is related to some NFS issue, and that is why the controller and engines can't communicate. I'm closing the issue, since it is probably unrelated to ipython.
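
For others who suspect the shared filesystem in a similar situation, a crude Python probe like the one below times small write, sync, and read-back round trips in the work directory. This is only a sketch under my own assumptions: the function name `probe_shared_dir` and the `.probe-` file prefix are invented here, and slow or failing round trips would merely be a hint, not proof that NFS is the cause.

```python
import os
import time
import uuid


def probe_shared_dir(directory, rounds=20, payload=b"ping"):
    """Time write+fsync+read round trips in `directory`; returns seconds per round."""
    timings = []
    for _ in range(rounds):
        # Unique name so concurrent probes from several nodes do not collide.
        path = os.path.join(directory, ".probe-" + uuid.uuid4().hex)
        start = time.perf_counter()
        try:
            with open(path, "wb") as handle:
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # force the data out to the server
            with open(path, "rb") as handle:
                if handle.read() != payload:
                    raise IOError("read back different data from " + path)
        finally:
            if os.path.exists(path):
                os.remove(path)
        timings.append(time.perf_counter() - start)
    return timings
```

Running it from a compute node against the bcbio work directory and comparing `max(timings)` with the same call on a local disk gives a rough idea of whether the shared mount is unusually slow from that node.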

roryk commented 8 years ago

Thanks for following up, Endre. Let us know if we can do anything to help.

roryk / ipython-cluster-helper #35