esebesty closed this issue 8 years ago
Hi Endre,
Sorry for the trouble. It's hard to say without looking at the whole log file; could you post the controller and ipengine log files? It sounds like the controller and engine can't talk to each other, but I'm not sure why that would be if it was working before. Are you submitting the bcbio_nextgen.py command to a compute node, or are you running it on the head node?
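One quick way to test the "can't talk to each other" theory from a compute node is a plain TCP check. This is only a minimal sketch using the Python standard library, not anything bcbio or IPython provides: `can_connect` is a hypothetical helper, and the host and port are assumptions you would need to take from whatever address your controller reports.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name resolution failures
        return False


# Hypothetical usage, run on the node where the engine job lands:
#   can_connect("controller-hostname", 12345)
```

If this returns False from an engine node but True on the node running the controller, that points at networking or firewalling between the nodes rather than at the pipeline itself.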
Hi! I'm submitting it on a compute node. Here are the two logfiles: controller and cluster.
hi,
I don't know if this is related to what was happening to me, but the symptoms were similar. It was hanging every two days. At some point I removed my ~/.ipython folder, and this time it ran to the end. I cannot reproduce the problem, and I don't know if that was the fix. Just sharing my experience. I know it doesn't make a lot of sense, because all the ipython files are inside the working directory. But I had a bunch of stuff in there, so I just removed the folder, and when I started the job it was created again. It now has only empty folders, so it is different than before.
have no other clue, sorry.
Removing various temp files, cache dirs, and rebooting the nodes did not help. However, I think now that the problem is related to some NFS issue, and that's why the controller and engines can't communicate. So closing the issue, it is probably unrelated to ipython.
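For anyone who wants to check the shared filesystem directly: one symptom of this kind of NFS trouble is that a file written on one node takes a long time to show up on another. Below is a minimal sketch of such a check, assuming you write a small test file on one node and run this on a different node against the same shared path. `wait_for_file` is a hypothetical helper, not part of bcbio or IPython.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll=0.5):
    """Poll until `path` exists and is non-empty.

    Returns the number of seconds waited, or None if `timeout` elapsed first.
    """
    start = time.monotonic()
    while True:
        try:
            if os.path.getsize(path) > 0:
                return time.monotonic() - start
        except OSError:
            pass  # file is not visible (yet) from this node
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(poll)
```

A delay of many seconds, or a None result for a file that definitely exists on the writing node, would be consistent with the NFS explanation, although it does not prove it.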
Thanks for following up Endre, let us know if we can do anything to help.
It seems I can't run my bcbio pipeline when I try to resume a job that failed previously. It fails with the "The cluster startup timed out. This could be for a couple of reasons. ..." message. The queue is empty, nobody besides me is doing anything on the compute nodes, and all of this used to work. As soon as I start bcbio, the controller and engine jobs are submitted and start running, and then the startup times out. The load on the compute nodes also jumps to 40-50, but nothing is actually running.
debug log contains:
bcbio engine script contains:
but I have no idea why. Any suggestions on where to look?