bcbio scaling tests: Docker required, stdout/stderr buffer, many parallel jobs

chapmanb commented 7 years ago

Thank you for all the help on getting bcbio running with our test CWL (#94). We've revamped how bcbio runs the CWL to avoid all the command line issues we were hitting and have started testing bunny on a larger sample, an NA12878 single chromosome validation for the GA4GH workflow execution challenge:

https://github.com/bcbio/bcbio_validation_workflows

We're hoping to identify any scaling issues and ran into three problems:

1. When using --no-container rabix still needs Docker. It appears to download an ubuntu image even when running with a non-Docker approach. Is it possible to avoid this so we can run on local systems without Docker?
2. Jobs with a lot of stdout/stderr will eventually lock up and fail to finish, apparently due to filling up the buffer that bunny uses to store them. The behavior I see is that my internal logging can never write to stdout and blocks indefinitely. If I avoid writing to standard out and redirect to a log I can work around this, but ideally we could support some way of also including these. What does bunny do with output that goes to stdout/stderr?
3. When running many simultaneous jobs, bunny does not respect system requirements. I have ~600 jobs which can run in parallel, each requesting 1 core and 3Gb of memory. When running on an 8 core machine bunny schedules them all simultaneously instead of 8 at a time. This makes the machine pretty unhappy.

I'm definitely happy to expand on any of these (or discuss more in separate issues, whatever is easier). Thanks again for all the help, looking forward to having bunny working on the GA4GH challenge CWL.

simonovic86 commented 7 years ago

Thanks for all the feedback, we really appreciate it. As for the problems, we will start working on them right away.

The first problem is definitely an issue with Bunny. We identified that as well and we'll fix it ASAP. The ubuntu image is still being pulled because executor.set_permissions=true: Bunny uses Docker to change the permissions of some files if it needs to. You can set executor.set_permissions to false, which will solve the problem for now.

The third problem can be solved by setting resource.fitter.enabled to true. That will enable Bunny to schedule jobs with respect to available resources. By default, Bunny schedules every job for execution right away. We should set the property to true by default.
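
To illustrate what scheduling with respect to resources means for the reported case, here is a minimal sketch in Python. It is not Bunny's actual fitter: the `Job` class, the `fit_jobs` function, and the 32 GB memory figure are hypothetical. Only the 8 cores and the 1 core / 3 GB per job come from the report above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    mem_gb: float


def fit_jobs(pending, free_cores, free_mem_gb):
    """Greedily pick pending jobs that fit into the free cores and memory."""
    started = []
    for job in pending:
        if job.cores <= free_cores and job.mem_gb <= free_mem_gb:
            started.append(job)
            free_cores -= job.cores
            free_mem_gb -= job.mem_gb
    return started


# 600 jobs, each requesting 1 core and 3 GB, as in the report.
pending = [Job(f"job-{i}", cores=1, mem_gb=3) for i in range(600)]

# 8 cores is from the report; 32 GB of memory is an assumed figure.
batch = fit_jobs(pending, free_cores=8, free_mem_gb=32)
print(len(batch))  # prints 8
```

With these numbers the core count is the limiting resource, so 8 jobs start and the remaining 592 wait until resources are released. On a machine with less memory, the memory request would become the limit instead.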

As for the second problem, if I'm not mistaken, we are doing everything according to the spec. I need to investigate this one further.

Thanks again for the feedback!

chapmanb commented 7 years ago

Janko; Thanks for the quick feedback, this is so helpful. Swapping over those variables in config/core.properties resolved both of the problems and let the NA12878 pipeline run through to completion. Awesome, I'll switch over these defaults in the bioconda bunny install in the short term and then can re-evaluate on the next version.
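
For anyone applying the same change, a small Python sketch of switching the two properties named above in a properties file. The helper names `apply_overrides` and `update_properties` are hypothetical, and the sketch assumes the file uses plain `key=value` lines; it is not part of bunny or bcbio.

```python
from pathlib import Path

# The two settings discussed in this thread.
OVERRIDES = {
    "executor.set_permissions": "false",
    "resource.fitter.enabled": "true",
}


def apply_overrides(text, overrides):
    """Return properties text with the given keys set, appending missing keys."""
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if not line.lstrip().startswith("#") and key in remaining:
            # Replace the existing value for this key.
            out.append(f"{key}={remaining.pop(key)}")
        else:
            # Keep comments and unrelated settings untouched.
            out.append(line)
    # Keys that were not present in the file are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


def update_properties(path, overrides=OVERRIDES):
    """Rewrite the properties file at path in place."""
    path = Path(path)
    path.write_text(apply_overrides(path.read_text(), overrides))
```

Calling `update_properties("config/core.properties")` from the bunny install directory would apply both settings. Commented-out lines are left as they are, and a key that is missing from the file is added at the end.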

For the second issue, I'm not sure this is a spec question as much as an implementation detail of what bunny does with stdout/stderr. In this run specifically, bwa generates a bunch of output which seems to overwhelm the buffer that bunny uses for storing it. I'm only guessing here, as it's not clear to me what happens to stdout/stderr in bunny and where it gets redirected. Ideally we'd be able to write whatever happens and see it reflected somewhere in the run directory for debugging. In the short term, writing only to a file fixes the issue and lets us run, but I hope that better explains my thinking around that issue.
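
The short-term workaround can be sketched in Python as below. It shows only the general pattern of sending a tool's stdout and stderr to a log file, so the tool never depends on anyone reading from a pipe. It does not reflect how bunny handles output internally, which is still unclear at this point, and the function name `run_logged` is hypothetical.

```python
import subprocess


def run_logged(cmd, log_path):
    """Run cmd with stdout and stderr appended to a log file.

    Writing to a file means the child process cannot block on a full pipe,
    however much output it produces. Returns the exit code of the command.
    """
    with open(log_path, "ab") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

For example, `run_logged(["bwa", "mem", ...], "bwa.log")` would keep all of bwa's output in `bwa.log` in the working directory, where it stays available for debugging after the run.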

Thank you again for all the help.

rabix / bunny

bcbio scaling tests: Docker required, stdout/stderr buffer, many parallel jobs #258