kids-first / kf-alignment-workflow

:microscope: Alignment workflow for Kids-First DRC
Apache License 2.0

🐎 open bam processing resource ins #137

Closed · dmiller15 closed 7 months ago

dmiller15 commented 7 months ago

Description

BixOps reported two issues:

As a result, I've opened the CPU and RAM ports for the following steps:

Closes https://github.com/d3b-center/bixu-tracker/issues/2328

Type of change

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Test Configuration:

Checklist:

dmiller15 commented 7 months ago

too bad Cavatica chose an m4 instance on that 90GB test. Could probably get over that by choosing a different RAM amount though

@migbro Yeah those 4th gen instances are a disaster. Their EBS bandwidth is a joke.

dmiller15 commented 7 months ago

@migbro I came up with another optimization. All I did was fold the bwa_payload creation into the end of bamtofastq. The end result is pretty streamlined task execution. See the tasks:

Jury is still out on the download/upload improvement on the c5.2xlarge.

dmiller15 commented 7 months ago

@migbro I went through and tallied up the INPUT/OUTPUT times between the two instances. On INPUT it seems like a wash. The c5.xlarge instances spent a total of 182 minutes downloading the files. The c5.2xlarge instances also spent a total of 182 minutes downloading the files. My guess is we're running up against either the bandwidth or CPU limit here. Each download is single-threaded since both instances are max stacked.

OUTPUT, however, is a completely different story. The c5.xlarge took an eye-watering 328 minutes to upload its outputs. The c5.2xlarge dealt with them in 186 minutes. Clearly a big advantage to the 2xlarge. It's hard to say exactly why. My guess would be that the extra CPUs are what swing this in favor of the 2x. Towards the end of the runtime, the 2x instances are going to have more free CPUs as early tasks wrap up and nothing replaces them. That lets the instance throw those remaining cores at the longer tasks. Looking at the instance metrics you can see this pretty clearly: the CPU will be maxed out uploading a file, drop to 50% for a short time when nothing is uploading, then shoot back up to 100% to upload the next file.
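
For a quick sanity check on that upload gap, the relative advantage works out from the totals above (the minute figures are taken straight from the measurements; nothing new is assumed here):

```python
# Reported OUTPUT (upload) totals from the two test runs, in minutes.
c5_xlarge_upload_min = 328
c5_2xlarge_upload_min = 186

# Relative upload advantage of the c5.2xlarge for this maximally stacked run.
speedup = c5_xlarge_upload_min / c5_2xlarge_upload_min
print(f"c5.2xlarge OUTPUT advantage: {speedup:.2f}x")  # ~1.76x
```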

In summary, for maximally stacked instances more cores don't seem to help on INPUT but are a large benefit on OUTPUT. I'm also running the single read group BAM through this new pipeline right now. That will give us a good idea of what exactly we're maxing out on for INPUT. If the times are again the same, we're likely bandwidth limited. If the 2x is faster, then it's the CPU limit.

dmiller15 commented 7 months ago

Ok, results from the larger test are in:

Here we can clearly see the influence of the extra cores on both INPUT and OUTPUT. In both cases, the c5.2x performs the IO in half the time of the c5.x. I guess this confirms that what we were seeing earlier was likely the result of core constraints on a max stacked instance.

So now down to brass tacks. We know that the difference in IO comes down to cores. That means a given task can have INPUT that is up to 2x faster and OUTPUT that is up to 5x faster on a c5.2x (in the case of a 4x stacked job, your download would get 1 core on the c5.x and 2 cores on the c5.2x, and the first upload would get 1 core on a c5.x versus 5 cores on a c5.2x). Across all 4 of those jobs the c5.2x would enjoy a 2.6x OUTPUT advantage over the c5.x.

Going up to 8 stacked tasks, things get more complicated. The best-case OUTPUT advantage continues to shrink and eventually settles at 1.8x. The best-case INPUT advantage also continues to shrink and at 8 stacked is completely gone. It is worth noting that at this point we would have 2 c5.xlarge instances running compared to the single c5.2x. So for jobs with 5-8 or 13-16 tasks, the monetary advantage of using a c5.x is nullified by the cost of a second instance.
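
Those bands (5-8, 13-16, etc.) fall straight out of the vCPU counts: a c5.xlarge has 4 vCPUs and a c5.2xlarge has 8, so with one task stacked per vCPU (the assumption used throughout this thread) the instance count is a ceiling division, and the hourly spend only evens out when twice as many small instances are running. A quick sketch of that arithmetic:

```python
import math

C5_XLARGE_VCPUS = 4    # vCPUs on a c5.xlarge
C5_2XLARGE_VCPUS = 8   # vCPUs on a c5.2xlarge

def instances_needed(tasks: int, vcpus_per_instance: int) -> int:
    """Instances required when tasks are stacked one per vCPU."""
    return math.ceil(tasks / vcpus_per_instance)

for tasks in range(1, 17):
    small = instances_needed(tasks, C5_XLARGE_VCPUS)
    large = instances_needed(tasks, C5_2XLARGE_VCPUS)
    # A c5.xlarge runs at roughly half the hourly price of a c5.2xlarge, so the
    # hourly spend only evens out once twice as many small instances are running.
    note = "  <- hourly cost evens out" if small == 2 * large else ""
    print(f"{tasks:2d} read groups: {small} x c5.xlarge vs {large} x c5.2xlarge{note}")
```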

dmiller15 commented 7 months ago

So what's better?

For Single Read Groups

For single instance jobs the answer is determinable:

Therefore, it is cost-effective to use the c5.2x when the IO time is twice as long as the runtime on a c5.x. As of yet we have never observed such a case. As a matter of fact, the ratio of IO time to runtime tends to hover around 1x most of the time.
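
To show the shape of the comparison being made here (not the actual simulated numbers, which aren't reproduced in this thread), a minimal cost sketch follows. The hourly prices and task times are illustrative placeholders, and the IO-halving assumption for the bigger instance is borrowed from the single-read-group test above; the exact break-even depends on how much of the IO actually speeds up with the extra cores.

```python
# Rough cost model for a single-read-group task that occupies the whole
# instance: the instance is billed for its full wall time, i.e. tool runtime
# plus INPUT/OUTPUT time. Prices and times are illustrative, not measurements.

ON_DEMAND_PRICE_PER_HOUR = {
    "c5.xlarge": 0.17,   # approximate/hypothetical on-demand rate
    "c5.2xlarge": 0.34,
}

def task_cost(instance: str, runtime_min: float, io_min: float) -> float:
    """Dollar cost of one task given its runtime and IO time in minutes."""
    wall_hours = (runtime_min + io_min) / 60.0
    return ON_DEMAND_PRICE_PER_HOUR[instance] * wall_hours

# Hypothetical case where IO time roughly equals runtime (the common case
# noted above), with IO halving on the bigger instance.
print(f"c5.xlarge:  ${task_cost('c5.xlarge', 120, 120):.2f}")
print(f"c5.2xlarge: ${task_cost('c5.2xlarge', 120, 60):.2f}")
```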

Simulated cost (based on the tasks in this PR: runtime roughly equal to IO time; OUTPUT longer than INPUT):

For Four or Fewer Read Groups (also 9-12, 17-20, etc.)

For this group things slightly improve for the c5.2x; the IO time only needs to be 1.75x larger than the runtime for the c5.2x to come out ahead. Again, I have yet to observe anything over 1x for that ratio. The odds of a c5.2x ever being advantageous here are very low.

Simulated cost:

Five to Eight Read Groups (also 13-16, 21-24, etc.)

Now things have changed. We have two instances going for the c5.x and an equal number of CPUs in play, so we can take hourly CPU cost out of the equation. Runtimes are equal; IO times decrease with more CPUs. With the hourly cost equal, the c5.2x's faster IO should make it the less expensive option. There are some niche considerations though: in the case of 5 or 6 read groups, that second c5.x instance would actually have better INPUT performance than the c5.2x. The OUTPUT on the second instance would also be quite good, with 3 or 4 CPUs available in those cases. The real OUTPUT benefit for the c5.2x isn't realized until the final four read groups.

Simulated Cost

Wrap Up

Most of the time, it appears that the most cost-effective solution is the smaller instance. The additional cost of longer IO is more than balanced by the cheaper tool runtimes. In cases where the number of jobs closely matches the c5.2x CPU count, the larger instance derives a cost benefit from its faster OUTPUT. I've talked with @migbro about a potential way to hijack the scheduler, but I think that's something we probably want to hash out elsewhere.