kids-first / kf-alignment-workflow

:microscope: Alignment workflow for Kids-First DRC
Apache License 2.0

🐎 open bam processing resource ins #137

Closed · dmiller15 closed 7 months ago

dmiller15 commented 7 months ago

Description

BixOps reported two issues:

As a result, I've opened the CPU and RAM ports for the following steps:

Closes https://github.com/d3b-center/bixu-tracker/issues/2328

Type of change

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Test Configuration:

Checklist:

dmiller15 commented 7 months ago

too bad Cavatica chose an m4 instance on that 90GB test. Could probably get over that by choosing a different RAM amount though

@migbro Yeah those 4th gen instances are a disaster. Their EBS bandwidth is a joke.

dmiller15 commented 7 months ago

@migbro I came up with another optimization. All I did was fold the bwa_payload creation into the end of bamtofastq. The end result is pretty streamlined task execution. See the tasks:

Jury is still out on the download/upload improvement on the c5.2xlarge.

dmiller15 commented 7 months ago

@migbro I went through and tallied up the INPUT/OUTPUT times between the two instances. On INPUT it seems like a wash. The c5.xlarge instances spent a total of 182 minutes downloading the files. The c5.2xlarge instances also spent a total of 182 minutes downloading the files. My guess is we're running up against either the bandwidth or CPU limit here. Each download is single-threaded since both instances are max stacked.

OUTPUT, however, is a completely different story. The c5.xlarge took an eye-watering 328 minutes to upload its outputs. The c5.2xlarge dealt with them in 186 minutes. Clearly a big advantage to the 2xlarge. It's hard to say exactly why. My guess would be that the extra CPUs are what swing this in favor of the 2x. Towards the end of the runtime, the 2x instances are going to have more free CPUs as early tasks wrap up and nothing replaces them. That lets the instance throw those remaining cores at the longer tasks. Looking at the instance metrics you can see this pretty clearly: the CPU will be maxed out uploading a file, drop to 50% for a short time when nothing is uploading, then shoot back up to 100% to upload the next file.
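
For a quick sanity check on that upload gap, the relative advantage works out from the totals above (the minute figures are taken straight from the measurements; nothing new is assumed here):

```python
# Reported OUTPUT (upload) totals from the two test runs, in minutes.
c5_xlarge_upload_min = 328
c5_2xlarge_upload_min = 186

# Relative upload advantage of the c5.2xlarge for this maximally stacked run.
speedup = c5_xlarge_upload_min / c5_2xlarge_upload_min
print(f"c5.2xlarge OUTPUT advantage: {speedup:.2f}x")  # ~1.76x
```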

In summary, for maximally stacked instances more cores don't seem to help on INPUT but are a large benefit on OUTPUT. I'm also running the single read group BAM through this new pipeline right now. That will give us a good idea of what exactly we're maxing out on for INPUT. If the times are again the same, we're likely bandwidth limited. If the 2x is faster, then it's the CPU limit.

dmiller15 commented 7 months ago

Ok, results from the larger test are in:

Here we can clearly see the influence of the extra cores on both INPUT and OUTPUT. In both cases, the c5.2x performs the IO in half the time of the c5.x. I guess this confirms that what we were seeing earlier was likely the result of core constraints on a max stacked instance.

So now down to brass tacks. We know that the difference in IO comes down to cores. That means a given task can have INPUT that is up to 2x faster and OUTPUT that is up to 5x faster on a c5.2x (in the case of a 4x stacked job, your download would get 1 core on the c5.x and 2 cores on the c5.2x, and the first upload would get 1 core on a c5.x versus 5 cores on a c5.2x). Across all 4 of those jobs the c5.2x would enjoy a 2.6x OUTPUT advantage over the c5.x.

Going up to 8 stacked tasks, things get more complicated. The best-case OUTPUT advantage continues to shrink and eventually settles at 1.8x. The best-case INPUT advantage also continues to shrink and at 8 stacked is completely gone. It is worth noting that at this point we would have 2 c5.xlarge instances running compared to the single c5.2x. So for jobs with 5-8 or 13-16 tasks, the monetary advantage of using a c5.x is nullified by the cost of a second instance.
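
Those bands (5-8, 13-16, etc.) fall straight out of the vCPU counts: a c5.xlarge has 4 vCPUs and a c5.2xlarge has 8, so with one task stacked per vCPU (the assumption used throughout this thread) the instance count is a ceiling division, and the hourly spend only evens out when twice as many small instances are running. A quick sketch of that arithmetic:

```python
import math

C5_XLARGE_VCPUS = 4    # vCPUs on a c5.xlarge
C5_2XLARGE_VCPUS = 8   # vCPUs on a c5.2xlarge

def instances_needed(tasks: int, vcpus_per_instance: int) -> int:
    """Instances required when tasks are stacked one per vCPU."""
    return math.ceil(tasks / vcpus_per_instance)

for tasks in range(1, 17):
    small = instances_needed(tasks, C5_XLARGE_VCPUS)
    large = instances_needed(tasks, C5_2XLARGE_VCPUS)
    # A c5.xlarge runs at roughly half the hourly price of a c5.2xlarge, so the
    # hourly spend only evens out once twice as many small instances are running.
    note = "  <- hourly cost evens out" if small == 2 * large else ""
    print(f"{tasks:2d} read groups: {small} x c5.xlarge vs {large} x c5.2xlarge{note}")
```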

dmiller15 commented 7 months ago

So what's better?

For Single Read Groups

For single instance jobs the answer is determinable:

Therefore, it is cost-effective to use the c5.2x when the IO time is twice as long as the runtime on a c5.x. As of yet we have never observed such a case. As a matter of fact, the ratio of IO time to runtime tends to hover around 1x most of the time.
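
To show the shape of the comparison being made here (not the actual simulated numbers, which aren't reproduced in this thread), a minimal cost sketch follows. The hourly prices and task times are illustrative placeholders, and the IO-halving assumption for the bigger instance is borrowed from the single-read-group test above; the exact break-even depends on how much of the IO actually speeds up with the extra cores.

```python
# Rough cost model for a single-read-group task that occupies the whole
# instance: the instance is billed for its full wall time, i.e. tool runtime
# plus INPUT/OUTPUT time. Prices and times are illustrative, not measurements.

ON_DEMAND_PRICE_PER_HOUR = {
    "c5.xlarge": 0.17,   # approximate/hypothetical on-demand rate
    "c5.2xlarge": 0.34,
}

def task_cost(instance: str, runtime_min: float, io_min: float) -> float:
    """Dollar cost of one task given its runtime and IO time in minutes."""
    wall_hours = (runtime_min + io_min) / 60.0
    return ON_DEMAND_PRICE_PER_HOUR[instance] * wall_hours

# Hypothetical case where IO time roughly equals runtime (the common case
# noted above), with IO halving on the bigger instance.
print(f"c5.xlarge:  ${task_cost('c5.xlarge', 120, 120):.2f}")
print(f"c5.2xlarge: ${task_cost('c5.2xlarge', 120, 60):.2f}")
```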

Simulated cost (based on the tasks in this PR: runtime roughly equal to IO time; OUTPUT longer than INPUT):

For Four or Fewer Read Groups (also 9-12, 17-20, etc.)

For this group things slightly improve for the c5.2x; the IO time only needs to be 1.75x larger than the runtime for the c5.2x to come out ahead. Again, I have yet to observe anything over 1x for that ratio. The odds of a c5.2x ever being advantageous here are very low.

Simulated cost:

Five to Eight Read Groups (also 13-16, 21-24, etc.)

Now things have changed. We have two instances going for the c5.x and an equal number of CPUs in play, so we can take hourly CPU cost out of the equation. Runtimes are equal; IO times decrease with more CPUs. With the hourly cost equal, the c5.2x's faster IO should make it the less expensive option. There are some niche considerations though: in the case of 5 or 6 read groups, that second c5.x instance would actually have better INPUT performance than the c5.2x. The OUTPUT on the second instance would also be quite good, with 3 or 4 CPUs available in those cases. The real OUTPUT benefit for the c5.2x isn't realized until the final four read groups.

Simulated Cost

Wrap Up

Most of the time, it appears that the most cost-effective solution is the smaller instance. The additional cost of longer IO is more than balanced by the cheaper tool runtimes. In cases where the number of jobs closely matches the c5.2x CPU count, the larger instance derives a cost benefit from its faster OUTPUT. I've talked with @migbro about a potential way to hijack the scheduler, but I think that's something we probably want to hash out elsewhere.