dmiller15 closed this 7 months ago
Too bad Cavatica chose an m4 instance for that 90GB test. We could probably get around that by requesting a different amount of RAM, though.
@migbro Yeah those 4th gen instances are a disaster. Their EBS bandwidth is a joke.
@migbro I came up with another optimization. All I did was fold the bwa_payload creation into the end of bamtofastq. The end result is pretty streamlined task execution. See the tasks:
Jury is still out on the download/upload improvement on the c5.2xlarge.
@migbro I went through and tallied up the INPUT/OUTPUT times between the two instances. On INPUT it seems like a wash. The c5.xlarge instances spent a total of 182 minutes downloading the files. The c5.2xlarge instances also spent a total of 182 minutes on downloading the files. My guess is we're running up against either the bandwidth or CPU limit here. Each download is single threaded since both instances are max stacked.
OUTPUT, however, is a completely different story. The c5.xlarge took an eye-watering 328 minutes to upload its outputs. The c5.2xlarge dealt with them in 186 minutes. Clearly a big advantage to the 2xlarge. It's hard to say exactly why. My guess is that the extra CPUs are what swing this in favor of the 2x. Towards the end of the runtime, the 2x instances have more free CPUs as early tasks wrap up and nothing replaces them, so the instance can throw those remaining cores at the longer tasks. Looking at the instance metrics you can see this pretty clearly: the CPU will be maxed out uploading a file, drop to 50% for a short stretch when nothing is uploading, then shoot back up to 100% for the next file.
In summary, for maximally stacked instances, more cores don't seem to help on INPUT but are a large benefit on OUTPUT. I'm also running the single read group BAM through this new pipeline right now. That will give us a good idea of exactly what we're maxing out on INPUT. If the times are again the same, we're likely bandwidth limited. If the 2x is faster, then it's the CPU limit.
OK, results from the larger test are in:
Here we can clearly see the influence of the extra cores on both INPUT and OUTPUT. In both cases, the c5.2x performs the IO in half the time of the c5.x. I guess this confirms that what we were seeing earlier was likely the result of core constraints on a max stacked instance.
So now down to brass tacks. We know the difference in IO comes down to cores. That means a given task can have INPUT that is up to 2x faster and OUTPUT that is up to 5x faster on a c5.2x (with 4 stacked jobs, each download gets 1 core on the c5.x versus 2 cores on the c5.2x, and the first upload gets 1 core on the c5.x versus 5 cores on the c5.2x). Across all 4 of those jobs, the c5.2x would enjoy a 2.6x OUTPUT advantage over the c5.x.
Going up to 8 stacked tasks, things get more complicated. The best case OUTPUT advantage continues to shrink and eventually settles at 1.8x. The best case INPUT advantage also continues to shrink, and at 8 stacked it is completely gone. It's worth noting that at this point we would have 2 c5.xlarge instances running against the single c5.2x. So for jobs with 5-8 or 13-16 tasks, the monetary advantage of using c5.x instances is nullified by the cost of the second instance.
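The core accounting above can be sketched numerically. This is just my working model of the scheduler (not something it guarantees): each still-running task pins one core, and the task that just finished throws every idle core at its upload.

```python
def upload_cores(total_cores, stacked):
    """Cores available to each successive upload, assuming every
    still-running task pins one core and the finishing task's
    upload soaks up all idle cores."""
    return [total_cores - still_running
            for still_running in range(stacked - 1, -1, -1)]

# 4 stacked tasks per instance
c5x = upload_cores(4, 4)     # [1, 2, 3, 4]
c5_2x = upload_cores(8, 4)   # [5, 6, 7, 8]
print(sum(c5_2x) / sum(c5x))  # 2.6 -> aggregate OUTPUT advantage

# 8 tasks: two c5.xlarge instances (4 tasks each) vs one c5.2xlarge
two_c5x = upload_cores(4, 4) * 2
one_c5_2x = upload_cores(8, 8)
print(sum(one_c5_2x) / sum(two_c5x))  # 1.8
```

Under that model the aggregate core ratios fall out directly: 26 upload-cores vs 10 gives the 2.6x figure for 4 stacked tasks, and 36 vs 20 gives the 1.8x figure for 8.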
For single instance jobs the answer is determinable:
Therefore, it is only cost effective to use the c5.2x when the IO time is at least twice as long as the runtime on a c5.x. We have yet to observe such a case; in fact, the ratio of IO time to runtime hovers around 1x most of the time.
Simulated cost (based on the tasks in this PR: runtime roughly equal to IO time; OUTPUT longer than INPUT):
For this group things improve slightly for the c5.2x; the IO time only needs to be 1.75x longer than the runtime for the c5.2x to come out ahead. Again, I have yet to observe anything over 1x for that ratio, so the odds of a c5.2x ever being advantageous here are very low.
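A back-of-the-envelope version of this break-even calculation, under two assumptions I'm supplying myself: the c5.2xlarge costs twice the hourly rate of the c5.xlarge (true for on-demand pricing), and the bigger instance speeds up IO by some effective factor `io_speedup`. Setting `p * (runtime + io) = 2p * (runtime + io / io_speedup)` and solving for `io / runtime` gives the threshold; an effective speedup of about 4 reproduces the 2x break-even above.

```python
def breakeven_io_ratio(io_speedup, price_ratio=2.0):
    """IO-to-runtime ratio at which the bigger instance breaks even.

    cost_small = p * (runtime + io)
    cost_big   = p * price_ratio * (runtime + io / io_speedup)
    Equating the two and solving for io / runtime yields the formula
    below (only meaningful when io_speedup > price_ratio).
    """
    return (price_ratio - 1.0) / (1.0 - price_ratio / io_speedup)

print(breakeven_io_ratio(4.0))  # 2.0 -> IO must be 2x the runtime
```

The `io_speedup` value for a real task mix is whatever blend of the INPUT (~2x) and OUTPUT (up to ~5x) advantages applies; a more OUTPUT-heavy mix raises the speedup and lowers the break-even ratio toward the 1.75x figure.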
Simulated cost:
Now things have changed. We have two instances going for the c5.x and an equal number of CPUs in play, so the hourly cost of the two options is identical. Runtimes are equal, and IO times decrease with more CPUs, so with cost out of the equation the c5.2x should come out less expensive. There are some niche considerations, though: in the case of 5 or 6 read groups, that second c5.x instance would actually have better INPUT performance than the c5.2x. The OUTPUT on the second instance would also be quite good, with 3 and 4 CPUs allocated to its uploads in those cases. The real OUTPUT benefit for the c5.2x isn't realized until the final four read groups.
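The 5-or-6 read group niche can be seen with the same kind of sketch, assuming concurrent downloads split an instance's cores evenly (again my working assumption, not a scheduler guarantee):

```python
def download_cores(total_cores, stacked):
    """Cores per concurrent download, assuming the stacked tasks
    split the instance's cores evenly during INPUT."""
    return total_cores / stacked

# 6 read groups: one c5.2xlarge runs all 6; the c5.x pair splits 4 + 2
print(download_cores(8, 6))  # ~1.33 cores per download on the c5.2x
print(download_cores(4, 2))  # 2.0 cores per download on the 2nd c5.x
```

The lightly loaded second c5.x gives each of its two downloads more cores than the fully stacked c5.2x can spare, which is where its INPUT edge comes from.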
Simulated cost:
Most of the time, the most cost-effective solution appears to be the smaller instance. The additional cost of longer IO is more than balanced by the cheaper tool runtimes. In cases where the number of jobs closely matches the c5.2x CPU count, the bigger instance derives a cost benefit from its faster OUTPUT. I've talked with @migbro about a potential way to hijack the scheduler, but that's something we probably want to hash out elsewhere.
Description
BixOps reported two issues:
As a result, I've opened the CPU and RAM ports for the following steps:
c5.xlarge (4 CPUs/8 GB RAM). As a result, stacking is limited to a max of 4. With concurrent instances currently at 4, this means we will be able to process up to 16 read groups per pass. That's a pretty solid limit in my opinion.

Closes https://github.com/d3b-center/bixu-tracker/issues/2328
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Test Configuration:
Checklist: