ryan-williams opened 8 years ago
I was able to run the joint-caller on this data successfully from a recent HEAD (08dcc6ca7f896056194df5cac85687bc834e7863) three times today with varying numbers of executors: 10, 20, and dynamic (52-369, per stage widths). All three runs used `--master yarn --deploy-mode cluster`; a sketch of the invocation is below.
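Roughly, the commands looked like this (jar path, main class, and caller arguments here are placeholders, not the exact command I ran):

```bash
# Sketch only: jar path, main class, and caller args are placeholders.

# Fixed-executor runs (10 and 20):
spark-submit \
  --master yarn --deploy-mode cluster \
  --num-executors 20 \
  --class org.hammerlab.guacamole.Main \
  guacamole.jar \
  somatic-joint ...

# Dynamic-allocation run (executor count floats with stage widths;
# dynamic allocation on YARN also needs the external shuffle service):
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class org.hammerlab.guacamole.Main \
  guacamole.jar \
  somatic-joint ...
```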
See stats below, including times for the bottleneck Stage 6, which builds pileups and calls variants on them:
[Stages-page screenshots for the 10-executor, 20-executor, and dynamic-allocation runs]
These portray a pretty good robustness story, and provide some promising scaling data points.
Going from 10 to 20 executors halved the bottleneck stage's time (and then some!), and the whole app ran in 58% of the time. Put another way, the two runs behaved as if they had perfect linear scaling outside of just under 4 minutes of fixed-cost time. That's more than reasonable considering we lost about that much time doing loci-partitioning broadcasting between the end of stage 4 and the beginning of stage 5, which produced gaps of 4:08, 3:51, and 4:20 in the 10-, 20-, and dynamic runs, respectively, during which the driver was the only node doing work.
This and other fixed time costs weighed further against the linear-scaling null hypothesis when going from 20 executors to dynamic allocation (52-359), the latter only running about half as fast as linear scaling would predict.
So that's more than 6 minutes of fixed cost, outside of which the dynamic-allocation run was definitely in the ideal linear-scaling range for its 52-359 executors. Of course, the fixed costs matter, but this is still a good sanity check.
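To spell out the arithmetic behind those fixed-cost figures (this is just my back-of-the-envelope model, nothing computed by the code): assume the wall-clock time with $n$ executors decomposes into a fixed, serial (mostly driver-side) cost $F$ plus perfectly parallel work $W$:

$$T(n) = F + \frac{W}{n}$$

Two runs then pin down both parameters; e.g. from the 10- and 20-executor runs,

$$T(10) - T(20) = \frac{W}{10} - \frac{W}{20} = \frac{W}{20} \;\Longrightarrow\; W = 20\,\bigl(T(10) - T(20)\bigr), \qquad F = 2\,T(20) - T(10)$$

which is roughly the calculation behind the "just under 4 minutes" and "more than 6 minutes" of fixed cost above.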
In a couple of attempts to run this in "local" mode (`--master local`), reading the BAMs from a local NFS mount instead of HDFS, I saw a 20GB driver OOM. We've discussed trying to make sure local runs are reasonably performant in the past, and this seems like it could be a good test case for ironing out kinks there, since these BAMs are a nice medium-small size that should be doable locally. In particular, it seems @jstjohn was attempting it this way when he was stymied on #386.
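For reference, the local attempts were shaped roughly like this (again a sketch: jar path, main class, BAM paths, and caller arguments are placeholders, and `local[*]` stands in for whatever thread count was actually used):

```bash
# Sketch of a local-mode run reading BAMs from the NFS mount rather than HDFS.
# Note: a 20g driver was observed to OOM on this data, so driver memory (or
# how much data ends up on the driver) is the thing to iterate on here.
spark-submit \
  --master 'local[*]' \
  --driver-memory 20g \
  --class org.hammerlab.guacamole.Main \
  guacamole.jar \
  somatic-joint /nfs/path/to/normal.bam /nfs/path/to/tumor.bam ...
```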
I'll follow up on this and see if I can get it working.
@ryan-williams is this task still in progress?
@jstjohn discussed hitting some issues running on the "case 2" data here.
I'm downloading the data now to attempt to reproduce.