ryan-williams opened 8 years ago
I was able to run the joint-caller on this data successfully from a recent HEAD (08dcc6ca7f896056194df5cac85687bc834e7863) three times today with varying numbers of executors: 10, 20, and dynamic (52-369, per stage widths). All three runs used `--master yarn --deploy-mode cluster`; a sketch of the invocation is below.
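Roughly, the commands looked like this (jar path, main class, and caller arguments here are placeholders, not the exact command I ran):

```bash
# Sketch only: jar path, main class, and caller args are placeholders.

# Fixed-executor runs (10 and 20):
spark-submit \
  --master yarn --deploy-mode cluster \
  --num-executors 20 \
  --class org.hammerlab.guacamole.Main \
  guacamole.jar \
  somatic-joint ...

# Dynamic-allocation run (executor count floats with stage widths;
# dynamic allocation on YARN also needs the external shuffle service):
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --class org.hammerlab.guacamole.Main \
  guacamole.jar \
  somatic-joint ...
```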
See stats below, including times for the bottleneck Stage 6, which builds pileups and calls variants on them:
[Stages-page screenshots for the 10-executor, 20-executor, and dynamic-allocation runs]
These portray a pretty good robustness story, and provide some promising scaling data points.
Going from 10 to 20 executors halved the bottleneck stage's time (and then some!), and the whole app ran in 58% of the time. Put another way, the two runs behaved as if they had perfect linear scaling outside of just under 4 minutes of fixed-cost time. That's more than reasonable considering we lost about that much time doing loci-partitioning broadcasting between the end of stage 4 and the beginning of stage 5, which produced gaps of 4:08, 3:51, and 4:20 in the 10-, 20-, and dynamic runs, respectively, during which the driver was the only node doing work.
This and other fixed time costs weighed further against the linear-scaling null hypothesis when going from 20 executors to dynamic allocation (52-359), the latter only running about half as fast as linear scaling would predict.
So that's more than 6 minutes of fixed cost, outside of which the dynamic-allocation run was definitely in the ideal linear-scaling range for its 52-359 executors. Of course, the fixed costs matter, but this is still a good sanity check.
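To spell out the arithmetic behind those fixed-cost figures (this is just my back-of-the-envelope model, nothing computed by the code): assume the wall-clock time with $n$ executors decomposes into a fixed, serial (mostly driver-side) cost $F$ plus perfectly parallel work $W$:

$$T(n) = F + \frac{W}{n}$$

Two runs then pin down both parameters; e.g. from the 10- and 20-executor runs,

$$T(10) - T(20) = \frac{W}{10} - \frac{W}{20} = \frac{W}{20} \;\Longrightarrow\; W = 20\,\bigl(T(10) - T(20)\bigr), \qquad F = 2\,T(20) - T(10)$$

which is roughly the calculation behind the "just under 4 minutes" and "more than 6 minutes" of fixed cost above.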
In a couple of attempts to run this in "local" mode (`--master local`), reading the BAMs from a local NFS mount instead of HDFS, I saw a 20GB driver OOM. We've discussed trying to make sure local runs are reasonably performant in the past, and this seems like it could be a good test case for ironing out kinks there, since these BAMs are a nice medium-small size that should be doable locally. In particular, it seems @jstjohn was attempting it this way when he was stymied on #386.
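For reference, the local attempts were shaped roughly like this (again a sketch: jar path, main class, BAM paths, and caller arguments are placeholders, and `local[*]` stands in for whatever thread count was actually used):

```bash
# Sketch of a local-mode run reading BAMs from the NFS mount rather than HDFS.
# Note: a 20g driver was observed to OOM on this data, so driver memory (or
# how much data ends up on the driver) is the thing to iterate on here.
spark-submit \
  --master 'local[*]' \
  --driver-memory 20g \
  --class org.hammerlab.guacamole.Main \
  guacamole.jar \
  somatic-joint /nfs/path/to/normal.bam /nfs/path/to/tumor.bam ...
```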
I'll follow up on this and see if I can get it working.
@ryan-williams is this task still in progress?
@jstjohn discussed hitting some issues running on the "case 2" data here.
I'm downloading the data now to attempt to reproduce.