precisely / bioinformatics


Investigate Beagle CPU usage #56

Open gcv opened 6 years ago

gcv commented 6 years ago

Imputation takes 14.5 hours on a Fargate container with 4096 CPU units and 8GB of RAM. Need to kick off an imputation, then log into the container and take a look at its memory and CPU pressure. Also need to check whether the number of cores the container exposes matches the number of cores Beagle is configured to use.

If it looks like Beagle under-utilizes the machine, ask Beagle people how to make it more efficient.

taltman commented 6 years ago

Are you running all chromosomes through Beagle in one instance?

If so, then it's no surprise that the wall-time from beginning to end is not great. Most bioinformatics software doesn't do a great job of exploiting parallelism. That is why beagle-leash opts to run multiple beagle instances in parallel rather than use the CPU parallelism of a single instance. So the way to get the fastest completion of a whole genome is to get an instance with a lot of memory, so that multiple beagle Java instances can load into memory in parallel. I recall using an instance with ~60 GB of RAM to process all chromosomes in parallel. You will probably be fine with just 2 cores per beagle instance.

If I am misunderstanding the optimization objective, please let me know.

gcv commented 6 years ago

Thanks @taltman! I'm using beagle-leash with the core setting set to 4, but I'm also using the BEAGLE_LEASH_CHROMS variable to select which chromosome to impute. But all that runs sequentially (i.e., I loop over chromosome numbers, set the variable, and then call beagle-leash).
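For clarity, the sequential loop I'm describing looks roughly like this (the paths and the launch-script name are placeholders, not the real ones):

for chr in $(seq 1 22) X; do
  export BEAGLE_LEASH_CHROMS="$chr"
  beagle-leash input.vcf output-chr${chr}.vcf.gz 4   # one chromosome per call, 4 cores
done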

What's the best way to get more parallelism out of that?

taltman commented 6 years ago

The best way to get more parallelism out of beagle-leash is to run it as intended. :-) That means not using the BEAGLE_LEASH_CHROMS environment variable to restrict which chromosomes run in parallel. There's a front-end script for running beagle-leash, and it has an option for running in parallel (read the beagle-leash script for more details). You provide `-j 24` as an argument (the number of chromosomes), GNU Make figures out all of the parallel targets that need to be processed, and it spawns up to 24 processes at a time to complete the tasks. GNU Make monitors system load and memory usage, so it will not use all 24 cores if it is hitting up against system limits. Like I said before, you will need ~2-3 gigabytes of RAM per instance, so try machines with 60-80 GB of RAM to run it massively parallel. Let me know if you have any problems running it that way. Good luck!
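As a rough sketch (check the beagle-leash script for the exact argument order; the paths and script name below are placeholders):

unset BEAGLE_LEASH_CHROMS                  # let Make enumerate every chromosome target
beagle-leash input.vcf output.vcf.gz 24    # the last argument becomes make -j 24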

gcv commented 6 years ago

Thanks for the hints! A few observations:

First, not setting BEAGLE_LEASH_CHROMS only imputes chromosomes 1, 22, and X. I noticed this on line 37 of run-beagle-pipeline.make, and have confirmed this by running an imputation. It indeed only output imputed values for chromosomes 1, 22, and X.

Second, the entry point script takes an nprocs parameter, which I definitely use. I have been using a value of 4, figuring that's a good place to start. That looks like it indeed turns into a -j parameter to Make. Is it then the case that Beagle does not parallelize imputation on a single chromosome, and that beagle-leash orchestrates running imputations on multiple chromosomes, with one process per chromosome?

gcv commented 6 years ago

Well, I tried it. Used nearly the maximum available in ECS, 30GB of RAM, and 8 chromosomes at a time (effectively make -j8). That took about 7 hours, or only about twice as fast as imputing one chromosome at a time.

The way I did it was by setting BEAGLE_LEASH_CHROMS to values like 1 2 3 4 5 6 7 8 and passing 8 into the launch script. It certainly looks like it started 8 JVMs at a time, though I did not log into the container to check RAM or CPU use.
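Roughly, the batching looked like this (placeholder paths; batches written out by hand for illustration):

for batch in "1 2 3 4 5 6 7 8" "9 10 11 12 13 14 15 16" "17 18 19 20 21 22 X"; do
  export BEAGLE_LEASH_CHROMS="$batch"
  beagle-leash input.vcf output.vcf.gz 8   # up to 8 beagle JVMs per batch
done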

@taltman, does that seem reasonable? It certainly looks like beagle-leash was doing the right things, but only twice the speedup for 8× the parallelism suggests an inefficiency somewhere.

taltman commented 6 years ago

Hi Constantine,

Thanks for pointing out that bug in beagle-leash's default setting of BEAGLE_LEASH_CHROMS; I went ahead and fixed it and pushed the changes to the repo. It will now enumerate chromosomes 1-22 and X.

Yes, beagle-leash parallelizes by running a single chromosome per beagle instance and running multiple instances at a time. My experience has been that this is memory-limited, as each Java instance wants a chunk of memory for loading the corresponding reference DB. My suspicion is that with 8 parallel beagle processes on a machine with 30GB of RAM there might be some memory contention, but it is borderline. Being able to connect to the running image and inspect its state would be the way to go.

The beagle program has its own multi-core functionality, but from my experience watching it execute, it only uses more than one core for a small fraction of the run-time. In fact, by default beagle will try to use all available cores, which is how it currently runs under beagle-leash. I've never seen CPU contention even with multiple parallel beagle instances running.

I/O contention can be an issue too. I did most of my benchmarking on SSD-backed ephemeral EC2 storage (for input data, output files, and reference DBs alike). You might want to check whether that's a source of inefficiency in your setup.
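If it helps, something as simple as iostat (from the sysstat package) running alongside the imputation gives a coarse view of disk pressure:

iostat -x 5   # extended device stats every 5 seconds; sustained high %util or await
              # on the volume holding the reference DBs would point at I/O contention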

Please keep me posted! Good luck!

gcv commented 6 years ago

Cool. Thanks for confirming all that.

I can't find information on the storage used by ECS Fargate, but given it's completely ephemeral, I'm guessing it should be pretty fast.

After setting up the bioinformatics container with remote access, I kicked off another full imputation, and logged in. With the same 8 chromosomes at a time, it's using 5GB of physical RAM or less. On the other hand, all 4 detected (virtual?) cores are pegged at 100%.

The call to java on line 91 of run-beagle-pipeline.make sets no heap size, which should leave it at the default. I then ran this to try to figure out the JVM's defaults:

$ java -XX:+PrintFlagsFinal -version | grep 'HeapSize'
   uintx MaxHeapSize                              := 8044675072                          {product}

In human-readable numbers, that's ~8GB, which implies that each JVM process spawned by beagle-leash can allocate up to 8GB of RAM, which they're clearly not doing. As far as I can determine, each JVM process allocates 400-700MB of RAM.
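(That per-process estimate is just the resident set size reported by ps, along these lines, so it's a rough number rather than a precise heap measurement:)

ps -C java -o pid=,rss= | awk '{printf "pid %s: %.0f MB resident\n", $1, $2/1024}'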

This led me to try starting another container, this time running all imputations at once. That one sits pretty at <14GB of RAM: less than half the limit, and not the 3× increase I expected by going up from 8 imputations at a time.

Based on this evidence, I think one of the following must be true:

  1. The PrintFlagsFinal JVM flag is lying, and the actual max heap size is smaller than 8GB.
  2. Beagle is thrashing the young generation in a way that forces GC to use the heap inefficiently (see https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html).
  3. Fargate has some weird IO characteristics affecting Beagle.
  4. Beagle is CPU-limited.

To do something about 1 and 2, beagle-leash needs to allow tuning JVM flags. One easy way to do this is to add support for a JAVA_OPTS environment variable and inject its value into the java ... command which starts Beagle.
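A sketch of what I have in mind (JAVA_OPTS is a name I'm proposing, not something beagle-leash supports today, and the beagle arguments below are illustrative rather than copied from run-beagle-pipeline.make):

# Caller tunes the JVM once via the environment...
export JAVA_OPTS="-Xms2g -Xmx4g"
# ...and the make recipe splices it into the existing java invocation, roughly:
java ${JAVA_OPTS:-} -jar beagle.jar gt=input.vcf ref=chr1.ref out=imputed-chr1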

3 is possible, but I'm not sure how to diagnose it. I'm not familiar with any tools good at keeping an eye on general system IO.

Personally, I think 4 is by far the most likely explanation. Which sucks, because speeding up the imputation significantly will require moving away from Fargate and dealing with spinning up huge EC2 instances dedicated to running the imputation containers.

taltman commented 6 years ago

I'm surprised to hear that you're observing that beagle is not using much memory. I'll have to perform some test runs on my end to see if I can reproduce the behavior I observed before.

Why are only four cores being utilized? How are you calling beagle-leash? If you have 4096 (!) cores at your disposal, why not bump up the third argument (specifying the number of cores) to 23? That should allow all chromosomes 1-22 & X to run in parallel. Perhaps that was a typo? I'm looking at the AWS Fargate docs now, and it seems that the max vCPUs for a container config is 4, which is in line with what you are saying.

What if you spun up six AWS Fargate containers, five with four chromosomes and one with three chromosomes? Would that be too complicated, not making it worth using AWS Fargate in the first place?

gcv commented 6 years ago

No typos. 4096 is Amazon weirdness meaning 4 virtual CPUs, which supposedly translates to 4 Xeon hyperthreads. See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html, search for "Task Size".

The exact beagle-leash command I used is:

"${beagle_leash}" "${input_vcf_path}" "${output_imputed_vcf_path}-tmp.gz" ${num_processes}

The first three variables expand to various paths. ${num_processes} would have been 22 (excluding Y and MT). I can confirm that this started 22 java processes.

Furthermore, the two imputation runs I did overnight, both 8×3 and 22×1, took ~7 hours to complete.

I just kicked off another imputation with 2048 "cpu". If all this works as I expect, this task should take 14 hours to run. I'll post when I know more.

That said, it's entirely possible that the usual tools for monitoring CPU and memory lie when used in the Fargate environment. For example, the 2048 "cpu" task still reports 4 cores, except the process pegs them at 50% each rather than 100% (as at 4096). This is likely an artifact of the way hyperthreads are shared. But the 2048 "cpu" task also has a max RAM limit of 16GB, which is enforced, yet /proc/meminfo still shows 32GB available. I don't have an explanation for that.

It is also possible that Beagle's behavior here has something to do with the bizarre Amazon+Fargate virtual CPU setup, and that it would behave differently on bare metal or a more traditional virtualized environment.

BTW, if you want to try running beagle in the environment I set up, I'll be happy to talk you through how to kick it off.

gcv commented 6 years ago

The imputation on a 2048 "cpu" container took just under 14 hours, or exactly twice as long as a 4096 "cpu" instance. So assuming the PrintFlagsFinal information about heap allocation is correct, Beagle is CPU-limited. At least in this environment.

See the CloudWatch log for the imputation.

aneilbaboo commented 5 years ago

The current processing time is acceptable. We can revisit this when we think that speeding this up is important.