google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

debugging straggling shards #105

Closed mclaugsf closed 6 years ago

mclaugsf commented 6 years ago

I just started using DeepVariant 0.7.0 and have gotten it to complete on a few exome runs. For the 4 exomes I tested, I used 64 shards for the make_examples step. On 3 of the exomes, make_examples took about 10-15 minutes per shard. On the 4th exome, one shard ran much longer than that; after 14 hours it was still running and I killed it manually.

Now I have moved on to a whole-genome sequencing run for testing and the same thing is happening: 63/64 shards complete in about 1 hour, but one straggling shard has been running for 22 hours.

pichuan commented 6 years ago

Hi, we have seen regions that run longer than others, but 14 or 22 hours is longer than anything I have personally seen. If you look into the log files of the make_examples shard (under $HOME/case-study/output/logs/1/), you should see lines like this:

I0815 16:59:22.392369 139972404860672 make_examples.py:825] Found 0 candidates in 1:2392001-2393000 [1000 bp] [0.08s elapsed]
I0815 16:59:22.490614 139972404860672 make_examples.py:825] Found 1 candidates in 1:2456001-2457000 [1000 bp] [0.10s elapsed]
I0815 16:59:22.616637 139972404860672 make_examples.py:825] Found 4 candidates in 1:2520001-2521000 [1000 bp] [0.12s elapsed]

If you find where the log stops, you can usually use the pattern of regions to figure out which region the shard is stuck on. Let me know if that helps. I'll also discuss with the team whether we can print more incremental logging in cases like this, to help understand what the code is doing.
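For example, a small script like this (illustrative only, not part of DeepVariant) can pull out the last region a shard logged, assuming the "Found N candidates in ..." line format shown above:

```python
import re
import sys

# Matches log lines like:
#   "... Found 4 candidates in 1:2520001-2521000 [1000 bp] [0.12s elapsed]"
REGION_RE = re.compile(r"Found \d+ candidates in (\S+:\d+-\d+)")

def last_logged_region(log_path):
    """Returns the last region mentioned in a make_examples log, or None."""
    last = None
    with open(log_path) as fh:
        for line in fh:
            match = REGION_RE.search(line)
            if match:
                last = match.group(1)
    return last

if __name__ == "__main__":
    # e.g. python last_region.py $HOME/case-study/output/logs/1/make_examples.log
    print(last_logged_region(sys.argv[1]))
```

The shard is most likely stuck on a region just past the last one printed.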

mclaugsf commented 6 years ago

Thank you - this ended up being a problem on our end with the way we restart failed jobs: our system failed to recognize that this job had died and did not restart it as it should have.

I am now running this on a whole genome and all the make_examples shards complete in about 1 hour, but call_variants has been running for 24 hours. Based on the case study here https://github.com/google/deepvariant/blob/r0.7/docs/deepvariant-case-study.md I was expecting it to take 3-4 hours. In the case study it looks like call_variants was run on 1 CPU. Then I noticed this argument:

--num_readers: Number of parallel readers to create for examples.
    (default: '8')
    (an integer)

I restarted it with 8 CPUs (it looks like it actually uses 6). But does this argument actually control threading? As my next move, I was planning to run call_variants separately on each shard from make_examples. Thanks for any tips/advice you may have.

pgrosu commented 6 years ago

The short answer is yes. That parameter gets fed to prepare_inputs(), which builds a DeepVariantInput object that produces a tf.data.Dataset. That dataset is fed (in your case) into an Estimator constructed for you, which is then processed via predict(); predict() sets up a MonitoredSession, and each Session is thread-safe by default. I believe the rest is used to monitor the sessions, but I would need to double-check that. Turtles all the way down :)
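For intuition, here is a minimal sketch of how a reader count typically turns into parallel file reads with the tf.data API. This is an illustration under my own assumptions (file pattern, compression, batch size), not DeepVariant's actual prepare_inputs():

```python
import tensorflow as tf

def make_dataset(file_pattern, num_readers=8, batch_size=512):
    """Sketch: `num_readers` controls how many example files are read in parallel."""
    filenames = tf.data.Dataset.list_files(file_pattern)
    dataset = tf.data.TFRecordDataset(filenames,
                                      compression_type="GZIP",
                                      num_parallel_reads=num_readers)
    # Batch the serialized examples and keep one batch prefetched while the
    # model is busy predicting.
    return dataset.batch(batch_size).prefetch(1)
```

Note that the threads used to evaluate the model itself are governed by TensorFlow's inter-/intra-op parallelism settings, which is a separate knob from the reader count.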

Hope it helps, ~p

pichuan commented 6 years ago

@mclaugsf Thanks for the update! 1) In terms of failed jobs -- we have also noticed that the way the case studies currently recommend running things tends to mask a failure in the middle of a run, because we pipe all output to log files. We're making some changes so that if anything fails partway through, it will be more obvious to the user; I'm still fixing a few more things, and hopefully it will come out in the next release. For now it's a good idea to check the log files and make sure previous steps finished successfully before proceeding (see the sketch at the end of this comment).

2) For call_variants, can you check your call_variants.log file and see what the lines look like? In my run of the WGS case study, it converges to something like:

I0815 18:49:08.438520 140611550078720 call_variants.py:359] Processed 113665 examples in 223 batches [0.222 sec per 100]
I0815 18:49:09.491303 140611550078720 call_variants.py:359] Processed 114177 examples in 224 batches [0.222 sec per 100]
I0815 18:49:10.535501 140611550078720 call_variants.py:359] Processed 114689 examples in 225 batches [0.221 sec per 100]

In our case study, we recommend running just one call_variants per machine. call_variants itself now utilizes multiple CPUs, so if you use top or htop to check your run, you should see it using more than one CPU.

In my previous experience, running multiple call_variants processes on the same machine tends to make the run slower. If you're running call_variants separately on each shard and you can run each of them on a different machine, that's probably ideal. If you do plan to run multiple call_variants on the same machine, keep an eye on the speed, because it will likely not scale linearly. (If you find otherwise, let me know -- I haven't tried it myself in a while.)
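Coming back to 1), a simple pre-flight check before kicking off call_variants could look like the sketch below. This is illustrative only: it checks that all expected make_examples output shards exist rather than parsing the logs, and the -NNNNN-of-NNNNN naming is an assumption about how your sharded outputs are written.

```python
import os

def missing_shards(prefix, num_shards, suffix=".gz"):
    """Returns the paths of any expected output shards that are missing."""
    expected = ["%s-%05d-of-%05d%s" % (prefix, i, num_shards, suffix)
                for i in range(num_shards)]
    return [path for path in expected if not os.path.exists(path)]

# e.g. missing_shards("output/examples.tfrecord", 64) should return []
# before moving on to call_variants.
```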

pichuan commented 6 years ago

It seems like the original question is resolved for now. If there are more questions about call_variants, feel free to open another issue. The call_variants speed may also have to do with whether Intel MKL is available on your machine, so if the speed reported in your call_variants step is much slower than expected, open another issue and we can discuss it there.

mclaugsf commented 6 years ago

Thanks, just to follow up: my call_variants step eventually finished after 1d2h31m. The last lines in my call_variants.log file read:

2018-10-17T14:30:33.396510159Z statfs 424901734400 available 4378992640 used 429280727040 total -- interval 10.0000 seconds 0 used
2018-10-17T14:30:33.786897949Z I1017 14:30:33.786412 140161207068416 call_variants.py:359] Processed 10551585 examples in 329738 batches [0.904 sec per 100]
2018-10-17T14:30:34.069377504Z I1017 14:30:34.068934 140161207068416 call_variants.py:359] Processed 10551617 examples in 329739 batches [0.904 sec per 100]
2018-10-17T14:30:34.164880374Z I1017 14:30:34.164383 140161207068416 call_variants.py:359] Processed 10551649 examples in 329740 batches [0.904 sec per 100]
2018-10-17T14:30:34.166325693Z I1017 14:30:34.166042 140161207068416 call_variants.py:361] Done evaluating variants
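As a sanity check, the throughput in those last lines is consistent with the wall time:

```python
# ~10.55M examples at 0.904 seconds per 100 examples:
examples = 10_551_649
sec_per_100 = 0.904
total_hours = examples * sec_per_100 / 100 / 3600
print(round(total_hours, 1))  # ~26.5 hours, i.e. roughly the 1d2h31m observed
```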

depristo commented 6 years ago

Hi @mclaugsf,

Let me give you a bit more context here on the runtime of call_variants. call_variants is the deep learning component of DeepVariant, so it relies on TensorFlow to execute the inception_v3 model used to evaluate our genotype likelihoods. In the 0.7 case study, make_examples creates 5,847,041 genomic tensors that need to be evaluated. When executing using CPUs, TensorFlow by default uses all of the available cores on the machine. So in our case study, which runs on a 64 core machine, we are using all 64 cores to evaluate these tensors.

So a rough estimate of the core-hours needed for the DeepVariant WGS case is:

64 cores * 205 minutes of runtime ~= 219 core-hours ~= 9 single-core days

So if you are running on a machine with a single core, you should see call_variants take ~9 days. This is a bit of an over-estimate because 64 cores isn't 64x more efficient than 1 core.

Based on your 1 day turnaround I'd guess you are running on a machine with 8 cores.
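Spelling out that arithmetic (rough numbers only):

```python
core_hours = 64 * 205 / 60           # ~219 core-hours for the 0.7 WGS case study
single_core_days = core_hours / 24   # ~9.1 days on a single core (an over-estimate)
hours_on_8_cores = core_hours / 8    # ~27 hours, i.e. about the 1 day you saw
```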

Note these numbers assume you are using a modern CPU with AVX and related instruction sets. Not having those can increase the runtime by ~4x or so.

Also I want to ask: in your original post, are you processing exomes? If so, are you providing a capture-regions BED to make_examples? Normally an exome produces < 100k examples (contrast that with 5.8M in a whole genome), so the runtime should be roughly 60x less on an exome. That means instead of ~9 days on a single core, you are looking at ~3.5 hours.
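And the same estimate scaled down to an exome (the counts are rough, so treat this as a ballpark):

```python
wgs_examples = 5_847_041
exome_examples = 100_000                    # "< 100k examples" upper bound
speedup = wgs_examples / exome_examples     # ~58x, call it ~60x
exome_single_core_hours = (64 * 205 / 60) / speedup  # ~3.7 hours on one core
```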

mclaugsf commented 6 years ago

> When executing using CPUs, TensorFlow by default uses all of the available cores on the machine. So in our case study, which runs on a 64 core machine, we are using all 64 cores to evaluate these tensors.

Thank you! I didn't realize there was a component that automatically detects the number of available cores and uses all of them. Yes, I was using 8 cores.

> Also I want to ask: in your original post, are you processing exomes? If so, are you providing a capture-regions BED to make_examples?

Yes, I am currently evaluating DeepVariant on exomes and WGS samples, and I did provide a regions BED to make_examples. My original issue was about a shard that seemed to run forever (I ran each make_examples shard on its own single core with 2 GB of memory), but that was probably an issue with the instance used for that particular shard and the way our system mishandled the restart. I can now get the exomes to run through DeepVariant reliably.