Hi @Asppagh,
There can be many different reasons. Your machine setup could definitely be one of the factors. Also, not all inputs take the same amount of time to run. For example, some regions in some BAMs might take longer to realign, etc. In DeepVariant, we tried to set some thresholds empirically so that even the slowest cases are hopefully not too slow. But it's always useful to learn from our users what edge cases might still cause DeepVariant to be slow.
If your input BAM file is publicly sharable, you can also point us to it, and I'm happy to give it a try and see if I can identify any reasons why it might be particularly slow.
But it's also possible that your data is not publicly available. If that's the case, to diagnose your machine setup, you can start by running on some of our publicly shared data used in https://github.com/google/deepvariant/blob/r1.1/docs/metrics.md
Specifically under: https://github.com/google/deepvariant/blob/r1.1/docs/metrics.md#how-to-reproduce-the-metrics-on-this-page
For example, you can run on our WGS BAM file gs://deepvariant/case-study-testdata/HG003.novaseq.pcr-free.35x.dedup.grch38_no_alt.bam on your cluster using Singularity, and see what runtime you're getting.
And, one more question that will help us provide better support: do you know if make_examples finished running on your machine? If so, how long did it take, and on how many cores? If make_examples finished, what are the runtimes of call_variants and postprocess_variants? (One possible issue we've seen before is that the call_variants stage is slow when users run on CPUs without acceleration.)
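For example, if you capture the run_deepvariant output to a file (the path below is only an example, e.g. by appending 2>&1 | tee /tmp/all.log to the command), the time summary of each stage can be pulled out afterwards. A minimal sketch:

# prints the real/user/sys lines that each stage's `time` wrapper emits
grep -E '^(real|user|sys)' /tmp/all.log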
Hi @pichuan, thanks for the response. As our data is not publicly available, I tried to use the benchmark data from the following link to get a better comparison: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-exome-case-study.md In the issue below, you mentioned that the "sec per 100" value in the log is about 0.67 sec on your hardware configuration, which should scale roughly proportionally with my machine's configuration (I use an 8-core machine with 64 GB of memory). https://github.com/google/deepvariant/issues/74
But unfortunately that is not the case on my machine. I got much higher times across different runs, from 20 seconds to sometimes 1 minute per 100. Surprisingly, Singularity does not use the full 64 GB of memory made available to it. I am also confused that setting num_shards to the number of cores (8 in my case) makes the process even slower than setting it to 1. May I kindly ask how much memory Singularity normally uses?
Thanks @Asppagh . Sorry it took me a while to get back to this.
From your response above, you're talking about the call_variants step.
The issue (https://github.com/google/deepvariant/issues/74) you cited above was from a while ago, so the numbers might have changed. But I think it's still a good reference point. Even with 8 cores, going from < 1sec to 20-60 sec does seem extreme.
Let me try to get an 8-core, 64 GB machine and test it with Singularity, then report back so we can have a better comparison.
Another thing that comes to mind is that the call_variants step uses TensorFlow, which relies on CPU optimizations to be fast. See: https://google.github.io/deepvariant/posts/2019-04-30-the-power-of-building-on-an-accelerating-platform-how-deepVariant-uses-intels-avx-512-optimizations/
Can you check what type of CPU you have? (What's in your /proc/cpuinfo?)
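A quick way to check, just as a sketch, is to look at the flags line:

# call_variants is typically much faster when avx2 / avx512f / fma show up here
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(avx|avx2|avx512f|fma|sse4_1|sse4_2)$'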
One more note on num_shards:
num_shards affects how the make_examples step is parallelized. At a high level it shouldn't affect the call_variants step. However, setting num_shards to 8 means the output of make_examples will be written as 8 shards (8 files), so it's possible that reading the input into call_variants becomes slightly slower, but I don't expect a big difference. And I would expect setting num_shards=1 to make the make_examples step much slower.
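For reference, num_shards=N just means the examples are split across N files; the "@N" spec in the command line expands to per-shard file names following the usual sharded-file convention. A small illustration of that naming (not DeepVariant code, just the pattern):

N=8
for i in $(seq 0 $((N - 1))); do
  printf 'make_examples.tfrecord-%05d-of-%05d.gz\n' "$i" "$N"
done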
Thanks @pichuan. I run on a cluster with different machines, which are pretty similar; for example, one of the machines that I use has 8 cores, each with the CPU info below:
cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
stepping        : 4
microcode       : 0x200004d
cpu MHz         : 2095.078
cache size      : 16896 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl tsc_reliable nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm 3dnowprefetch arat
bogomips        : 4190.15
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
What I compared was not only the call_variants step but also the make_examples step. To make it clear, I enclose part of the log here; it uses slightly different settings and a different example, but it shows what I described in my previous comment. In this case I did not specify the number of CPU cores, and the results are slightly better than when I set the number of cores to 8.
stdout of the process: input file S-001701867.markdup.bam
I0622 13:05:17.760246 47710258629632 run_deepvariant.py:313] Creating a directory for intermediate results in /output/intermediate_results_dir
I0622 13:05:17.867540 47710258629632 run_deepvariant.py:405] Creating a directory for logs in /output/logs
I0622 13:05:17.933148 47710258629632 run_deepvariant.py:227] Creating a make_examples runtime by region directory in /output/logs/make_examples_runtime_by_region
* Intermediate results will be written to /output/intermediate_results_dir in docker.
Running the command: ( time seq 0 0 | parallel -q --halt 2 --line-buffer /opt/deepvariant/bin/make_examples --mode calling --ref "/ref/GRCh38_full_analysis_set_plus_decoy_hla.fa" --reads "/input/S-001701867.markdup.bam" --examples "/output/intermediate_results_dir/make_examples.tfrecord@1.gz" --runtime_by_region "/output/logs/make_examples_runtime_by_region/make_examples_runtime@1.tsv" --gvcf "/output
I0622 13:06:39.176360 47468847029248 genomics_reader.py:223] Reading /input/S-001701867.markdup.bam with NativeSamReader
I0622 13:06:39.193307 47468847029248 make_examples.py:648] Preparing inputs
I0622 13:06:39.256251 47468847029248 genomics_reader.py:223] Reading /input/S-001701867.markdup.bam with NativeSamReader
I0622 13:06:40.155634 47468847029248 make_examples.py:648] Common contigs are ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
I0622 13:06:53.526016 47468847029248 make_examples.py:648] Starting from v0.9.0, --use_ref_for_cram is default to true. If you are using CRAM input, note that we will decode CRAM using the reference you passed in with --ref
2021-06-22 13:06:53.527661: I third_party/nucleus/io/sam_reader.cc:662] Setting HTS_OPT_BLOCK_SIZE to 134217728
I0622 13:06:54.980819 47468847029248 genomics_reader.py:223] Reading /input/S-001701867.markdup.bam with NativeSamReader
I0622 13:06:55.028589 47468847029248 genomics_reader.py:223] Reading /input/S-001701867.markdup.bam with NativeSamReader
I0622 13:06:55.547895 47468847029248 make_examples.py:648] Writing examples to /output/intermediate_results_dir/make_examples.tfrecord-00000-of-00001.gz
I0622 13:06:55.548219 47468847029248 make_examples.py:648] Writing gvcf records to /output/intermediate_results_dir/gvcf.tfrecord-00000-of-00001.gz
I0622 13:06:56.858054 47468847029248 make_examples.py:648] Overhead for preparing inputs: 17 seconds
I0622 13:06:59.468765 47468847029248 make_examples.py:648] 0 candidates (0 examples) [2.61s elapsed]
I0622 13:07:29.596613 47468847029248 make_examples.py:648] 100 candidates (107 examples) [30.13s elapsed]
I0622 13:07:35.695144 47468847029248 make_examples.py:648] 202 candidates (222 examples) [6.10s elapsed]
I0622 13:07:43.305665 47468847029248 make_examples.py:648] 300 candidates (325 examples) [7.61s elapsed]
I0622 13:07:47.552723 47468847029248 make_examples.py:648] 411 candidates (450 examples) [4.25s elapsed]
I0622 13:07:50.840119 47468847029248 make_examples.py:648] 505 candidates (550 examples) [3.29s elapsed]
I0622 13:07:56.351229 47468847029248 make_examples.py:648] 607 candidates (667 examples) [5.51s elapsed]
I0622 13:08:05.931827 47468847029248 make_examples.py:648] 701 candidates (761 examples) [9.58s elapsed]
I0622 13:08:13.084163 47468847029248 make_examples.py:648] 801 candidates (865 examples) [7.15s elapsed]
I0622 13:08:28.955715 47468847029248 make_examples.py:648] 900 candidates (968 examples) [15.87s elapsed]
I0622 13:08:45.557140 47468847029248 make_examples.py:648] 1000 candidates (1072 examples) [16.60s elapsed]
I0622 13:09:13.513435 47468847029248 make_examples.py:648] 1102 candidates (1203 examples) [27.96s elapsed]
I0622 13:09:22.751119 47468847029248 make_examples.py:648] 1201 candidates (1310 examples) [9.24s elapsed]
I0622 13:09:33.049602 47468847029248 make_examples.py:648] 1301 candidates (1416 examples) [10.30s elapsed]
I0622 13:10:47.812749 47468847029248 make_examples.py:648] 1407 candidates (1532 examples) [74.76s elapsed]
I0622 13:10:54.033380 47468847029248 make_examples.py:648] 1500 candidates (1642 examples) [6.22s elapsed]
I0622 13:11:08.331253 47468847029248 make_examples.py:648] 1601 candidates (1758 examples) [14.11s elapsed]
I0622 13:11:50.602126 47468847029248 make_examples.py:648] 1700 candidates (1868 examples) [42.27s elapsed]
I0622 13:12:51.198980 47468847029248 make_examples.py:648] 1800 candidates (1974 examples) [60.60s elapsed]
I0622 13:13:41.480815 47468847029248 make_examples.py:648] 1901 candidates (2086 examples) [50.28s elapsed]
I0622 13:14:08.031210 47468847029248 make_examples.py:648] 2004 candidates (2199 examples) [26.55s elapsed]
I0622 13:14:17.643794 47468847029248 make_examples.py:648] 2100 candidates (2297 examples) [9.61s elapsed]
I0622 13:14:53.920969 47468847029248 make_examples.py:648] 2201 candidates (2413 examples) [36.28s elapsed]
I0622 13:15:11.488110 47468847029248 make_examples.py:648] 2302 candidates (2514 examples) [17.57s elapsed]
I0622 13:16:23.523479 47468847029248 make_examples.py:648] 2400 candidates (2618 examples) [72.04s elapsed]
I0622 13:16:32.704340 47468847029248 make_examples.py:648] 2507 candidates (2731 examples) [9.18s elapsed]
I0622 13:16:48.337418 47468847029248 make_examples.py:648] 2600 candidates (2830 examples) [15.63s elapsed]
I0622 13:16:54.819246 47468847029248 make_examples.py:648] 2703 candidates (2939 examples) [6.48s elapsed]
I0622 13:16:57.243167 47468847029248 make_examples.py:648] 2803 candidates (3052 examples) [2.42s elapsed]
I0622 13:17:06.981576 47468847029248 make_examples.py:648] 2900 candidates (3157 examples) [9.74s elapsed]
I0622 13:17:19.268150 47468847029248 make_examples.py:648] 3005 candidates (3262 examples) [12.29s elapsed]
I0622 13:17:22.910652 47468847029248 make_examples.py:648] 3105 candidates (3362 examples) [3.64s elapsed]
I0622 13:17:29.084241 47468847029248 make_examples.py:648] 3203 candidates (3462 examples) [6.17s elapsed]
I0622 13:17:41.754681 47468847029248 make_examples.py:648] 3304 candidates (3565 examples) [12.67s elapsed]
I0622 13:18:01.142505 47468847029248 make_examples.py:648] 3408 candidates (3678 examples) [19.39s elapsed]
I0622 13:18:50.555421 47468847029248 make_examples.py:648] 3501 candidates (3777 examples) [49.41s elapsed]
I0622 13:19:02.680837 47468847029248 make_examples.py:648] 3605 candidates (3885 examples) [12.13s elapsed]
I0622 13:19:49.162517 47468847029248 make_examples.py:648] 3701 candidates (3987 examples) [46.48s elapsed]
I0622 13:20:18.441009 47468847029248 make_examples.py:648] 3802 candidates (4094 examples) [29.28s elapsed]
I0622 13:20:31.614595 47468847029248 make_examples.py:648] 3900 candidates (4198 examples) [13.17s elapsed]
I0622 13:20:40.005722 47468847029248 make_examples.py:648] 4001 candidates (4303 examples) [8.39s elapsed]
I0622 13:21:18.061423 47468847029248 make_examples.py:648] 4100 candidates (4415 examples) [37.98s elapsed]
I0622 13:21:48.321910 47468847029248 make_examples.py:648] 4206 candidates (4527 examples) [30.26s elapsed]
I0622 13:22:06.011651 47468847029248 make_examples.py:648] 4300 candidates (4633 examples) [17.69s elapsed]
I0622 13:22:48.510961 47468847029248 make_examples.py:648] 4402 candidates (4744 examples) [42.50s elapsed]
I0622 13:23:11.198972 47468847029248 make_examples.py:648] 4500 candidates (4863 examples) [22.69s elapsed]
I0622 13:23:26.363482 47468847029248 make_examples.py:648] 4600 candidates (4968 examples) [15.16s elapsed]
I0622 13:23:57.098380 47468847029248 make_examples.py:648] 4703 candidates (5073 examples) [30.73s elapsed]
I0622 13:25:04.580688 47468847029248 make_examples.py:648] 4804 candidates (5184 examples) [67.48s elapsed]
I0622 13:26:17.576950 47468847029248 make_examples.py:648] 4900 candidates (5292 examples) [73.00s elapsed]
I0622 13:26:47.980118 47468847029248 make_examples.py:648] 5000 candidates (5409 examples) [30.40s elapsed]
I0622 13:27:24.147203 47468847029248 make_examples.py:648] 5107 candidates (5546 examples) [36.17s elapsed]
I0622 13:28:18.867597 47468847029248 make_examples.py:648] 5201 candidates (5651 examples) [54.72s elapsed]
I0622 13:29:07.534196 47468847029248 make_examples.py:648] 5302 candidates (5787 examples) [48.67s elapsed]
I0622 13:29:54.694571 47468847029248 make_examples.py:648] 5400 candidates (5891 examples) [47.16s elapsed]
I0622 13:30:13.158172 47468847029248 make_examples.py:648] 5502 candidates (6000 examples) [18.46s elapsed]
I0622 13:31:08.142216 47468847029248 make_examples.py:648] 5601 candidates (6123 examples) [54.98s elapsed]
I0622 13:31:37.655090 47468847029248 make_examples.py:648] 5700 candidates (6226 examples) [29.51s elapsed]
I0622 13:32:15.005650 47468847029248 make_examples.py:648] 5805 candidates (6333 examples) [37.35s elapsed]
I0622 13:33:04.347265 47468847029248 make_examples.py:648] 5900 candidates (6430 examples) [49.34s elapsed]
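For comparison, here is a rough way to summarize those per-100-candidate "elapsed" times, assuming the output above was saved to a file (make_examples.log is just a placeholder name):

grep -oE '[0-9]+\.[0-9]+s elapsed' make_examples.log \
  | sed 's/s elapsed//' \
  | awk '{ sum += $1; n++; if ($1 > max) max = $1 }
         END { if (n) printf "n=%d  mean=%.1fs  max=%.1fs per 100 candidates\n", n, sum/n, max }'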
I got an 8-vCPU, 52 GB RAM n1-highmem-8 machine to test:
gcloud compute instances create "${USER}-test-speed" \
--scopes "compute-rw,storage-full,cloud-platform" \
--image-family "ubuntu-1804-lts" \
--image-project "ubuntu-os-cloud" \
--machine-type "n1-highmem-8" \
--zone "us-west2-b" \
--boot-disk-size 100G \
--min-cpu-platform "Intel Skylake"
On the machine, I installed Singularity:
curl https://gist.githubusercontent.com/pichuan/7840c8ba80ad31fee9d6f8bea20edb6a/raw/cbf62eb2ea2f141351801db76781d99d04704b4e/install_singularity_3.7.0.sh | \
sed -e s'|github.com/sylabs|github.com/hpcng|' | \
bash -x
Here's the version:
pichuan@pichuan-test-speed:~$ singularity --version
singularity version 3.7.0
I followed: https://github.com/google/deepvariant/blob/r1.1/docs/deepvariant-exome-case-study.md to download the data.
And then:
mkdir -p output
mkdir -p output/intermediate_results_dir
# Pull the image.
BIN_VERSION=1.1.0
singularity pull docker://google/deepvariant:"${BIN_VERSION}"
# Run DeepVariant.
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
docker://google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref reference/GRCh38_no_alt_analysis_set.fasta \
--reads input/HG003.novaseq.wes_idt.100x.dedup.bam \
--regions input/idt_capture_novogene.grch38.bed \
--output_vcf output/HG003.output.vcf.gz \
--output_gvcf output/HG003.output.g.vcf.gz \
--num_shards $(nproc) \
--intermediate_results_dir output/intermediate_results_dir 2>&1 | tee /tmp/all.log
I'll paste part of the log of each step so that you can compare.
make_examples speed is roughly:
I0622 21:19:25.373434 140610510067456 make_examples.py:648] Task 7/8: 4900 candidates (5187 examples) [27.99s elapsed]
I0622 21:19:35.260825 139809239041792 make_examples.py:648] Task 1/8: 4809 candidates (5065 examples) [32.79s elapsed]
I0622 21:19:37.868103 139727062120192 make_examples.py:648] Task 2/8: 4900 candidates (5208 examples) [37.92s elapsed]
I0622 21:19:37.739557 139786800707328 make_examples.py:648] Task 6/8: 5100 candidates (5441 examples) [29.08s elapsed]
I0622 21:19:44.484720 140667007305472 make_examples.py:648] Task 5/8: 4902 candidates (5241 examples) [37.78s elapsed]
Here are the last few lines from the log:
I0622 21:24:34.005878 140667007305472 make_examples.py:648] Task 5/8: Created 6240 examples
I0622 21:24:38.061186 139897026688768 make_examples.py:648] Task 4/8: 5906 candidates (6318 examples) [17.72s elapsed]
I0622 21:24:43.683619 140528910345984 make_examples.py:648] Task 0/8: 5700 candidates (6127 examples) [24.04s elapsed]
I0622 21:24:44.784906 139897026688768 make_examples.py:648] Task 4/8: 6002 candidates (6422 examples) [6.72s elapsed]
I0622 21:24:46.344424 139897026688768 make_examples.py:648] Task 4/8: Found 6004 candidate variants
I0622 21:24:46.344626 139897026688768 make_examples.py:648] Task 4/8: Created 6424 examples
I0622 21:24:58.252706 140528910345984 make_examples.py:648] Task 0/8: 5800 candidates (6229 examples) [14.57s elapsed]
I0622 21:25:03.072478 140528910345984 make_examples.py:648] Task 0/8: Found 5825 candidate variants
I0622 21:25:03.072677 140528910345984 make_examples.py:648] Task 0/8: Created 6254 examples
Here is a snapshot of htop while make_examples is running:
1 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] 5 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
2 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] 6 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
3 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] 7 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
4 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%] 8 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[||||||||||||||||||||||||||||||||| 5.45G/51.0G] Tasks: 67, 77 thr; 8 running
Swp[ 0K/0K] Load average: 8.09 7.59 4.84
Uptime: 01:04:14
make_examples took:
real 32m36.929s
user 253m55.294s
sys 1m1.715s
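As a back-of-the-envelope check (not something DeepVariant prints), dividing user time by real time estimates how many cores were kept busy on average, which here works out to roughly the 8 cores of the machine:

# user ~= 253m55s, real ~= 32m37s, taken from the `time` output above
awk 'BEGIN { printf "~%.1f cores busy on average\n", (253 * 60 + 55.3) / (32 * 60 + 36.9) }'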
At the beginning of call_variants, I see:
2021-06-22 21:25:07.709510: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-22 21:25:07.731210: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2000170000 Hz
2021-06-22 21:25:07.731675: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5e87820 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-06-22 21:25:07.731713: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-06-22 21:25:07.734891: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
which confirms that I'm using AVX optimization.
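If you kept the combined log (e.g. via the tee /tmp/all.log above), a quick way to check the same line on your side is:

grep -m1 'cpu_feature_guard' /tmp/all.log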
The log for call_variants is pretty short, because WES has fewer examples to run on. My call_variants log looks like this:
I0622 21:25:17.009006 140301916206848 saver.py:1293] Restoring parameters from /opt/models/wes/model.ckpt
I0622 21:25:24.567713 140301916206848 call_variants.py:454] Processed 1 examples in 1 batches [1678.300 sec per 100]
I0622 21:26:59.442872 140301916206848 call_variants.py:454] Processed 15001 examples in 30 batches [0.744 sec per 100]
I0622 21:28:34.156948 140301916206848 call_variants.py:454] Processed 30001 examples in 59 batches [0.688 sec per 100]
I0622 21:30:08.158901 140301916206848 call_variants.py:454] Processed 45001 examples in 88 batches [0.667 sec per 100]
I0622 21:30:37.846297 140301916206848 call_variants.py:458] Processed 49760 examples in 98 batches [0.663 sec per 100]
I0622 21:30:37.846524 140301916206848 call_variants.py:461] Done calling variants from a total of 49760 examples.
real 5m34.074s
user 32m1.122s
sys 0m32.211s
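Note that the very first "sec per 100" figure includes model loading, so the later lines are the ones worth comparing. A quick way to pull them from a saved log (path assumed from the tee above):

grep 'sec per 100' /tmp/all.log | tail -n 3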
Note that the CPU usage for call_variants seems to vary more than for make_examples. Not all 8 CPUs are at 100% all the time, but I did observe that all 8 CPUs were used. RAM also wasn't used that much (maybe about 5 GB when I looked). I didn't take a snapshot of htop here.
The postprocess_variants log here is pretty short, so I'll paste it below:
***** Running the command:*****
( time /opt/deepvariant/bin/postprocess_variants --ref "reference/GRCh38_no_alt_analysis_set.fasta" --infile "output/intermediate_results_dir/call_variants_output.tfrecord.gz" --outfile "output/HG003.output.vcf.gz" --nonvariant_site_tfrecord_path "output/intermediate_results_dir/gvcf.tfrecord@8.gz" --gvcf_outfile "output/HG003.output.g.vcf.gz" )
I0622 21:30:41.675962 140597045393152 postprocess_variants.py:1061] Using sample name from call_variants output. Sample name: HG003
2021-06-22 21:30:41.680945: I deepvariant/postprocess_variants.cc:88] Read from: output/intermediate_results_dir/call_variants_output.tfrecord.gz
2021-06-22 21:30:42.033617: I deepvariant/postprocess_variants.cc:103] Total #entries in single_site_calls = 49760
I0622 21:30:42.624573 140597045393152 postprocess_variants.py:1120] CVO sorting took 0.015735054016113283 minutes
I0622 21:30:42.625385 140597045393152 postprocess_variants.py:1122] Transforming call_variants_output to variants.
I0622 21:30:42.625963 140597045393152 postprocess_variants.py:1149] Merging and writing variants to VCF and gVCF.
I0622 21:30:42.627370 140597045393152 genomics_writer.py:176] Writing output/HG003.output.vcf.gz with NativeVcfWriter
I0622 21:30:42.628537 140597045393152 genomics_writer.py:176] Writing output/HG003.output.g.vcf.gz with NativeVcfWriter
I0622 21:31:27.396288 140597045393152 postprocess_variants.py:1169] Finished writing VCF and gVCF in 0.7461712757746378 minutes.
I0622 21:31:27.415654 140597045393152 genomics_reader.py:223] Reading output/HG003.output.vcf.gz with NativeVcfReader
real 0m53.037s
user 0m47.191s
sys 0m6.700s
@Asppagh Can you check whether this is in line with what you're seeing? It's expected for the make_examples step to run for about 30+ minutes on an 8-core instance.
If anything above differs significantly from what you're seeing, let me know.
Hi @pichuan,
make_examples: real 44m32.310s, user 292m40.886s, sys 3m37.117s
call_variants: real 15m54.223s, user 63m3.052s, sys 0m45.200s
postprocess_variants: real 18m19.080s, user 0m45.307s, sys 0m5.393s
I ran multiple times on different machines in our cluster. The best runtime I got is shown here, and it is on a 32-core machine; however, I believe the number of cores does not change the runtime significantly. RAM usage is also very low, less than 8 GB. Could you please clarify what (RAM/CPU) can expedite each step? In this case the results come in less than 2 hours, but on our real dataset it takes 1-2 days to get results for each sample. Do you have any idea how I can get a reasonable time, for example less than 12 hours? I appreciate your help.
Hi @Asppagh , from your numbers, it seems like your make_examples isn't too different from my run.
I'm most surprised by your postprocess_variants. It seems like your user time for postprocess_variants is similar to mine (less than a minute), yet the real wall time is 18 minutes? Given that I don't know enough about your system setup: do you have any guess why this would be the case? At a high level I wonder if most of the time is somehow blocked on file I/O, but it still seems extreme.
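One generic way to test the file I/O theory (not specific to DeepVariant) would be to watch the I/O-wait column while postprocess_variants is running, for example:

# run in a second terminal during the postprocess_variants step;
# a consistently high "wa" column points at slow storage rather than CPU
vmstat 5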
Hi @Asppagh I'm closing this issue given that I haven't had more information from you. If you have more information to help us reproduce this issue, please feel free to reopen!
I am kind of facing the same problem and was wondering if you found a solution for this.
Best!
I am using DeepVariant on our cluster using Singularity. I was able to set up Singularity and get it running. My problem now is that the process is pretty slow: it takes more than a day for just one sample, and allocating more CPU cores and memory does not help. Increasing the number of shards does not resolve it and even makes it worse. Could someone please give me some clue on how to find a solution for this issue?