google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Shared library from the Exome Case Study requires AVX support -- not all users might have this #21

Closed pgrosu closed 6 years ago

pgrosu commented 6 years ago

Hi,

I was going through the Exome Case Study and noticed that I was not getting any TFRecord-formatted files from the Run make_examples step. I then dug deeper, and I'm listing my debugging steps here in case they help others. The gist of it is that the shared libraries included in the zip files (from the Google Storage location) are built with AVX support, which not every CPU provides. It would be great if they were compiled for a bare-minimum CPU feature set, to guarantee they will work on most users' machines. In any case, below is my analysis:

$ PYTHONPATH=. /usr/bin/python deepvariant/make_examples.py --mode calling --ref /home/paul/exome-case-study/input/data/hs37d5.fa.gz --reads /home/paul/exome-case-study/input/data/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam --examples /home/paul/exome-case-study/output/HG002.examples.tfrecord@1.gz --regions /home/paul/exome-case-study/input/data/refseq.coding_exons.b37.extended50.bed --task 0
Illegal instruction (core dumped)
$
$ mkdir make-examples && cd make-examples
$ unzip ~/exome-case-study/input/bin/make_examples.zip
$ cd runfiles/genomics
$
$ PYTHONPATH=. /usr/bin/python deepvariant/make_examples.py --mode calling --ref /home/paul/exome-case-study/input/data/hs37d5.fa.gz --reads /home/paul/exome-case-study/input/data/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam --examples /home/paul/exome-case-study/output/HG002.examples.tfrecord@1.gz --regions /home/paul/exome-case-study/input/data/refseq.coding_exons.b37.extended50.bed --task 0
Illegal instruction (core dumped)
$

After digging a bit deeper, I noticed that loading the pileup_image_native module was causing this issue. I was curious and looked at the assembly instructions:

$ gdb -ex r --args python -c "from deepvariant.python import pileup_image_native"
Program received signal SIGILL, Illegal instruction.
0x00007ffff5d308b4 in google::protobuf::DescriptorPool::Tables::Tables() ()
   from /home/paul/make-examples/runfiles/genomics/deepvariant/python/../../_solib_k8/libexternal_Sprotobuf_Uarchive_Slibprotobuf.so
(gdb) disassemble $pc,$pc+32
Dump of assembler code from 0x7ffff5d308b4 to 0x7ffff5d308d4:
=> 0x00007ffff5d308b4 <_ZN6google8protobuf14DescriptorPool6TablesC2Ev+676>:     vpxor  %xmm0,%xmm0,%xmm0
   0x00007ffff5d308b8 <_ZN6google8protobuf14DescriptorPool6TablesC2Ev+680>:     lea    0x1b0(%rbx),%rax
   0x00007ffff5d308bf <_ZN6google8protobuf14DescriptorPool6TablesC2Ev+687>:     movl   $0x0,0x1b0(%rbx)
   0x00007ffff5d308c9 <_ZN6google8protobuf14DescriptorPool6TablesC2Ev+697>:     movq   $0x0,0x1b8(%rbx)
End of assembler dump.
(gdb)

I noticed the vpxor instruction, which made me wonder whether my CPU supports AVX, so I checked as follows:

$ grep flags /proc/cpuinfo
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes hypervisor lahf_lm
$

This confirmed for me that I don't have AVX support. Since these case studies will drive usage and many users will learn from them, it would be great if the provided libraries were compiled for a bare-minimum CPU feature set. I think that would make it a bit easier for many users to adopt this nice pipeline.

Thanks, Paul

pichuan commented 6 years ago

Hi Paul, thanks for mentioning this issue. I looked at our documentation and noticed that an update was made to our README a few days ago with this extra description:

Pre-built binaries are available at gs://deepvariant/. These are compiled to use SSE4 and AVX instructions, so you'll need a CPU (such as Intel Sandy Bridge) that supports them. (The file /proc/cpuinfo lists these features under "flags".)

But it seems like this new information hasn't been synced to the external GitHub docs yet. This should come out in the new year at the latest.

I suspect we'd like to keep the pre-built binaries optimized, but we will at least add that line of disclaimer so that it's clear what the binaries are built for. Would it be OK for you to build DeepVariant for your CPU by following Building and testing DeepVariant, or do you need pre-built binaries without AVX from us?

pgrosu commented 6 years ago

Hi Pichuan,

That's a great idea to add it to the README, as it's probably the first thing users see. For those who might miss its importance in the README, it probably would not hurt to add a check to the GCS zip-specific make_examples.py that asserts the appropriate flags are present in /proc/cpuinfo and terminates with a gentle message (a rough sketch follows the link below). I also just noticed it mentioned here as well:

https://github.com/google/deepvariant/blob/r0.4/docs/deepvariant-release-notes.md#040
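
A minimal sketch of such a guard (purely hypothetical, not part of DeepVariant; the required flags and the wording are illustrative):

# Hypothetical startup check: exit with a clear message instead of crashing
# with SIGILL when the prebuilt binaries need CPU features the host lacks.
def require_cpu_flags(required=("sse4_1", "sse4_2", "avx")):
    flags = set()
    with open("/proc/cpuinfo") as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    missing = [flag for flag in required if flag not in flags]
    if missing:
        raise SystemExit(
            "These prebuilt binaries require the CPU features [%s], which "
            "/proc/cpuinfo does not report on this machine. Please build "
            "DeepVariant from source instead." % ", ".join(missing))

require_cpu_flags()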

Thank you for the offer regarding the customized binaries, but it only took a few minor changes and I have it working now. I was just mentioning it in case others run into the issue and wonder why it exited.

Thanks, `p

pichuan commented 6 years ago

Thanks Paul. I added an internal bug to track your suggestion. DeepVariant still has quite a lot of error messages with room for improvement. Thanks for reporting this to us so we can itemize them and improve over time.

pgrosu commented 6 years ago

Hi Pichuan,

It's a team effort :) Time is on our side as it's still sub-1.0. I think you can make it much easier and more fun for yourself, as there is another level of NGS analysis that can be tapped here. Currently the code is essentially a wrapper around TensorFlow, and much of the input feature-set definition has changed to the seven [0,254]-ranged ImageRow channels -- [base, base_quality, ..., matches_ref, op_len] -- which is not really an RGB image anymore; this is then run through the model to emit the three classes of predicted GT probabilities (homozygous reference, heterozygous, and homozygous alternative):

genotype_probabilities: 0.9999428988
genotype_probabilities: 1.8287e-05
genotype_probabilities: 3.88142e-05
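
As a toy illustration (a hypothetical helper, not DeepVariant code), picking the most likely class from those three probabilities looks like:

# Map the three genotype probabilities to the most likely genotype class.
GENOTYPE_CLASSES = ("hom_ref", "het", "hom_alt")

def most_likely_genotype(probabilities):
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return GENOTYPE_CLASSES[best], probabilities[best]

print(most_likely_genotype([0.9999428988, 1.8287e-05, 3.88142e-05]))
# -> ('hom_ref', 0.9999428988)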

The VCF output is helpful, but the interesting part is abstracting out the data and functionals (i.e. models) to enable a larger analysis platform, rather than staying file-focused -- which (dynamic) datasets would allow. Maybe tomorrow you want to try Inception-v4 or some other more expressive network topology, which might require changes to your input tensor. I suspect things will move toward an ensemble-model approach, where you want to push your data through multiple pipelines for comparison. At that point you're not even writing code anymore, but rather directing transformations among (ephemeral) datasets through an intermediate analysis language (DSL) to try out different ideas on a collection of data -- some of which might contain, or become, gold standards. For example, you already have collections of transformed datasets streamlined as protobuf messages, which can be filter-inspected as JSON for validation purposes -- something you can do today, for instance with CallVariantsOutput among others:

variant {
  reference_bases: "A"
  alternate_bases: "C"
  calls {
    info {
      key: "AD"
      ...
      call_set_name: "Sample_Diag-excap51-HG002-EEogPU"
  }
  end: 1115835
  reference_name: "1"
  start: 1115834
  ...
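
A rough sketch of that kind of JSON inspection (the proto module path here is an assumption and may differ between releases):

# Inspect a CallVariantsOutput record as JSON with the standard protobuf API.
from google.protobuf import json_format
from deepvariant.protos import deepvariant_pb2  # module path may vary

record = deepvariant_pb2.CallVariantsOutput()
# ... parse `record` from a call_variants output TFRecord value ...
print(json_format.MessageToJson(record))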

So what I'm humbly proposing is something more subtle and flexible. Imagine you open a genome browser and search for a specific disease. It can then have you drill down to the available samples or the variation graph. For the variation graph, it would let you see a model of transitions among samples relating the temporal behavior of subtypes of cancer. All these steps are multiple pipelines that are auto-triggered to run accordingly, or would be pre-cached results. As new samples come in, they improve the disease models or other functional analyses. This opens doors to Personalized Medicine -- for clinicians, researchers, or patients -- enabling integrated, centralized analysis across different types of datasets. You already have the protobuf data structures for initiating that, allowing your stored collection of pipelines to drive that integration and comparison, of which DeepVariant would be one of many. Thus flexibility in the form of modularity when refactoring the codebase will allow more fluid hierarchical analysis down the line.

Paul

depristo commented 6 years ago

I'm closing this issue because we aren't likely to provide prebuilt binaries without AVX instructions. One reason is that the AVX instructions are critical to efficiently evaluate our deep learning model. Another is that TensorFlow itself will soon provide prebuilt binaries with AVX instructions (https://github.com/tensorflow/tensorflow/releases).

Users who need to run DeepVariant on pre-AVX instruction chipsets should build DeepVariant from sources.
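
A rough outline of that path (script names as referenced in the Building and testing DeepVariant doc at the time; details may have changed since):

$ git clone https://github.com/google/deepvariant.git
$ cd deepvariant
$ ./build-prereq.sh       # install build dependencies
$ ./build_and_test.sh     # build and test DeepVariant for the local CPU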

pgrosu commented 6 years ago

Hi Mark,

I understand, but it would be nice to dig deeper and understand under what circumstances the ~20% improvement is observed, as there might be bigger optimization opportunities. If you add the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimizations, that will beat AVX by 50% and AVX2 by 27% (using an Inception v3 model), as shown here -- I reshaped the table for batch size 32, as it's not properly formatted on that TensorFlow page:

Optimization   Data Format   Images/Sec (step time)   Intra threads   Inter threads
MKL            NCHW          10.3 (3,104ms)           4               1
AVX2           NHWC          7.5 (4,255ms)            4               0
AVX            NHWC          5.1 (6,275ms)            4               0
SSE3           NHWC          2.8 (11,428ms)           4               0

Yes, TensorFlow is targeting AVX pre-built binaries two versions from now (1.6), but that is why I was suggesting modularity for DeepVariant as a workflow/pipeline where users can plug in other frameworks and/or models besides just TensorFlow (e.g. MXNet, Caffe, etc.) and Inception/MobileNet/ResNet-50/etc. The more flexible and approachable the analysis pipeline is, the faster its community will grow while minimizing the support requirements.

Best, Paul

depristo commented 6 years ago

Actually, I believe what you are asking for partially exists w.r.t. TensorFlow and our models. If you look at modeling.py you can see we aren't tied to InceptionV3 directly. We have trained DeepVariant models with MobileNet and ResNet, for example.

And we aren't tied to a specific version of TensorFlow. You can build your own optimized version of TensorFlow from scratch with whatever configuration options you like (AVX, AVX2, MKL) and use that to get all of the performance benefit of your native chipset. I suspect that production-grade deployments of DeepVariant will use custom-built, optimized TensorFlow wheels to maximize performance. In fact, with sufficiently advanced orchestration capabilities it would be possible to manage a family of TensorFlow wheels and conditionally load the maximally performant version for the actual machine you are running on, if you are running on a heterogeneous fleet (common both in cloud and on-prem).
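
For illustration, a from-source TensorFlow build of that era could be tuned roughly like this (an illustrative sketch; check the TensorFlow build documentation for the exact invocation for your version):

$ ./configure
$ bazel build --config=opt --config=mkl \
    --copt=-mavx --copt=-mavx2 --copt=-mfma \
    //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-*.whl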

Our challenge in providing a prebuilt binary is that we needed to come up with a reasonable default set of optimization flags, and requiring at least AVX seems reasonable to us given the performance gap (not shown in your chart) between running with and without AVX itself.

As for decoupling from TensorFlow, I can tell you that we will not devote any cycles to that effort ourselves. Perhaps others would be willing to do so, though.

AndrewCarroll commented 5 years ago

Adding an acknowledgement that this discussion eventually resulted in improvements incorporated into DeepVariant, described in the TensorFlow blog (https://medium.com/tensorflow/the-power-of-building-on-an-accelerating-platform-how-deepvariant-uses-intels-avx-optimizations-c8f0acb62344).

Mark's proposal for the method to incorporate AVX improvements is a quite accurate description of what was implemented.

pgrosu commented 5 years ago

Thank you for the acknowledgement, but more importantly, as scientists we should make the experiment complete so that the results reflect an equivalent comparison. Let's dig a little deeper:

  1. In the article you are right that AVX-512 would give you the ability to "operate on more information at once", but have you tried a test where you compiled DeepVariant with just -mavx512* and without MKL? Let's look at the following article:

    https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

The increased throughput (though significant) via vectorized functions is only one aspect of the optimizations. I suspect you picked MKL for multiple reasons, one of which is that it auto-queries the CPU and dispatches to instruction-set-specific code paths, saving users from needing multiple binaries (among many other reasons):

https://software.intel.com/en-us/mkl-linux-developer-guide-instruction-set-specific-dispatching-on-intel-architectures
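
For example, MKL's runtime dispatching can even be pinned explicitly through an environment variable (an MKL feature independent of DeepVariant; the script name below is just a placeholder):

$ MKL_ENABLE_INSTRUCTIONS=AVX2 python run_model.py   # cap MKL code paths at AVX2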

  2. Yes, Mark's proposal is accurate with AVX, but try running with just AVX-512 optimizations -- not everyone has access to CPUs that support them -- and without MKL, and I think you can surmise the results.

To drive the point home, look at the code references in TensorFlow for AVX-512 vs MKL:

Now having said that, what do you think could be done to make DeepVariant even faster besides AVX/MKL/CUDA/TPU optimizations?