google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

High-Level Questions about "Why DeepVariant" Section #165

Closed: cwarden45 closed this issue 5 years ago

cwarden45 commented 5 years ago

Hi,

I genuinely appreciate that you are providing a lot of detailed information for your free program.

However, I have some high-level questions / concerns:

1) You mention scoring highly in the PrecisionFDA challenge, but I don't see the term "DeepVariant" or "Google" anywhere. I apologize if I'm missing something that a lot of people might know, but would you mind explaining how the results in the PrecisionFDA results table match DeepVariant?

2) While it has been hard for me to explain myself precisely, I have been concerned that there was somehow over-fitting for some datasets reporting extremely high accuracy. For the PrecisionFDA challenge, I think it should probably be mentioned that there are multiple programs with high percentages, including different workflows that actually use the same variant caller (like GATK), and that no strategy was "best" for all the criteria defined.

3) While I can't point to a precise way in which the training and test datasets might lack independence, I think there are other benchmarks that better match what I would expect for more typical results.

For example, in this paper, it says "In the high-confidence regions, when comparing these pipelines to each other (https://precision.fda.gov/jobs/job-FJpqBP80F3YyfJG02bQzPJBj, link immediately accessible by requesting an account), they agreed on 99.7% of SNVs and 98.7% of indels. Outside the high-confidence regions (https://precision.fda.gov/jobs/job-FJpqJF80F3YyXqz6Kv8Q1BQK), they agreed with each other on only 76.5% of SNVs and 78.7% of indels."

4) You mention "No filtering is needed beyond setting your preferred minimum quality threshold." While I don't currently have much first-hand experience (although that is on my long-term "to-do" list), I didn't think this was true. For example, unless I am missing something, this table reports a very high "Failed Filters" count for DeepVariant versus GATK:

https://www.nature.com/articles/s41598-018-36177-7/tables/1

Do you have other metrics with comparisons on completely independent samples?

Thank you very much for your help!

Sincerely, Charles

AndrewCarroll commented 5 years ago

Hello Charles,

Let me take your questions point by point

1) The PrecisionFDA entry is rpoplin-dv42. That corresponds to a substantially earlier version of DeepVariant, but the core elements (deep neural network classification of examples) are the same. Ryan Poplin and Mark DePristo were at Verily at the time, and have since transferred to Google Brain (Verily is partially owned by Google).

2) In the PrecisionFDA Truth Challenge, the HG002 truth set was characterized by NIST, but these calls were fully withheld. As a result, this challenge represents the only time that a well-characterized genome was hidden from all developers, and it offered a unique opportunity to avoid over-fitting. To preserve the validity of this, we never train on HG002 in Illumina data.

As noted above, the version of DeepVariant submitted to PrecisionFDA was early, and subsequent improvements increased accuracy both on this sample and for other instruments and PCR preparations. With these accuracy improvements, DeepVariant unambiguously outperforms other entries on both SNPs and Indels:

The PrecisionFDA submission had 1689 SNP FN and 832 SNP FP. The same sample with the current version of DeepVariant has 1328 SNP FN and 749 SNP FP.

For Indels, the PrecisionFDA submission had 4175 Indel FN and 2839 Indel FP. The current version of DeepVariant has 1428 Indel FN and 924 Indel FP, a reduction in error of almost 50% compared to the most accurate Indel entry in the PrecisionFDA Truth Challenge.

The DeepVariant paper (https://www.nature.com/articles/nbt.4235) has the evaluation numbers for the first open source version and compares these results with the PrecisionFDA entries.

3) There are other good checks which can provide an indirect estimate of quality and which do not require a particular well-characterized sample. For example, you can call the same sample with GATK and DeepVariant and take the calls made by only one caller or the other. Comparing the Ti/Tv ratios of those caller-exclusive calls can tell you which caller (on average) has higher quality (indicated by a higher Ti/Tv among its singletons). We perform these evaluations internally as well and would welcome feedback about a similar analysis from you on your own samples.
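A minimal sketch of this singleton Ti/Tv check in Python (the file names and the simple biallelic-SNP parsing here are assumptions, not our internal tooling):

```python
import gzip

# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def snp_keys(vcf_path):
    """Collect (chrom, pos, ref, alt) keys for biallelic SNPs in a VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    keys = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref, alt = line.split("\t")[:5]
            if len(ref) == 1 and len(alt) == 1:  # skip indels and multi-allelic sites
                keys.add((chrom, pos, ref, alt))
    return keys

def ti_tv(keys):
    """Return the transition/transversion ratio for a set of SNP keys."""
    ti = sum(1 for (_, _, ref, alt) in keys if (ref, alt) in TRANSITIONS)
    tv = len(keys) - ti
    return ti / tv if tv else float("inf")

# Hypothetical file names: two callers run on the same sample.
gatk = snp_keys("sample.gatk.vcf.gz")
dv = snp_keys("sample.deepvariant.vcf.gz")

# A higher Ti/Tv among a caller's exclusive ("singleton") calls suggests
# that its unique calls are, on average, of higher quality.
print("GATK-only Ti/Tv: %.2f" % ti_tv(gatk - dv))
print("DeepVariant-only Ti/Tv: %.2f" % ti_tv(dv - gatk))
```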

4) When DeepVariant evaluates a candidate, it can call it as a homozygous variant or a heterozygous variant, or indicate that although there is evidence for a variant at the position, it believes the true call for the position is reference (0/0). In the paper referenced, I believe these reference calls were counted as failing a filter. However, it is not the case that these are variant calls that were made and had to be removed; directly taking the genotype for each call would arrive at the same number of variants. In effect, these were not really variant calls to filter. They were rows in the VCF that already did not indicate variation.
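For what it's worth, DeepVariant marks these reference calls with a RefCall value in the FILTER column, so they are easy to tally without any real filtering. A minimal sketch (the file name is a placeholder):

```python
import collections
import gzip

def filter_counts(vcf_path):
    """Tally FILTER values in a VCF; DeepVariant emits PASS and RefCall."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = collections.Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                counts[line.split("\t")[6]] += 1  # column 7 is FILTER
    return counts

# Hypothetical file name for a DeepVariant output VCF.
print(filter_counts("sample.deepvariant.vcf.gz"))
# RefCall rows already carry a reference (0/0) genotype, so dropping them
# does not remove any variant call that DeepVariant actually made.
```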

We would be enthusiastic to collaborate with you to benchmark DeepVariant against other methods on your own samples with various preparations and coverages if you like.

Thanks, Andrew

cwarden45 commented 5 years ago

Hi Andrew,

Thank you very much for your kind and prompt response.

For point 1 (and part of point 2), thank you very much for that additional information. I think my question was a simpler one: you can see multiple submissions from some groups, and I was trying to see if I understood all the submissions that used DeepVariant. Since there is only one "rpoplin" label (and no other entries from Verily Life Sciences), I'll assume that is the only DeepVariant benchmark (in contrast to there being multiple groups using GATK, in pipelines that gave varying results). I am also assuming that no one else was using DeepVariant at that time. However, please correct me if I am wrong.

For point 2, I apologize: it is bad form to critique something without having tested it yourself. I sometimes worry that the frequent use of deep learning reflects its current popularity (and that many applications may not remain in common use in the long term), but I need to assess each situation individually. So, I am very sorry about my tone in my initial message.

Because of this post, I am now using DeepVariant as a way to practice some new skills in my free time (such as using cloud computing options), but that makes it harder for me to provide a timely response. While the practice is something that I would like to gain on my own (I believe that I will lose some intuition about the results if I don't run the analysis myself), you are certainly welcome to work with any of the underlying data that I have uploaded to my PGP page.

For point 3, I apologize that I need to take more time to read other papers carefully before citing them. For example, I have pretty much always seen a drop in accuracy for indels versus SNPs. However, when filtering for regions expected to have good concordance, the loss in indel accuracy wasn't as bad as you might expect from Figure 4 in that paper (although the concordance in off-target regions was noticeably lower). That said, I think comparisons with my own data are in the ballpark of what they were describing in what I quoted above, although I would have expected the indel accuracy to be in between (perhaps something more like 90-95%). Nevertheless, even though I am trying to learn more about DeepVariant, I apologize: I should have taken more time to consider my phrasing.

For point 4, I think I understand your response, but that will probably be more clear as I do some testing with DeepVariant.

Thank you again for your time and detailed response. So, please feel free to close this ticket.

Sincerely, Charles

cwarden45 commented 5 years ago

Hi Again,

I’ve had a chance to do some testing with my own samples, so I thought I should report back:

Going back to point 4, I noticed that you could choose to output a gVCF file in DeepVariant. However, that is not what is done by default (and it’s not what I did), and I can also tell that’s probably not what they did either. Namely, there are over 3 billion base pairs in the human genome: if you add up the passed filters and failed filters for DeepVariant, then that is still only about 9 million sites.

However, I wanted to wait until I had my own DeepVariant results before I said anything else, since it is always possible there may have been something that I wasn't expecting in the VCF from DeepVariant.

So, in terms of additional feedback:

5) I compared Exome versus WGS variant calls from DeepVariant on Google Cloud (v0.7.2, using code similar to this), both with a script that I wrote and with precisionFDA. My script filters out "complex variants" (positions with more than one variant), but .vcf files containing those variants weren't flagged by precisionFDA (even though I did have to do some file re-processing).
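As a rough illustration, the "complex variant" filtering step was along these lines (a simplified sketch, not my actual script; the file names are made up, and I am treating "complex" as multi-allelic ALT entries here):

```python
import gzip

def drop_complex_variants(in_vcf, out_vcf):
    """Keep only records with a single alternate allele at each position."""
    opener = gzip.open if in_vcf.endswith(".gz") else open
    with opener(in_vcf, "rt") as src, open(out_vcf, "w") as dst:
        for line in src:
            # Keep header lines, plus any record whose ALT column holds one allele.
            if line.startswith("#") or "," not in line.split("\t")[4]:
                dst.write(line)

# Hypothetical file names.
drop_complex_variants("exome.deepvariant.vcf.gz", "exome.simple.vcf")
```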

So, starting with my test (checking recovery of my CDS Exome variants in my WGS dataset), the recovery numbers were very close to what I originally expected (98-99% for SNPs, ~90% for indels):

Provided Exome on Provided WGS .bam Alignment:

68759 / 72556 (94.8%) full SNP recovery
71276 / 72556 (98.2%) partial SNP recovery
3027 / 3648 (83.0%) full insertion recovery
3413 / 3648 (93.6%) partial insertion recovery
3119 / 3911 (79.7%) full deletion recovery
3596 / 3911 (91.9%) partial deletion recovery

BWA-MEM Exome on WGS (for only on-target alignments):

51417 / 54229 (94.8%) full SNP recovery
53116 / 54229 (97.9%) partial SNP recovery
1964 / 2391 (82.1%) full insertion recovery
2242 / 2391 (93.8%) partial insertion recovery
2058 / 2537 (81.1%) full deletion recovery
2349 / 2537 (92.6%) partial deletion recovery

For comparison, you can see what is reported from precisionFDA (when running DeepVariant on my samples) for the provided .bam alignment or the BWA-MEM re-aligned comparison.

There are two numbers because a "partial" recovery just checks for the presence of a variant (either heterozygous or homozygous), while a "full" recovery also requires the genotype to match. Due to formatting differences (such as with indels), I didn't feel comfortable showing comparisons between variant callers on my page with notes/code on my Genos Exome sample, but I think this is OK when using the same variant caller on different samples.
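To make the "full" versus "partial" distinction concrete, the counting logic is roughly the following (a simplified sketch; my actual script also handles indel formatting and other details):

```python
def recovery_counts(exome_calls, wgs_calls):
    """Count exome variants recovered in the WGS calls.

    Both inputs map (chrom, pos, ref, alt) -> genotype string (e.g. "0/1").
    "Partial" recovery only requires the variant to be present; "full"
    recovery also requires the genotype (zygosity) to match.
    """
    full = partial = 0
    for key, genotype in exome_calls.items():
        if key in wgs_calls:
            partial += 1
            if wgs_calls[key] == genotype:
                full += 1
    return full, partial, len(exome_calls)

# Hypothetical inputs: dictionaries parsed from the two VCFs.
exome = {("chr1", "12345", "A", "G"): "0/1", ("chr2", "6789", "T", "C"): "1/1"}
wgs = {("chr1", "12345", "A", "G"): "1/1", ("chr2", "6789", "T", "C"): "1/1"}
full, partial, total = recovery_counts(exome, wgs)
print("%d / %d full recovery, %d / %d partial recovery" % (full, total, partial, total))
```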

For comparison, the precisionFDA comparison for the provided .vcf files is here, and these are the statistics for the provided .vcf files (from my script):

39494 / 41450 (95.3%) full SNP recovery
39678 / 41450 (95.7%) partial SNP recovery

I have omitted the indel statistics from my script because Veritas used freebayes for variant calling (and I'm not converting the indel format, which causes the indel count to be quite low, presumably because most indels overlapped a homopolymer of at least 2 nucleotides). Still, maybe it is interesting that the "full" SNP recovery was higher for the provided variants, but the "partial" SNP recovery was higher for DeepVariant? The precisionFDA comparison (for my samples) shows better results for DeepVariant than the provided VCF files (but there are going to be considerably fewer indels in the overall counts).

Going back to point 4, running DeepVariant with both my provided .bam alignment (after combining the per-chromosome alignments) and BWA-MEM re-aligned reads resulted in a set of ~7 million variant calls in each case (which is closer to the total number reported in that other study for DeepVariant).

However, if you emphasize precision, perhaps that somewhat matches my filtered set (in terms of the SNP counts and more noticeable indel differences for homozygous/heterozygous recovery). Namely, I can obtain higher percentages with my script if I use GATK HaplotypeCaller (with --dontUseSoftClippedBases) + VariantFiltration (similar to what is shown here, for the Exome file but not WGS, and using GATK version 3.x instead of 4.x) and filter for on-target reads, as shown below:

20765 / 21141 (98.2%) full SNP recovery
20872 / 21141 (98.7%) partial SNP recovery
243 / 258 (94.2%) full insertion recovery
249 / 258 (96.5%) partial insertion recovery
208 / 228 (91.2%) full deletion recovery
213 / 228 (93.4%) partial deletion recovery

That being said, maximizing those recovery statistics does decrease the total number of variants, particularly the indels. Nevertheless, the precisionFDA comparison for that GATK quality-filtered variant set indicates increased precision for the RefSeq CDS variants (but decreased sensitivity) compared to DeepVariant (or an unfiltered GATK variant set).

6) If you go back to the original precisionFDA results, I think this is also consistent with HG002: if you go to "explore results" and select "func_cds" (which is where I would expect most of the clinically actionable variants to be), DeepVariant is not top-ranked for F-score, Recall, or Precision (although the values are high, like the other variant calling results).

I also created a discussion group on precisionFDA asking about how the "truth" set was defined. They mentioned that they focused on regions where variants could be called most confidently (genome-wide), and I'm assuming that is why most of the numbers are so high. Otherwise, the numbers are more in the range of what I saw when using the same WGS sample and variant caller (DeepVariant) while only changing the alignment, which isn't really an independent verification (matching my original concern that the reported percentages seemed unrealistically high).

In other words, I would say DeepVariant performed well, along with the other strategies tested. I genuinely believe it is good to have a variety of freely available programs to use, but I believe the "winner" designations from the precisionFDA challenge can be a bit misleading (even though, to be fair, they do color similarly high percentages, they still award a designation to only one group per category).

If I were the winner of a precisionFDA challenge, I would probably want to mention that somewhere. However, I don't typically see sections like "Why DeepVariant" at the top of most program READMEs. So, along with some observations about run-time and cost, I think it may respectfully be worth considering trimming back some of that information (while continuing to provide excellent support in the issues section of GitHub!).

7) My understanding is that there is not a DeepVariant App on precisionFDA. I think they use AWS, and I may be able to create something unofficial using code similar to my AWS test (relating to issues #166 and #167). However, perhaps at some point, you could consider offering something that can be more officially (and better) supported by the DeepVariant developers? This would be free to users (since the FDA is covering the costs of using the DNAnexus-based interface), but there are some unique differences (for example, I had to change the chromosome formatting for my .vcf files, and there was an issue with my Veritas WGS header that I had to fix).
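For reference, the chromosome reformatting was along these lines (a simplified sketch with made-up file names; the Veritas header fix was a separate step):

```python
import gzip

def add_chr_prefix(in_vcf, out_vcf):
    """Prefix bare chromosome names (1, 2, ..., X, Y, MT) with 'chr'."""
    opener = gzip.open if in_vcf.endswith(".gz") else open
    with opener(in_vcf, "rt") as src, open(out_vcf, "w") as dst:
        for line in src:
            # Note: any ##contig header lines may also need their IDs updated.
            if line.startswith("#") or line.startswith("chr"):
                dst.write(line)
            else:
                dst.write("chr" + line)

# Hypothetical file names.
add_chr_prefix("veritas_wgs.vcf.gz", "veritas_wgs.chr.vcf")
```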

I am currently uploading my .fastq files (the .bam alignments are up there and public, but I think the chr format may cause issues with variant calling comparisons). However, all relevant information for these two samples will be publicly available on precisionFDA (from my charles.warden account).

You don’t have to re-open the ticket, but I would certainly welcome any feedback / thoughts that you might have.

Sincerely, Charles