google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Deepvariant fails to recognize PacBio BAM index file #666

Closed Npaffen closed 1 year ago

Npaffen commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yes.

Describe the issue: I followed the PACBIO example and also added the flags from this issue.

This error indicates that DeepVariant is not able to find an index file for the BAM file, but the index file is there, see:

  (base) ✔ /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam[master L|…5] 
  22:08 $ ls
  GFX.bam  GFX.bam.pbi  GFX_hg19.bam  GFX_hg19.bam.pbi   readlength.txt  tmp

I also re-indexed the file using pbindex from pbbam. As one can see from the ls output, I also tried to realign the BAM file to another reference using pbmm2.

Setup

(base) ✔ /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam [master L|…5]                                                                                                                                                            
20:46 $ samtools flagstat GFX.bam 
940551 + 0 in total (QC-passed reads + QC-failed reads)
881297 + 0 primary
0 + 0 secondary
59254 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
940551 + 0 mapped (100.00% : N/A)
881297 + 0 primary mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Steps to reproduce:

singularity exec --bind /usr/lib/locale/,/media/nils/nils_ssd_01/Genomics_prac_guide/reference/unsorted/,/media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/ docker://google/deepvariant:1.5.0 /opt/deepvariant/bin/run_deepvariant --model_type PACBIO --ref /media/nils/nils_ssd_01/Genomics_prac_guide/reference/unsorted/hg19.fa --reads /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/GFX.bam --output_vcf /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/GFX.vcf.gz --num_shards 22 --dry_run=false --make_examples_extra_args='sort_by_haplotypes=false,parse_sam_aux_fields=false'

- Error trace: 

The error message, including the pipeline call, is:

   E::idx_find_and_load] Could not retrieve index file for '/media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam'
2023-06-20 22:48:40.272832: W third_party/nucleus/io/sam_reader.cc:131] Unknown tag pb: in header line, ignoring: @HD   VN:1.6  SO:coordinate   pb:5.0.0
2023-06-20 22:48:40.272881: W third_party/nucleus/io/sam_reader.cc:174] Unknown tag BC: in RG line, ignoring: @RG   ID:408781da/26--26  PL:PACBIO   DS:READTYPE=CCS;BINDINGKIT=101-894-200;SEQUENCINGKIT=101-826-100;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000;BarcodeFile=m64023e_230515_162401.barcodes.fasta;BarcodeHash=86d73e586a6d3ede0295785b51105eea;BarcodeCount=96;BarcodeMode=Symmetric;BarcodeQuality=Score    LB:Pool_18_20_GFX0455704_GFX    PU:m64023e_230515_162401    SM:GFX
PM:SEQUELII BC:TGACTGTAGCGAGTAT CM:S/P5-C2/5.0-8M
2023-06-20 22:48:40.272889: W third_party/nucleus/io/sam_reader.cc:174] Unknown tag CM: in RG line, ignoring: @RG   ID:408781da/26--26  PL:PACBIO   DS:READTYPE=CCS;BINDINGKIT=101-894-200;SEQUENCINGKIT=101-826-100;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000;BarcodeFile=m64023e_230515_162401.barcodes.fasta;BarcodeHash=86d73e586a6d3ede0295785b51105eea;BarcodeCount=96;BarcodeMode=Symmetric;BarcodeQuality=Score    LB:Pool_18_20_GFX0455704_GFX    PU:m64023e_230515_162401    SM:GFX
PM:SEQUELII BC:TGACTGTAGCGAGTAT CM:S/P5-C2/5.0-8M
I0620 22:48:40.273113 140449601988416 genomics_reader.py:222] Reading /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam with NativeSamReader
I0620 22:48:40.273825 140449601988416 make_examples_core.py:257] Task 8/22: Writing examples to /tmp/tmpwjk24y8t/make_examples.tfrecord-00008-of-00022.gz
I0620 22:48:40.274013 140449601988416 make_examples_core.py:257] Task 8/22: Overhead for preparing inputs: 0 seconds
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 196, in <module>
    app.run(main)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/absl_py/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/absl_py/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 186, in main
    make_examples_core.make_examples_runner(options)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 2183, in make_examples_runner
    runtimes) = region_processor.process(region)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1285, in process
    sample_reads = self.region_reads_norealign(
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1376, in region_reads_norealign
    reads = itertools.chain(reads, sam_reader.query(region))
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/third_party/nucleus/io/genomics_reader.py", line 247, in query
    return self._reader.query(region)
  File "/tmp/Bazel.runfiles_ztv_d7ra/runfiles/com_google_deepvariant/third_party/nucleus/io/sam.py", line 250, in query
    return self._reader.query(region)
ValueError: FAILED_PRECONDITION: Cannot query without an index
  parallel: This job failed:
    /opt/deepvariant/bin/make_examples --mode calling --ref /media/nils/nils_ssd_01/Genomics_prac_guide/reference/GRCh37/hs37d5.fa --reads /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam --examples /tmp/tmpwjk24y8t/make_examples.tfrecord@22.gz --add_hp_channel --alt_aligned_pileup diff_channels --max_reads_per_partition 600 --min_mapping_quality 1 --noparse_sam_aux_fields --partition_size 25000 --phase_reads --pileup_image_width 199 --norealign_reads --nosort_by_haplotypes --track_ref_reads --vsc_min_fraction_indels 0.12 --task 1

Does the quick start test work on your system? Please test with https://github.com/google/deepvariant/blob/r0.10/docs/deepvariant-quick-start.md: Yes, the quick start works. This issue also happens when I try to run the pipeline with Docker only.

**Any additional context:**
pgrosu commented 1 year ago

@Npaffen Could you try generating .bai index files for the BAM files with samtools index? I'm not sure the htslib in Nucleus recognizes the PacBio .pbi index files.
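
A minimal sketch of that step (assuming samtools is on your PATH; your @HD line above already shows SO:coordinate, so the BAM should be coordinate-sorted):

    samtools index GFX.bam   # writes GFX.bam.bai next to the BAM
    ls GFX.bam*              # should now list GFX.bam, GFX.bam.bai and GFX.bam.pbi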

Npaffen commented 1 year ago

I will do that. So there is no difference between a PacBio .pbi index file and a .bai file produced by samtools index?

Npaffen commented 1 year ago

This worked, but now another problem appeared that seems to be related to this one. It seems the samtools index does not work:

Current thread 0x00007f283e6c9740 (most recent call first):
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 976 in _make_allele_counter_for_region
[E::fai_retrieve] Failed to retrieve block: unexpected end of file
2023-06-21 07:21:53.632147: F ./thirdparty/nucleus/vendor/statusor.h:231] Non-OK-status: status status: INVALID_ARGUMENT: Couldn't fetch bases for reference_name: "1" start: 2295000 end: 2330000
Fatal Python error: Aborted

Current thread 0x00007fe7f6689740 (most recent call first):
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 976 in _make_allele_counter_for_region
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1548 in candidates_in_region
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1318 in process
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 976 in _make_allele_counter_for_region
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1548 in candidates_in_region
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1318 in process
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 2183 in make_examples_runner
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 186 in main
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/absl_py/absl/app.py", line 258 in _run_main
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/absl_py/absl/app.py", line 312 in run
  File "/tmp/Bazel.runfiles_rexajgra/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 196 in <module>
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1548 in candidates_in_region
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 1318 in process
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 2183 in make_examples_runner
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 186 in main
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/absl_py/absl/app.py", line 258 in _run_main
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/absl_py/absl/app.py", line 312 in run
  File "/tmp/Bazel.runfiles_qz56zi29/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 196 in <module>
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/com_google_deepvariant/deepvariant/make_examples_core.py", line 2183 in make_examples_runner
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 186 in main
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/absl_py/absl/app.py", line 258 in _run_main
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/absl_py/absl/app.py", line 312 in run
  File "/tmp/Bazel.runfiles_tfz52rn8/runfiles/com_google_deepvariant/deepvariant/make_examples.py", line 196 in <module>
parallel: This job failed:
/opt/deepvariant/bin/make_examples --mode calling --ref /media/nils/nils_ssd_01/Genomics_prac_guide/reference/GRCh37/hs37d5.fa --reads /media/nils/nils_ssd_01/Calling/HiFI_sequencing/data/bam/GFX.bam --examples /tmp/tmpcawuh1vp/make_examples.tfrecord@22.gz --add_hp_channel --alt_aligned_pileup diff_channels --max_reads_per_partition 600 --min_mapping_quality 1 --noparse_sam_aux_fields --partition_size 25000 --phase_reads --pileup_image_width 199 --norealign_reads --nosort_by_haplotypes --track_ref_reads --vsc_min_fraction_indels 0.12 --task 0

real    0m19,089s
user    0m15,039s
sys 0m1,066s
pgrosu commented 1 year ago

@Npaffen BAI and PBI have similar concepts, but different approaches. BAI is an R-tree approach, while PBI seems to be HDF5-like -- basically different formats.

The error you see is because your BAM and reference don't match in their sequence contig names. For instance, the reference sequence uses the chromosome naming convention chr1, chr2, etc., but in the error I see 1 as the name to match against. So if you type the following command for your BAM file:

samtools view -H GFX.bam | grep @SQ

You should see something like this, with the same naming convention in the SN field:

@SQ     SN:chr1 LN:249250621
@SQ     SN:chr2 LN:243199373
@SQ     SN:chr3 LN:198022430
@SQ     SN:chr4 LN:191154276
@SQ     SN:chr5 LN:180915260
@SQ     SN:chr6 LN:171115067
@SQ     SN:chr7 LN:159138663
@SQ     SN:chr8 LN:146364022

Also, I'm assuming the reference you pass to DeepVariant is the same reference you aligned your reads against when generating your BAM files.
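
One quick way to compare the two (a sketch, assuming the reference has been indexed with samtools faidx so a .fai file sits next to it; substitute whichever reference you pass to --ref):

    samtools view -H GFX.bam | grep '^@SQ' | head   # contig names/lengths in the BAM header
    cut -f1,2 hg19.fa.fai | head                    # contig names/lengths in the reference index

The SN names and LN lengths should line up between the two.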

Let me know if there is something I should expand on.

Thanks, ~p

Npaffen commented 1 year ago

Hey @pgrosu, thanks for your quick reply. It looks like I indeed used the wrong reference file for the original BAM file. Sadly, the header of the original BAM file did not contain any information about which reference the reads were aligned to. I found the correct reference file and the pipeline finished successfully. Two uncertainties are left regarding the PACBIO model:

1) The called variants are not phased. If I understood the pipeline correctly, phasing should be part of DeepVariant's PACBIO mode since v1.1, right? Is there any specific parameter one needs to supply to activate this? Is it good enough to just phase the VCF output of the pipeline I stated in the thread opener with WhatsHap?

2) Some of the calls seem to be HomRef (0/0) and some have missing genotypes (./.) despite being marked as RefCall in the FILTER column. Is this normal behavior for PACBIO data? From my experience I would have expected not to observe HomRef or missing calls in a VCF, but rather in a gVCF. Am I missing something important here?

     #CHROM   POS ID REF ALT QUAL  FILTER INFO             FORMAT                      GFX
      1:      1 10031  .   T   C  0.0 RefCall    . GT:GQ:DP:AD:VAF:PL   0/0:29:7:4,3:0.428571:0,30,36
      2:      1 10037  .   T   C  0.8 RefCall    . GT:GQ:DP:AD:VAF:PL     ./.:8:7:3,4:0.571429:0,7,18
      3:      1 10043  .   T   C  0.1 RefCall    . GT:GQ:DP:AD:VAF:PL   ./.:17:7:5,2:0.285714:0,16,30
      4:      1 10055  .   T   C  2.1 RefCall    . GT:GQ:DP:AD:VAF:PL     ./.:4:7:3,4:0.571429:0,2,16
      5:      1 10067  .  TA   T  0.0 RefCall    . GT:GQ:DP:AD:VAF:PL   0/0:20:7:3,2:0.285714:0,19,32
     ---                                                                                             
3912752:     MT  7331  .   C   A 68.8    PASS    . GT:GQ:DP:AD:VAF:PL        1/1:65:43:0,43:1:68,67,0
3912753:     MT  8843  .   T   C 68.1    PASS    . GT:GQ:DP:AD:VAF:PL        1/1:65:37:0,37:1:68,68,0
3912754:     MT  8860  .   A   G 67.9    PASS    . GT:GQ:DP:AD:VAF:PL        1/1:63:40:0,40:1:67,64,0
3912755:     MT 12005  .  TG   T  0.0 RefCall    . GT:GQ:DP:AD:VAF:PL 0/0:65:43:37,6:0.139535:0,65,74
3912756:     MT 15326  .   A   G 67.9    PASS    . GT:GQ:DP:AD:VAF:PL        1/1:65:37:0,37:1:67,68,0

Best

Nils

pgrosu commented 1 year ago

Hi @Npaffen,

That's great to hear that it worked! Let me answer each question individually:

1 For the PACBIO model, DeepVariant enables --phase_reads as shown below, but it is only used internally to improve accuracy by expanding the region in which candidates are found:

https://github.com/google/deepvariant/blob/r1.5/scripts/run_deepvariant.py#L255

This will not show up as phased calls in the VCF file. You are correct that you will need to use a tool like WhatsHap to update the VCF file with phasing information.
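
A rough sketch of that WhatsHap step (assuming WhatsHap, bgzip and tabix are installed; file names shortened from the command in the thread opener):

    whatshap phase -o GFX.phased.vcf --reference hg19.fa GFX.vcf.gz GFX.bam
    bgzip GFX.phased.vcf && tabix -p vcf GFX.phased.vcf.gz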

2 The output for the filter column is generated in the following way for most cases (biallelic):

2.1 call_variants generates the genotype probabilities using the PACBIO model (for this case), for the candidates generated by make_examples.

2.2 postprocessing_variants operates all of this through the add_call_to_variant() function:

2.2.1 For each variant it looks at the highest of the predicted genotype probabilities, which denotes the most likely genotype. It determines this via the most_likely_genotype(predictions, ploidy=2, n_alleles=2) function.

2.2.2 It then takes that most likely genotype and passes it to the compute_filter_fields() function to generate the appropriate filter. Basically, looking strictly at the genotype, if it sees [0, 0] (i.e. 0/0) it will label the FILTER column as RefCall.

2.2.3 Then the uncall_homref_gt_if_lowqual() function will relabel the genotype of each such variant with a GQ value < 20 (set via the --cnn_homref_call_min_gq flag) as [-1, -1], which is equivalent to the ./. genotype. This is clarified in the genotype field of the VariantCall protobuf definition in Nucleus.

Let me know if this helps clarify things.

Thanks, Paul

Npaffen commented 1 year ago

Hi @pgrosu;

thanks for the clarification. So when DeepVariant uses --phase_reads, does this mean that during processing the (long) reads get phased, rather than the variants, to improve the calling? If so, could you go into a bit more detail on this topic?

How do you guys at DeepVariant define a variant in a PacBio context? From my understanding, HomRef variants do not actually vary compared to the reference, and when I call, for example, a WGS dataset generated with Illumina short reads, I always obtain a gVCF as an intermediate file containing haplotype blocks of (non-informative) HomRef calls, while the actual VCF contains only informative calls, i.e. non-HomRef variants.

Best, Nils

AndrewCarroll commented 1 year ago

Hi @Npaffen

Yes, in read phasing, the reads are assigned a haplotag value. Briefly, in this process a set of potential variants is scored with heuristics (no neural network) on the likelihood that they are heterozygous variants. A cluster of such variants forms a candidate seed for a haplotype. The evidence from multiple reads across multiple positions is used to identify the putative variants on that haplotype, and then reads are scored based on whether they fall into one of the haplotypes, the other, or cannot be phased. Because this haplotagging uses information from much longer stretches and more candidate variants than the individual process of variant calling, it has the advantage of a broader set of information.

This haplotagging is used to populate the information in the "haplotype channel" which is one of the inputs for DeepVariant long read data. We wrote a blog describing this channel and its impact.

Note that this process is only used to provide information for the neural network to consider; the neural network is able to learn when this channel is or is not reliable based on genome context, coverage, etc. The network's call on the genotype is what finally goes into a variant.

As a result, haplotag is not used as input to generate the non-ref blocks of the gVCF, and as the final variants called are still from the neural network, the definition of a variant remains the same - a position with an ALT allele that receives a non-reference (0/0 or ./.) call.

We are currently working on a deeper description of the phasing logic used in DeepVariant, which may help understand or reproduce the haplotag method more easily.

Please let me know if anything in the explanation is unclear or can be elaborated further.

Npaffen commented 1 year ago

Thanks for the explanation @AndrewCarroll. It gave me good insight into the phasing mechanism applied! Still, there is one point I don't fully get from your answer:

position with an ALT allele that receives a non-reference (0/0 or ./.) call.

I'm not sure if I understand this part correctly. From my understanding, an observed ALT allele that receives a non-reference call is heterozygous or homozygous ALT, so either 0/1, 1/0 or 1/1. What do you mean by non-reference (0/0 or ./.)?

pgrosu commented 1 year ago

Thank you @AndrewCarroll.

@Npaffen I think Andrew meant to use a comma instead of parentheses: non-reference, 0/0 or ./. call

AndrewCarroll commented 1 year ago

Hi @Npaffen @pgrosu

My sentence was written inexactly. I was thinking of the last word after the hyphen (the reference in non-reference) and trying to clarify the reference calls. But this was more confusing rather than less.

Reference calls - 0/0 or ./.
Variant calls - 0/1, 1/1, 1/2, 0/2, etc.

Npaffen commented 1 year ago

So if one observes a reference call at some locus, the sequencer observed an actual ALT allele in some of the reads, but the majority of the reads led to a higher genotype probability for a reference call than for a variant call? And if one observes a missing genotype, the same kind of reference call had to be made, but with too much uncertainty?

pgrosu commented 1 year ago

Hi @Npaffen,

Let me give a little background before answering each question; some of this you can find in the following two blogs: Blog 1 and Blog 2. For each analysis, make_examples generates 6 matrices (usually with height 100 and width 199), with values ranging from 0-255, stacked to form an (image) tensor of shape (100 x 199 x 6). DeepVariant refers to these as channels. Each of these 6 channels contains a different representation of the read data, transformed to those value ranges.

One more thing: each of these channels contains the reference representation in its first 5 rows. This way you have the reference representation together with the alternate representation. If you have multiple alternates, then you have more images, as shown in the following PileupImageCreator() comment, to ensure the genotype is calculated with the best representation of the possibilities.

Now let me switch topics to how the model is constructed at a high level, as it is important for answering the questions. The model is a convolutional neural network (CNN) based on the Inception V3 architecture, and it was trained using a truth set of VCFs over confident regions with high-quality BAM files. A CNN basically slides different sub-matrices (i.e. kernels) within each channel to generate new summary layers of channels for further slicing or combining (pooling), with the eventual goal of matching, with high probability, a specific truth value in the VCF. The way this is done is by varying the values in these networks of kernels (sub-matrices) until the truth values are matched with high probability. This collection of trained kernels is what the PacBio model contains.

So now to answer the two questions in order:

1) If you think of the PacBio model as retaining previously seen (trained) patterns from portions of the channels, it will run the kernel transformations, pooled across the channels, to determine the genotype probabilities. The highest value will be used to Phred-scale the GQ value (see the worked example below). So it's a bit more complex, as the sub-slices are taken across a 100 x 199 region for 6 channels, rather than just a 100 x 1 column (where the majority ref-representation of alleles resides), but that also has an impact on the matrix calculations of the probabilities of the most represented genotype.

2) Now if the probability is low, the model will still assign genotype probabilities given what it has seen across the whole region and all channels, but that does not mean they will be high. This is why there is the uncall_homref_gt_if_lowqual() function to adjust for that afterwards, giving the ./. correction.
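
As a rough worked example of the Phred scaling mentioned in 1) (this is the generic Phred convention, not code quoted from DeepVariant): if the most likely genotype has probability $p = 0.998$, then $GQ \approx -10 \log_{10}(1 - p) \approx 27$, whereas $p = 0.5$ only gives $GQ \approx 3$, which is below the GQ < 20 threshold from 2.2.3 earlier, so a homref call at that confidence would end up re-labeled ./..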

Hope it helps, ~p

Npaffen commented 1 year ago

I chewed on this over the last few days, but I still have a hard time understanding why the HomRef variants and the missing genotypes are added to the VCF in the first place. Could you explain this again for me, in case you already answered it but I was unable to extract it from your answer?

pichuan commented 1 year ago

Hi @Npaffen , I'm late to this thread. If I'm missing some context please feel free to remind me.

Regarding your question "why the homref variants and the missings are added to the vcf in the first place":

DeepVariant starts with a set of candidates. These candidates came from a set of heuristics that propose a bunch of sites that potentially have variants. You can find some thresholds we use for the heuristics here: https://github.com/google/deepvariant/blob/r1.5/deepvariant/make_examples_options.py#L178-L194 which basically means - if there is an alt allele that has a certain number of reads supporting it, and a certain % of reads supporting it, it will be proposed as a potential candidate.
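
As a concrete illustration, one of these thresholds is already visible in the failing make_examples command earlier in this thread, and it can be overridden through run_deepvariant's --make_examples_extra_args (the other thresholds in the linked file work the same way):

    --make_examples_extra_args='vsc_min_fraction_indels=0.12'

Raising a fraction like this makes candidate proposal stricter; lowering it makes it more permissive.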

Then, DeepVariant applies a classifier on these candidates. To learn more about the representation that DeepVariant uses, https://google.github.io/deepvariant/posts/2020-02-20-looking-through-deepvariants-eyes/ is a good explanation. Each of the examples ("images") has a probability distribution over 3 classes. Based on the probabilities, the GT is assigned. As mentioned in previous answers, when the probability distribution from DeepVariant shows that it's not confident enough, the GT is set to ./.

By default, DeepVariant outputs the candidates even when they're classified as 0/0 or when they're set to ./.. This won't affect downstream tools like hap.py, though, because these records are not considered when tools like hap.py calculate accuracy.

What DeepVariant outputs complies with the VCF spec, which says "FILTER - filter status: PASS if this position has passed all filters, i.e., a call is made at this position." If you look at our VCF, you'll notice that all variants (not including 0/0 and ./.) should have PASS.
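
If you prefer a variants-only file for downstream work, one way to get it (a sketch, assuming bcftools and tabix are available) is simply to keep the PASS records:

    bcftools view -f PASS GFX.vcf.gz -Oz -o GFX.pass.vcf.gz
    tabix -p vcf GFX.pass.vcf.gz

This drops the RefCall rows without changing what DeepVariant itself emits.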

Hopefully this helps. @Npaffen let me know if there's anything else that's unclear.

Npaffen commented 1 year ago

Yes, this helps, and from a model perspective it makes sense to add every site to the VCF that is a potential candidate for a variant. The last question I might raise is: what is your intention in having those positions added to the VCF by default rather than as an opt-in option? To me it sounds rather counterintuitive, because I would argue that the majority of use cases do not need those positions and/or remove them in downstream analysis. But of course this is your decision and just my view on the overall concept, and I might be wrong here!

Thanks for all you help and patience so far!

AndrewCarroll commented 1 year ago

Hi @Npaffen

This is an interesting question. One advantage is that these sites give a complete accounting of all positions that the neural network acts on. This makes it easier for us to debug issues (either internal or user-reported), as we remove the ambiguity of whether a row in the VCF is absent because a candidate was not generated or because the neural network decided it was reference. From a developer perspective (whether internal or external), this is much cleaner.

There is another advantage in allowing users to determine the ideal threshold for sensitivity (e.g. if they highly value sensitivity, they may want to consider positions which are ./. calls of low GQ). These are more advanced use cases and it does seem that most users opt for simpler approaches.

I also tend to think that including these records won't confuse users, and that they would be used to variant callers using things like the FILTER field to indicate non-variant calls. But if you have a strong opinion that this will be broadly confusing, I will take that into account.

pgrosu commented 1 year ago

Hi @Npaffen,

To expand a bit on my previous explanation: the way you can view a variant coming from a neural network is through the idea of preserving $information$ $propagation$. You saw the previous description of how the different channels get encoded, but let's start with a simpler version. Let's say you have a one-line matrix with the following columns:

image

This could denote a simplified version of your read base representation, where the middle column denotes the variant and the last column the channel it represents. Now I want to create new types of data from this, which will help me identify unique areas of patterns within it. This could take the form of transforming the values into ranges in which differences among columns are easier to detect. One such transformation is dividing all the values by 10 and labeling that channel 2. In this transformation, 10 plays the role of the kernel I described previously, and the output is a new feature map (a transformed matrix that helps with detecting unique features based on the numerical representation). Now the data would look like this:

image

Now imagine I create different transformations of these rows to expand on specific areas among these values where intriguing patterns might emerge. Suppose I create 5 different transformations, giving 5 channels with multiple copies of each row, in order to have a fuller dataset that mimics the number of reads. This data is multi-dimensional, as it contains different values of X, which can be pretty hard to interpret, but we can collapse these differences to a 2D representation using t-SNE plots. For example, if I do that to this dataset, I get the following plot:

image

This helps me understand which channels (feature maps) might be similar, or different. Now if we look at the heatmap for the original data it would look something like this (notice the variant in the middle):

image

If I generate a t-SNE plot of this, I see something like the following using colors to denote the different channels (feature maps):

image

Now comes the fun part! Let's have different matrices (like the ones that generated the new channels above) that identify interesting features which might expose one type of genotype or another with confidence. Imagine that instead of dividing by 10, I find the values (matrices) that best help separate the data in a BAM file I know should have specific variants at specific loci. I want to maximize precision while retaining recall. Notice that the original data and the transformations (feature maps) are linked, preserving this $propagation$ $of$ $information$, as the flow of information is enabled through these sets of transformations. An interesting thing then begins to emerge as you move up the layers of transformation. For example, early on in the neural network's set of transformations you will see patterns like this:

image

You might notice an explosion of features, with no specific patterns. These early steps generate a large variety of features to provide selection power for the later layers, helping with the separation into distinct patterns that map confidently to the different genotype classes. For example, you can see distinct patterns forming as it reaches the later stages:

image

image

Once a pattern like the following has been achieved, one can proceed with testing each genotype's representation of the variant:

image

We want to see which genotype the set of patterns (the feature map above) maximizes for, which will indicate the genotype present with a specific maximal probability.

First we test for $homozygous$ $reference$:

image

Next we test for $heterozygous$:

image

Finally we test for $homozygous$ $alternate$:

image

Now we can see there is a significant correlation with a heterozygous variant call.

So to call a variant site's genotype you now have the power to traverse the whole neural network -- through a set of transformations (which you can tweak) that identify hidden patterns in the data -- as it is all linked from start (read encoding) to finish (genotype identification). This ability is enabled through the preservation of the propagation of information (which is deeply and directly linked), which gives one something to optimize upon. That's why you don't want to throw anything away: as @AndrewCarroll and @pichuan mentioned, it helps you inspect and tweak the selection power and criteria, since it is all interconnected.

Hope it helps, Paul

Npaffen commented 1 year ago

@AndrewCarroll and @pgrosu thanks for your view on things and @pgrosu especially for your fantastic explanation of the model!

From a data scientist's perspective this makes total sense. I fully agree that one should always gather as much data as possible, since data that does not seem useful can always be removed afterwards. My intention was not to say that one does not need those data, or that you should not use the information from which those HomRef sites emerged. I talked to some of my colleagues yesterday, and they agreed that they would expect a variant caller to produce variant calls and the information/statistics related to those, with all further information being opt-in, since they would have no need for it. But my team could just be the minority here, and I would simply recommend an opt-out option for this behavior!

Npaffen commented 1 year ago

In my downstream analysis with WhatsHap today I found that the VCF contains only 3,889,968 variants. The data comes from PacBio CCS 10x sequencing, and I wonder if this can be the correct number of SNPs for this method. E.g., I have analyzed several WGS 30x datasets from Nebula so far and saw SNP counts in the range of 4.6-4.8 million. Is there anything I missed here?

pgrosu commented 1 year ago

@Npaffen Thank you for the kind words. Maybe trying the following two approaches might show additional benefits:

1) PEPPER-Margin-DeepVariant could help rescue some of these SNPs that might reside in low-complexity regions like homopolymer, dimer and trimer repeat regions. You can read more details in this post or in the paper.

2) Another novel approach could be this new paper, which incorporates Flye/Shasta/Hapdup for de novo assembly and HapDiff to call structural variants with methylation tagging (via Guppy via Remora), combined into a variant call set in Margin, while also using PEPPER-Margin-DeepVariant for variant calling. Since you are using PacBio, this would need to be tweaked slightly. I just read that paper, so I have not tried their pipeline yet.

AndrewCarroll commented 1 year ago

Hi @Npaffen

For the lower number of variants, I would suspect that the main driver would be coverage here. 10x is a relatively lower coverage. I will try to find some time to see if we have a 10x coverage PacBio run from a sample handy to let you know how many variants we typically see at that range.

Another possibility is that samples do naturally differ in their number of variants, especially based on the ancestry of the individuals analyzed.
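
If you want to compare like with like, it may also be worth counting only PASS SNPs on both sides (a sketch, assuming bcftools is available):

    bcftools view -H -f PASS -v snps GFX.vcf.gz | wc -l

since RefCall and ./. rows, and indels, would otherwise skew the comparison depending on how the Illumina VCFs were produced.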

Npaffen commented 1 year ago

Thanks for all the input again! Is it possible to use DeepTrio with combined PacBio (parents) and low-coverage Illumina data right now?

pgrosu commented 1 year ago

@Npaffen Based on the DeepTrio paper, the model has also been trained for duo-calling. In the paper, you will notice coverage results with improvements in the lower-coverage scenarios benchmarked against GIAB -- though the lowest I see is 15x. Regarding combined results, there could be an issue with chromosome X for the non-PAR regions, as shown in the paper, but maybe the model already takes the GIAB labels for X and Y into account. To get around combined effects, you should be able to split your BAM files by sample and then use samtools merge to combine them, organizing them into BAM files by child, mother and father. In any case, you can see from the paper that the input channels are stacked between parent and child, which can hint at how mixing samples might impact the genotype detection of the model as described earlier.

Npaffen commented 1 year ago

I'm sorry, I missed a crucial part in my question. What I meant is: is it possible to use any model if the parents come from CCS PacBio and the child from Illumina WGS? Thanks anyway for your brief explanation of the overall model!

pichuan commented 1 year ago

Hi @Npaffen Currently we don't have DeepTrio models that are trained with the setting you described (parents come from CCS PacBio and the child from Illumina WGS).

pgrosu commented 1 year ago

@Npaffen As @pichuan mentioned, I would be hesitant as well to combine them, as these two models seem to have been designed with different parameters in mind, as shown here (especially the image width, haplotype sorting and alternate aligned pileups). You can get more details in the following blog on how some of these parameters are utilized to minimize errors for different models.

pgrosu commented 1 year ago

@Npaffen I noticed a hybrid model HYBRID_PACBIO_ILLUMINA where one can combine PacBio with Illumina reads. You can read more about it in the following case study. This is new to me as well, so I need to read a bit more about it first.
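
Invoking it would presumably look like the other model types (a sketch only; the merged BAM name is a placeholder and I have not tested this myself):

    /opt/deepvariant/bin/run_deepvariant --model_type HYBRID_PACBIO_ILLUMINA \
      --ref hg19.fa --reads pacbio_plus_illumina_merged.bam \
      --output_vcf out.vcf.gz --num_shards 22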

pichuan commented 1 year ago

Hi @pgrosu, the hybrid model you mentioned combines PacBio and Illumina reads of the same individual. It is a DeepVariant model, not a DeepTrio model.

With the hybrid model, you can achieve better accuracy. This model was first developed as part of our submission to https://precision.fda.gov/challenges/10/results.

However, this is not a DeepTrio model and does not fit the use case that @Npaffen described unfortunately.

Npaffen commented 1 year ago

@pgrosu @pichuan thanks for your continued help with the various questions I raised in this thread. Really helpful. How likely is it that a DeepTrio model for such a data combination will be generated in the future? I guess it is much more involved than simply training a model on such pedigrees, right?

AndrewCarroll commented 1 year ago

Hi @Npaffen

Our typical recommendation for this would be to use the DeepVariant WGS model to call variants with gVCF output for the Illumina sample, the DeepVariant PACBIO model to call variants with gVCF output for the PacBio samples, and then use GLnexus to jointly genotype the three samples together.

The variant call probabilities are well calibrated in each model individually, and so glnexus can operate on them.
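
A rough sketch of that joint step (GLnexus usage from memory, so please double-check against its documentation; the per-sample gVCF names are placeholders produced by run_deepvariant with --output_gvcf):

    glnexus_cli --config DeepVariant \
      child_illumina.g.vcf.gz mother_pacbio.g.vcf.gz father_pacbio.g.vcf.gz > trio.bcf
    bcftools view trio.bcf -Oz -o trio.vcf.gz && tabix -p vcf trio.vcf.gz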

pgrosu commented 1 year ago

Thank you @AndrewCarroll and @pichuan for the clarification. The calibration makes sense, and could be intriguing for inspecting DNN-resiliency. Having the same underlying Inception V3 network architecture for both PacBio and WGS, a point of natural comparability would be the logits kernel across all three genotypes:

image

image

image

Given visual similarity, these were confirmed via Euclidean distance (0.9931127, 0.8543731 and 1.052052, respectively). This indicates the feature set might exhibit strong similarity for interpretation.

Looking at one network (PacBio), it might be possible to confirm calibration by testing for network resiliency. Via perturbation analysis it should be possible to get insight into a channel's response under perturbation, and into the channels' binary interactions under such conditions. Keeping the variant unchanged, with a window on each side to preserve the call, each channel's vulnerability response to perturbation can be inspected. This resulted in the following perturbation response ($c\_*$ denotes a channel, and $i\_*\_*$ represents a binary interaction between two channels):

image

The above can be mapped into a network of interactions among the channels:

image

Based on the above mapping, by testing well-interacting channels through probabilistic value updates -- within DeepVariant-acceptable values -- it might be possible to check for shifts in genotype mimicking a Mendelian violation. Selecting base_quality and staying within DeepVariant's minimum acceptable value, random sampling with replacement was performed in the window outside the variant region. A shift in genotype was achieved, giving a measure of network resiliency. Other, more strongly connected channels could provide more aggressive shifts in genotype. This can be offset by restrictions on convolutional motifs within the network, or shifted to the code via threshold limitations in make_examples.

Thank you and Happy 4th of July! Paul

pichuan commented 1 year ago

Thanks @pgrosu for the analysis! If I understand correctly, the original question in the thread has been resolved. Thanks all for the follow-up discussions in this thread. I'll close this issue now so it's easier for our team to track.

pgrosu commented 1 year ago

Thank you for the kind words @pichuan. I think the original question is probably well-answered, but @Npaffen is probably a better person to confirm with.

As a nice side note, this discussion has provided me more insight into the $data \leftrightarrow model$ interactions, revealing some interesting properties of the model's internal state and interactions of substructures $-$ which might have the potential for a paper $-$ but I would need to do some more preliminary work first to confirm some ideas.

Thank you, Paul

Npaffen commented 1 year ago

Yeah, thanks to all here in this thread for the productive discussion. This helped me a lot in understanding how DeepVariant works! Good luck with the project!

pgrosu commented 1 year ago

Hi Nils,

Glad to hear it was helpful, and I enjoyed it as well! Yeah the project is pretty amazing $-$ it's incredible how the layers perform specific segmentations and amplifications:

image

Thank you and hope your research is also going well! Paul