nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com
Other
391 stars 73 forks source link

Sum of SR is higher than DP for medaka_haploid_variant output file #466

Closed mbdabrowska1 closed 7 months ago

mbdabrowska1 commented 8 months ago

I am running medaka_haploid_variant on my viral dataset and when I look at the output annotated VCF file I notice that the sum of all the reads in the SR column is a lot higher than the depth at that position (DP). As far as I'm understanding DR is all the reads at that position (reference+variant), meanwhile SR counts the reads by strand which best align to each allele so it separates them by the allele they support as well as by strand. How can SR have more reads than DP?

This is example annotated file I receive:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
K03455.1    2286    .   A   T   6.752   PASS    AR=3117,4253;DP=8154;DPS=7472,682;DPSP=31216;SC=2733774,3093599,2772118,3132438;SR=3527,4090,7775,8454  GT:GQ   1:7
K03455.1    2295    .   A   G   10.928  PASS    AR=1474,1262;DP=8156;DPS=7472,684;DPSP=31227;SC=2616252,3139157,2652581,3183650;SR=3342,4068,9605,11476 GT:GQ   1:11
K03455.1    2327    .   T   C   5.657   PASS    AR=275,360;DP=8179;DPS=7478,701;DPSP=31254;SC=2950027,3533242,2961624,3532945;SR=6445,8241,7704,8229    GT:GQ   1:6
K03455.1    2360    .   G   A   22.384  PASS    AR=379,351;DP=8199;DPS=7481,718;DPSP=31292;SC=3100238,3592675,3188468,3695060;SR=3222,3938,10833,12569  GT:GQ   1:22
K03455.1    2374    .   G   A   13.256  PASS    AR=220,698;DP=8207;DPS=7484,723;DPSP=31309;SC=3033693,3508212,3102217,3581492;SR=3262,3977,10955,12197  GT:GQ   1:13
K03455.1    2440    .   T   C   32.913  PASS    AR=252,530;DP=8233;DPS=7485,748;DPSP=30560;SC=2757616,3182227,2842424,3272936;SR=409,938,13506,14925    GT:GQ   1:33
K03455.1    2450    .   C   T   41.362  PASS    AR=187,899;DP=8232;DPS=7481,751;DPSP=30556;SC=2712765,3116148,2835446,3247670;SR=71,304,13902,15193 GT:GQ   1:41

Any help with understanding what the DP and SR values actually are would be greatly appreciated.

cjw85 commented 7 months ago

Your depth is fairly excessive. The default configuration of htslib (the interface to BAM files) sets a default number of reds that are stored whilst forming a pileup. The depth statistic is calculated from this. IIRC the default is 8000, in accord with your DP number. The other numbers are not calculated from the htslib pileup code, so they are not masked by this parameter.

mbdabrowska1 commented 7 months ago

Is it possible to know the actual depth at a cetain position, aka how many reads in the sample align at that nucleotide? Or would that just be the sum of the alleles from the SR statistic?

cjw85 commented 7 months ago

The DPSP field is the depth of spanning reads calculated from iterating over all reads spanning the region: https://github.com/nanoporetech/medaka/blob/cbf182f2aa7cd14165f02758f9386b8aed3aff4b/medaka/vcf.py#L1284

The DP field is the number of reads that were used in the calculation of the variants; capped to ~8000 due to the way the pileup is calculated when performing variant calling.

Medaka's variant calling algorithm is trained to work at depths up to 200-fold coverage.