mbdabrowska1 closed this issue 7 months ago.
Your depth is fairly excessive. The default configuration of htslib (the interface to BAM files) sets a default limit on the number of reads that are stored whilst forming a pileup. The depth statistic is calculated from this. IIRC the default is 8000, in accord with your DP number. The other numbers are not calculated from the htslib pileup code, so they are not capped by this parameter.
Is it possible to know the actual depth at a certain position, i.e. how many reads in the sample align at that nucleotide? Or would that just be the sum of the alleles from the SR statistic?
The DPSP field is the depth of spanning reads calculated from iterating over all reads spanning the region: https://github.com/nanoporetech/medaka/blob/cbf182f2aa7cd14165f02758f9386b8aed3aff4b/medaka/vcf.py#L1284
The DP field is the number of reads that were used in the calculation of the variants. It is capped at ~8000 because of the way the pileup is calculated when performing variant calling.
Medaka's variant calling algorithm is trained to work at depths up to 200-fold coverage.
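To compare the three numbers for a single record, here is a minimal Python sketch. It assumes the INFO column holds `DP`, `DPSP` and `SR`, and that `SR` is a comma-separated list ordered as forward/reverse pairs per allele (ref fwd, ref rev, alt1 fwd, alt1 rev, ...). The helper names and the example INFO string are made up for illustration; they are not part of medaka and the values are not taken from a real file.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO column into a {key: value} dict (flags map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def depth_summary(info: str) -> dict:
    """Compare DP, DPSP and the summed SR counts for one record.

    Assumes SR is laid out as (fwd, rev) pairs, one pair per allele.
    """
    fields = parse_info(info)
    sr = [int(x) for x in fields["SR"].split(",")]
    # Add forward and reverse counts together for each allele.
    per_allele = [sr[i] + sr[i + 1] for i in range(0, len(sr), 2)]
    return {
        "DP": int(fields["DP"]),
        "DPSP": int(fields["DPSP"]),
        "SR_sum": sum(sr),
        "SR_per_allele": per_allele,
    }


# Hypothetical INFO string, for illustration only.
example_info = "DP=8000;DPSP=15000;SR=100,120,7300,7400"
summary = depth_summary(example_info)
print(summary)
```

With values like these, `SR_sum` exceeds `DP` because DP is limited by the pileup cap while the SR counts are not. Whether `SR_sum` matches `DPSP` exactly depends on how reads that do not clearly support one allele are counted, so treat DPSP as the spanning depth rather than the SR total.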
I am running medaka_haploid_variant on my viral dataset, and when I look at the annotated output VCF file I notice that the sum of all the reads in the SR column is a lot higher than the depth at that position (DP). As far as I understand, DP is all the reads at that position (reference + variant), while SR counts the reads by strand which best align to each allele, so it separates them by the allele they support as well as by strand. How can SR have more reads than DP?
This is an example of the annotated file I receive:
Any help with understanding what the DP and SR values actually are would be greatly appreciated.