lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0
35 stars 2 forks source link

Columns in output file don't add up #34

Open Easybel opened 1 month ago

Easybel commented 1 month ago

Hello there!

I am running diversity on bam files to determine the diversity across the whole genome (--window-type genome) and expected the columns to add up to the total number of positions on my genome. My genome has 100.286.401 positions.

This is my code:

get the diversity for windows

$softPath/./grenedalf diversity \ --sam-path $bamPath/$ID".bam" \ --sam-min-map-qual 20 \ --reference-genome-fai $dictPath/WBcel235_ref.fasta.fai \ --pool-sizes 500 \ --window-type genome \ --window-average-policy valid-loci \ --filter-sample-min-count 2 \ --filter-sample-min-read-depth 4 \ --file-prefix $ID" genome " \ --out-dir $outPath/genome/

This is the way my output looks:

chrom | start | end | total.masked | total.missing | total.empty | total.numeric | total.invariant | total.passed | ID.missing | ID.numeric | ID.passed | ID.theta_pi | ID.theta_watterson | ID.tajimas_d -- | 0 | 0 | 0 | 0 | 253179 | 0 | 89739341 | 9458417 | 0 | 0 | 9458417 | 0.00404284457 | 0.0197056004 | -7.09882652

This is not the case and I think that I am missing the .missing positions?

Thank you!

lczech commented 1 month ago

Hi @Easybel,

ah yes, that is explained here. Please try running the analysis with --make-gapless if you want those numbers to show up as well. Then, the sum of the positions should match your reference genome length, but this should not change the resulting statistics values at all. If either of those things turn out not to be true, please let me know, as that would be a bug.

Bit of background explanation for this: By default, any position for which there is no entry in the input data is just completely ignored and not filled in for the positions counts or statistics, as it does not make a difference for the statistic (for the window averaging policy that you selected at least). That is faster than having to fill in missing positions artificially, only to then ignore them again. In the end, it is a bit of an implementation detail that leaked into the final interface - there might have been cleaner solutions for this. I have it on my list to revisit this at some point.

Hope that makes sense, and let me know if this works for you, or if you have any further questions!

Cheers Lucas