How to understand the statistics of the screen output？

biozzq commented 1 year ago

Dear @lczech,

I have four pooling sequencing libraries. I have finished SNP calling using bwa+GATK and a total of 9.8 M SNPs (after filtering for depth and missing rate) have been identified. I found that grenedalf can directly work with BAM and VCF format. So I first calculated the FST using BAM as the input. At the end of the programme, the following statistics were recorded. How to understand each statistic and which number better represents the number of SNPs? If 28568428 represents the number of SNPs, it is significantly different from 9.8 M.

Sample filter summary (summed up across all samples):
Passed:               4036079496
Empty (after counts): 32041114
Above max coverage:   263818
Total filter summary (after applying all sample filters):
Passed:                  28568428
Below min coverage:      51188538
Not SNP:                 937243195

Finished 2023-10-02 23:48:24

When using VCF as the input, the statistics looks much more normal. It seems like that 9795814 represents the number if SNPs after filtering by minor allele count > 0.

Processed 39 chromosomes with 9795814 (non-filtered) positions in 47245 windows.
Total filter summary (after applying all sample filters):
Passed:                  9795814
Not SNP:                 120324

Finished 2023-10-03 10:29:20

Sincerely, Zhuqing Zheng

lczech commented 1 year ago

Hey Zhuqing,

EDIT (2024-06-08): With the latest release of grenedalf v0.5.0, the below answer is outdated. Please refer to the up-to-date wiki instead for details on how the on screen statistics work now.

please excuse the late reply!

First, a breakdown of those numbers. The way this is implemented is that first, the per-sample filters are applied, i.e., everything that starts with --filter-sample-.... Then, on whatever remains after that, the filters are applied that work across all samples (i.e., on the sums of base counts of all samples), which are the ones that start with --filter-total-.... The statistics you are seeing there are from those two filtering steps.

The first group "Sample filter summary" gives the number of entries (per sample and per position in the genome) for which a particular filter was triggered (and hence excluded that sample at at that position from the downstream analysis). The second group "Total filter summary" is the same, but as that filter is applied across the summed counts of all samples, the numbers are simply per position.

Sample filter summary:

Empty (after counts) --> After applying --filter-sample-min-count and --filter-sample-max-count, how many samples ended up completely empty (where all base counts are zero)
Below min coverage, Above max coverage --> How many samples across all positions in the genome were filtered out due to --filter-sample-min-coverage and --filter-sample-max-coverage, respectively.
Passed --> Total number of entries (across all samples and positions again) that passed all the --filter-sample-... filters.

Total filter summary:

Below min coverage, Above max coverage --> How many positions in total were filtered out due to --filter-total-min-coverage and --filter-total-max-coverage, respectively.
Passed: How many positions in total passed all the filters. That's the number of positions in the genome that actually were use to compute fst.

There are some more numbers that might be reported, depending on which other filter options you set. But I hope that now those make sense now as well. Please excuse that this is not fully documented yet... Hope to get to work on this at some point!

Now, let's have a look at why are you seeing those numbers exactly.

VCF files contain highly processed data from some variant caller program, and typically only list the positions that this program deemed to be "variant" according to their particular statistical model (there are .g.vcf files as well that contain all positions). So, a VCF is the "essence" of whatever the variant caller program produced.
BAM files on the other hand are the output of a read mapping tool, and contain all your sequence reads, mapped against your reference genome. They hence contain data at all positions for which there is coverage in your data. When reading those files in grenedalf, we actually through all positions in the genome, and find all reads that span that position. If there are no reads, that is, if the position has 0 coverage depth, the "Empty" filter above kicks in, removing that position for downstream analyses.

Hence, what you are seeing there for the BAM file includes all positions in the genome, and not only the variant ones. BAM files are not concerned with the concept of a "variant" - they simply list bases at positions, with all sequencing errors etc. Hence, the sum of those three "Total" numbers should be somewhere in the order of the length of your reference genome: 28568428 + 51188538 + 937243195 = 1017000161. Does that roughly fit? It might not exactly fit, as empty positions at the very end of the BAM file would only be considered if you also specified one of the --reference-genome-... options, so that grenedalf could infer the length of each chromosome from that. If not provided, the last position for which the BAM file has a read would be considered the end of the chromosome.

So, if you want to use grenedalf with BAM files, the typical step is to set some of the min counts and min coverage options, to exclude samples and positions that have, for instance, a sequencing error. Without any min filters, those would lead to some non-zero count that is interpreted as a SNP otherwise. Typically, --filter-sample-min-count 2 might already be a good starting point. See https://doi.org/10.1371/journal.pone.0015925 and https://doi.org/10.1093/bioinformatics/btr589 for details on those numbers.

Alternatively, if you simply want to restrict the analyses in grenedalf to your known variants, you can use the --filter-region-vcf option with your VCF. That will simply subset everything to the positions that are listed in the VCF. Might be the easiest solution, if that's what you are aiming for.

Now you might think that using a VCF might be better then. But we highly advise against that, unless you used a variant caller that is specifically meant for pool sequencing data. If not (that is, if you just used GATK, freebayes, bcftools call, and the like), you will likely get biased and skewed calls, as those tools are not meant for pool seq data, and hence assume individual data, which is wrong. We offer to read VCF in grenedalf just in case, but typically, neither the format nor the callers are meant for pool seq data. We show this at length in https://doi.org/10.1101/2022.02.02.477408, see there for details. Which variant caller program did you use?

In summary, my recommendation is to use the BAM files, as those are the more reliable source for pool seq data, but set the count and coverage filters appropriately, or restrict to your positions of interest.

Please let me know if this all makes sense, and if you have any questions about this. We are currently working on more publications to show these things in more detail, and I'd also recommend to read the equations supplement of the grenedalf preprint (https://doi.org/10.48550/arXiv.2306.11622) for more insights.

Cheers Lucas

lczech commented 5 months ago

Hey @biozzq,

big update on this! In the latest release grenedalf v0.5.0, I have completely re-design the filtering approach, and hence also the output of the filter statistics. Please use the new version now, as it also solves several other issues related to window averaging etc that are relevant with the filter settings.

You now also find more information on the screen summary of the filters here. See the General Usage pages as well for information on all the other processing steps.

I think that should answer your original question, and hence I'm going to close the issue. If you have more questions though, feel free to re-open or start a new issue!

Cheers and so long Lucas

lczech / grenedalf

How to understand the statistics of the screen output？ #14