harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Add Quality Control Outputs #16

Closed: tsackton closed this issue 2 years ago

tsackton commented 2 years ago

This pipeline as it stands takes FASTQ files and produces BAMs plus filtered VCFs, using the GATK standard hard filters. It produces a few quality control outputs, such as basic BAM statistics and summaries like SNPs per interval, but we do not produce extensive QC reports or output.

At a minimum, we should add calculations of mappability, coverage, and callable site bed files (using customizable rules for mappability and coverage cutoffs). These will aid in filtering and are necessary for any additional downstream steps. This requires at least the following:
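One way to sketch the coverage side of this: given per-base depths for a scaffold, classify each position as low/good/high relative to mean depth and merge runs into BED-style half-open intervals. The cutoff fractions here are illustrative defaults, not values from the pipeline; the customizable rule would expose them as parameters.

```python
def classify_coverage(depths, low_frac=0.25, high_frac=2.0):
    """Classify per-base depths (one scaffold) into low/good/high
    coverage intervals relative to the mean depth.

    low_frac/high_frac are illustrative cutoffs, not pipeline defaults.
    Returns a list of (start, end, label) half-open BED-style intervals.
    """
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    low, high = low_frac * mean, high_frac * mean

    def label(d):
        if d < low:
            return "low"
        if d > high:
            return "high"
        return "good"

    intervals = []
    start, cur = 0, label(depths[0])
    for i, d in enumerate(depths[1:], start=1):
        lab = label(d)
        if lab != cur:
            # close the current run and start a new one
            intervals.append((start, i, cur))
            start, cur = i, lab
    intervals.append((start, len(depths), cur))
    return intervals
```

Keeping only these merged intervals (rather than the raw bedgraphs) is exactly the space/time trade-off discussed below: the intervals are small, and regenerating them under new cutoffs just reruns this classification.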

A second level of quality control that would be valuable is an assessment of individual-level problems, ideally to identify individuals with excess missing data or other departures from expectations. This could also include screening for related individuals. Ideally this step would output a variety of sample lists that can be used in downstream filtering, e.g. one with just the high-quality unrelated individuals, and one with all high-quality individuals regardless of relatedness. First steps here might be:
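The sample-list logic above could look something like the following sketch. The genotype encoding (`'./.'` for missing), the missingness threshold, and the rule for breaking related pairs (drop the member with more missing data) are all assumptions for illustration, not decisions the pipeline has made.

```python
def sample_missingness(genotypes):
    """Fraction of missing genotype calls per sample.

    genotypes: dict mapping sample name -> list of GT strings,
    where './.' or '.' denotes a missing call (VCF convention).
    """
    return {s: sum(g in ("./.", ".") for g in gts) / len(gts)
            for s, gts in genotypes.items()}


def qc_sample_lists(genotypes, related_pairs, max_missing=0.2):
    """Return (high_quality, high_quality_unrelated) sample lists.

    max_missing is an illustrative threshold. For each related pair
    that survives the missingness filter, the member with MORE missing
    data is dropped from the unrelated list (one possible tie-break).
    """
    miss = sample_missingness(genotypes)
    high_q = sorted(s for s, m in miss.items() if m <= max_missing)
    dropped = set()
    for a, b in related_pairs:
        if a in high_q and b in high_q:
            dropped.add(a if miss[a] >= miss[b] else b)
    unrelated = [s for s in high_q if s not in dropped]
    return high_q, unrelated
```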

Another level of quality control would be to implement some basic analyses that can help spot problems. The simplest option here is probably PCA, but various STRUCTURE-like options may also be of interest.
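For the PCA option, a minimal version needs little more than a centered genotype-dosage matrix and an SVD. This is a sketch, not the implementation Erik's code uses; the 0/1/2 dosage encoding and function name are assumptions.

```python
import numpy as np


def genotype_pca(G, n_components=2):
    """PCA of a samples x sites genotype matrix of 0/1/2 dosages.

    Mean-centers each site, then projects samples onto the leading
    principal components via SVD. Returns an array of shape
    (n_samples, n_components) of sample coordinates.
    """
    G = np.asarray(G, dtype=float)
    X = G - G.mean(axis=0)                     # center each site
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

Real data would also want per-site normalization and handling of missing genotypes, which this sketch omits.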

Finally, at the moment we do not provide much in the way of tunable filtering options or other assessments of site-based quality, beyond fixed GATK filter commands and the callable sites bed file. There are of course other options for extending site-based filtering in various ways:

tsackton commented 2 years ago

Adding some notes based on Zoom discussion 9/20.

  1. Erik has code that already handles a fair bit of the plotting and downstream analysis: PCA, missingness vs depth, distributions of filtering statistics. The plan is for @erikenbody and @cademirch to pick the commits that add this code and submit a pull request to add this functionality to the main repository.
  2. The coverage calculations need to be ported from old code in the manuscript repository and probably rewritten.
  3. Relatedness calculations are on deck to be written next.
tsackton commented 2 years ago

Most of this is closed by #33

Remaining issues involve computing coverage and mappability. We need to make a few decisions here. In particular:

  1. Coverage bedgraphs can get large and consume significant space, especially if we want to store per-individual coverage as well as pooled coverage. There are a few options we could consider here:

My initial feeling is that we should probably just treat these as temporary files, keeping only the output BEDs that divide the genome into regions of low/good/high coverage. The compute time to rerun the coverage rule if we want to change the cutoffs is probably cheaper than the storage needed to keep the temporary files.

However, I think that some kind of smoothing function is the most elegant solution, so if anyone has a clever idea there I'd be interested in hearing it.

  2. Whether we keep coverage bedgraphs or not, we need to decide on a heuristic to mark regions of the genome as high coverage, low coverage, or good coverage. This is a little challenging because the answer probably depends a bit on the average per-individual coverage in the experiment.

  3. We need to decide how to weight mappability vs coverage for filtering, and also whether we should use other data (e.g., departures from HWE in regions of aberrantly high coverage, assembly quality statistics) to guide this decision. With a perfect assembly, coverage data isn't really that necessary and mappability would probably be fine on its own, but of course a mappability score cannot diagnose problematic SNPs due to assembly errors.
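However mappability and coverage end up being weighted, the mechanical core of combining them is a BED intersection: callable sites are the positions that pass both filters. A minimal sketch for sorted intervals on a single scaffold (the real pipeline would presumably use bedtools, but the logic is the same):

```python
def intersect_beds(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end)
    half-open intervals on one scaffold, e.g. high-mappability regions
    with good-coverage regions, yielding callable-site intervals.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return out
```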

tsackton commented 2 years ago

Adding another comment, as after running a few of these pipelines I am starting to think that some additional QC outputs would be useful.

One that would be very useful is just some kind of crude SNPs/bp plot, as that can be a quick visual way to identify problematic scaffolds. This probably needs the mappability and coverage BEDs done first, so that completely filtered windows don't show up as spurious signal. However, a simple first step would be to add a rule that computes SNPs per bp and add a plot to the QC output that summarizes this per chromosome/scaffold (perhaps ignoring scaffolds below some size threshold), or perhaps per interval. That would be a quick way to see problems.
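The counting step behind such a plot is simple windowed binning of SNP positions. A sketch, with an illustrative window size (the plotting itself, and any scaffold-length filter, would sit on top of these counts):

```python
from collections import Counter


def snp_density(snps, window=100_000):
    """Count SNPs per fixed-size window on each scaffold.

    snps: iterable of (scaffold, position) with 1-based positions,
    as in a VCF. window is an illustrative default, not a pipeline
    setting. Returns Counter mapping (scaffold, window_index) -> count.
    """
    counts = Counter()
    for scaf, pos in snps:
        counts[(scaf, (pos - 1) // window)] += 1
    return counts
```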

Also, I think some basic filtering data would be useful, just a bit of information on how many SNPs are filtered, and perhaps Ts/Tv ratios for different classes of filtered sites or something. I think we can leave more detailed exploration of filtering to a later project, though.
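The Ts/Tv piece is straightforward to compute from REF/ALT pairs; a sketch that could be applied separately to PASS and filtered site classes (the function name and input shape are illustrative):

```python
# the four purine<->purine / pyrimidine<->pyrimidine transitions
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv(variants):
    """Ts/Tv ratio for an iterable of (REF, ALT) single-base pairs.

    Everything that is not a transition is counted as a transversion.
    Returns inf if there are no transversions (degenerate input).
    """
    variants = list(variants)
    ts = sum((r, a) in TRANSITIONS for r, a in variants)
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")
```

Genome-wide biallelic SNPs typically land around Ts/Tv ~2, so a markedly lower ratio in a filtered class is a quick sanity check that the filter is enriching for artifacts.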