harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Add Quality Control Outputs #16

Closed: tsackton closed this issue 2 years ago

tsackton commented 2 years ago

This pipeline as it stands takes FASTQ files and produces BAMs plus filtered VCFs, using the GATK standard hard filters. It produces a few quality control outputs, such as basic BAM statistics and summaries like SNPs per interval, but we do not produce extensive QC reports or output.

At a minimum, we should add calculations of mappability, coverage, and callable site bed files (using customizable rules for mappability and coverage cutoffs). These will aid in filtering and are necessary for any additional downstream steps. This requires at least the following:
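One way to sketch the coverage side of this: given per-base depths for a scaffold, classify each position as low/good/high relative to mean depth and merge runs into BED-style half-open intervals. The cutoff fractions here are illustrative defaults, not values from the pipeline; the customizable rule would expose them as parameters.

```python
def classify_coverage(depths, low_frac=0.25, high_frac=2.0):
    """Classify per-base depths (one scaffold) into low/good/high
    coverage intervals relative to the mean depth.

    low_frac/high_frac are illustrative cutoffs, not pipeline defaults.
    Returns a list of (start, end, label) half-open BED-style intervals.
    """
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    low, high = low_frac * mean, high_frac * mean

    def label(d):
        if d < low:
            return "low"
        if d > high:
            return "high"
        return "good"

    intervals = []
    start, cur = 0, label(depths[0])
    for i, d in enumerate(depths[1:], start=1):
        lab = label(d)
        if lab != cur:
            # close the current run and start a new one
            intervals.append((start, i, cur))
            start, cur = i, lab
    intervals.append((start, len(depths), cur))
    return intervals
```

Keeping only these merged intervals (rather than the raw bedgraphs) is exactly the space/time trade-off discussed below: the intervals are small, and regenerating them under new cutoffs just reruns this classification.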

A second level of quality control that would be valuable is an assessment of individual-level problems, ideally to identify individuals with excess missing data or other departures from expectations. This could also include screening for related individuals. Ideally this step would output a variety of sample lists that can be used in downstream filtering, e.g. one with just the high-quality unrelated individuals, and one with all high-quality individuals regardless of relatedness. First steps here might be:
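The sample-list logic above could look something like the following sketch. The genotype encoding (`'./.'` for missing), the missingness threshold, and the rule for breaking related pairs (drop the member with more missing data) are all assumptions for illustration, not decisions the pipeline has made.

```python
def sample_missingness(genotypes):
    """Fraction of missing genotype calls per sample.

    genotypes: dict mapping sample name -> list of GT strings,
    where './.' or '.' denotes a missing call (VCF convention).
    """
    return {s: sum(g in ("./.", ".") for g in gts) / len(gts)
            for s, gts in genotypes.items()}


def qc_sample_lists(genotypes, related_pairs, max_missing=0.2):
    """Return (high_quality, high_quality_unrelated) sample lists.

    max_missing is an illustrative threshold. For each related pair
    that survives the missingness filter, the member with MORE missing
    data is dropped from the unrelated list (one possible tie-break).
    """
    miss = sample_missingness(genotypes)
    high_q = sorted(s for s, m in miss.items() if m <= max_missing)
    dropped = set()
    for a, b in related_pairs:
        if a in high_q and b in high_q:
            dropped.add(a if miss[a] >= miss[b] else b)
    unrelated = [s for s in high_q if s not in dropped]
    return high_q, unrelated
```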

Another level of quality control would be to implement some basic analyses that can help spot problems. The simplest option here is probably PCA, but various STRUCTURE-like options may also be of interest.
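For the PCA option, a minimal version needs little more than a centered genotype-dosage matrix and an SVD. This is a sketch, not the implementation Erik's code uses; the 0/1/2 dosage encoding and function name are assumptions.

```python
import numpy as np


def genotype_pca(G, n_components=2):
    """PCA of a samples x sites genotype matrix of 0/1/2 dosages.

    Mean-centers each site, then projects samples onto the leading
    principal components via SVD. Returns an array of shape
    (n_samples, n_components) of sample coordinates.
    """
    G = np.asarray(G, dtype=float)
    X = G - G.mean(axis=0)                     # center each site
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

Real data would also want per-site normalization and handling of missing genotypes, which this sketch omits.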

Finally, at the moment we do not provide much in the way of tunable filtering options or other assessments of site-based quality, beyond fixed GATK filter commands and the callable sites bed file. There are of course other options for extending site-based filtering in various ways:

tsackton commented 2 years ago

Adding some notes based on Zoom discussion 9/20.

  1. Erik has code that already handles a fair bit of the plotting and downstream analysis: PCA, missingness vs depth, distributions of filtering statistics. The plan is for @erikenbody and @cademirch to pick the commits that add this code and submit a pull request to add this functionality to the main repository.
  2. The coverage calculations need to be ported from old code in the manuscript repository and probably rewritten.
  3. Relatedness calculations are on deck to be written next.
tsackton commented 2 years ago

Most of this is closed by #33

Remaining issues involve computing coverage and mappability. We need to make a few decisions here. In particular:

  1. Coverage bedgraphs can get large and consume significant space, especially if we want to store per-individual coverage as well as pooled coverage. There are a few options we could consider here:

My initial feeling is that we should probably just treat these as temporary files, keeping only the output BEDs that divide the genome into regions of low/good/high coverage. The compute time to rerun the coverage rule if we want to change the cutoffs is probably cheaper than the storage needed to keep the temporary files.

However, I think that some kind of smoothing function is the most elegant solution, so if anyone has a clever idea there I'd be interested in hearing it.

  2. Whether we keep coverage bedgraphs or not, we need to decide on a heuristic to mark regions of the genome as high coverage, low coverage, or good coverage. This is a little challenging because the answer probably depends a bit on the average per-individual coverage in the experiment.

  3. We need to decide how to weight mappability vs coverage for filtering, and also whether we should use other data (e.g., departures from HWE in regions of aberrantly high coverage, assembly quality statistics) to guide this decision. With a perfect assembly, coverage data isn't really that necessary and mappability would probably be fine on its own, but of course a mappability score cannot diagnose problematic SNPs due to assembly errors.
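However mappability and coverage end up being weighted, the mechanical core of combining them is a BED intersection: callable sites are the positions that pass both filters. A minimal sketch for sorted intervals on a single scaffold (the real pipeline would presumably use bedtools, but the logic is the same):

```python
def intersect_beds(a, b):
    """Intersect two sorted, non-overlapping lists of (start, end)
    half-open intervals on one scaffold, e.g. high-mappability regions
    with good-coverage regions, yielding callable-site intervals.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return out
```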

tsackton commented 2 years ago

Adding another comment, as after running a few of these pipelines I am starting to think that some additional QC outputs would be useful.

One that would be very useful is just some kind of crude SNPs/bp plot, as that can be a quick visual way to identify problematic scaffolds. This probably needs the mappability and coverage BEDs done first, so that completely filtered windows don't show up as spurious signal. However, a simple first step would be to add a rule that computes SNPs per bp and add a plot to the QC output that summarizes this per chromosome/scaffold (perhaps ignoring scaffolds below some size threshold), or perhaps per interval. That would be a quick way to see problems.
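The counting step behind such a plot is simple windowed binning of SNP positions. A sketch, with an illustrative window size (the plotting itself, and any scaffold-length filter, would sit on top of these counts):

```python
from collections import Counter


def snp_density(snps, window=100_000):
    """Count SNPs per fixed-size window on each scaffold.

    snps: iterable of (scaffold, position) with 1-based positions,
    as in a VCF. window is an illustrative default, not a pipeline
    setting. Returns Counter mapping (scaffold, window_index) -> count.
    """
    counts = Counter()
    for scaf, pos in snps:
        counts[(scaf, (pos - 1) // window)] += 1
    return counts
```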

Also, I think some basic filtering data would be useful, just a bit of information on how many SNPs are filtered, and perhaps Ts/Tv ratios for different classes of filtered sites or something. I think we can leave more detailed exploration of filtering to a later project, though.
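The Ts/Tv piece is straightforward to compute from REF/ALT pairs; a sketch that could be applied separately to PASS and filtered site classes (the function name and input shape are illustrative):

```python
# the four purine<->purine / pyrimidine<->pyrimidine transitions
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv(variants):
    """Ts/Tv ratio for an iterable of (REF, ALT) single-base pairs.

    Everything that is not a transition is counted as a transversion.
    Returns inf if there are no transversions (degenerate input).
    """
    variants = list(variants)
    ts = sum((r, a) in TRANSITIONS for r, a in variants)
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")
```

Genome-wide biallelic SNPs typically land around Ts/Tv ~2, so a markedly lower ratio in a filtered class is a quick sanity check that the filter is enriching for artifacts.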