lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0

Sub-pooling suggestions #33

Open alexis-sedg opened 1 week ago

alexis-sedg commented 1 week ago

Hello,

I'm interested in using this pipeline for pooled data. However, when we designed the study, we used the sub-pooling method recommended by CRISP (a variant caller). So instead of a single BAM file per population, I have several. I assume I could use the downstream VCF or mpileup as input to your pipeline, but I'd prefer to use the BAMs directly. Is there a way to use multiple BAMs for a given sample population in the grenedalf pipeline?

Thank you, Alexis

lczech commented 1 week ago

Hi Alexis @alexis-sedg,

Interesting approach! Do you have a reference or link for that sub-pooling procedure? Why is it the recommendation for that variant caller?

Yes, grenedalf can do that, using the --sample-group-merge-table option that is provided for most of the commands. Alternatively, you could merge the BAM files into one BAM per sample, e.g., with samtools merge, if you prefer that. As you say, working on downstream VCFs or mpileups is not the best approach: VCFs are not well suited for pooled data in the first place, and pileup is just a waste of disk space as far as grenedalf is concerned.
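As a rough illustration of the first route, here is a minimal Python sketch that writes a grouping table for --sample-group-merge-table. It assumes the table is a plain two-column, tab-separated file with the sample name in the first column and the group to merge it into in the second; that layout, and the names grenedalf assigns to samples read from BAM files, should be checked against the grenedalf documentation. The sub-pool and population names used here (popA_sub1, popA, and so on) are hypothetical placeholders.

```python
import csv
import io


def merge_table_rows(groups):
    """Flatten {group: [sample, ...]} into (sample, group) rows."""
    rows = []
    for group, samples in groups.items():
        for sample in samples:
            rows.append((sample, group))
    return rows


def write_merge_table(groups, handle):
    """Write one tab-separated 'sample<TAB>group' line per sub-pool.

    Assumption: grenedalf expects sample name first, group name second.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(merge_table_rows(groups))


if __name__ == "__main__":
    # Hypothetical layout: two populations, each sequenced as sub-pools.
    sub_pools = {
        "popA": ["popA_sub1", "popA_sub2", "popA_sub3"],
        "popB": ["popB_sub1", "popB_sub2"],
    }
    buffer = io.StringIO()
    write_merge_table(sub_pools, buffer)
    print(buffer.getvalue(), end="")
```

The resulting file would then be passed to --sample-group-merge-table alongside the individual BAM inputs, so that the sub-pools of each population are treated as one sample.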

Hope that helps, and so long, Lucas

alexis-sedg commented 5 days ago

Hi Lucas,

Yeah, happily! The paper it's from is "A statistical method for the detection of variants from next-generation resequencing of DNA pools": https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2881398/ Alternatively, there is the GitHub repository: https://github.com/vibansal/crisp/tree/master

My understanding is that they compare multiple replicate pools of the same population to distinguish sequencing errors from rare alleles. The explanation they provide is: "In the absence of a variant, the frequency of the reads with a nucleotide different from the reference base at a particular position should be similar across multiple pools. The intuition being that sequencing errors, especially those that depend upon the local sequence context, are likely to be shared across reads in multiple pools. In contrast, presence of a rare variant in a pool is expected to result in an excess of reads with the alternate allele as compared with the other pools. We use a contingency table approach to compute a P-value for the null hypothesis in the absence of a SNP (see [Fig. 1] for an illustration of this idea)."
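To make the intuition concrete, here is a small Python sketch that computes a Pearson chi-square statistic on a pools-by-allele table of read counts at one position. This is not CRISP's actual implementation, just a generic homogeneity statistic that stands in for the contingency table idea; the read counts are made up for illustration. A statistic near zero means the alternate-allele fraction is similar across pools (consistent with shared sequencing error), while a large value means one pool carries an excess of alternate reads.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of read counts.

    Each row is one pool: (reference_reads, alternate_reads).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if all pools shared one allele frequency.
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Made-up counts: similar low alternate fraction in every pool.
    error_like = [(98, 2), (97, 3), (99, 1)]
    # Made-up counts: the third pool has an excess of alternate reads.
    variant_like = [(98, 2), (97, 3), (80, 20)]

    print("error-like:  ", chi_square_statistic(error_like))
    print("variant-like:", chi_square_statistic(variant_like))
```

A P-value would follow by comparing the statistic against a chi-square distribution with (number of pools - 1) degrees of freedom for a two-column table, or by permuting reads across pools when counts are small.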

Excellent, I'll give the merge function a go! Read quality and quantity are variable across my data, even between samples from the same group. Do you have any additional considerations or recommendations for dealing with this variability, or is it alright to run the merge function as is?

Thanks for your time, Alexis