lczech / grenedalf

Toolkit for Population Genetic Statistics from Pool-Sequenced Samples, e.g., in Evolve and Resequence experiments
GNU General Public License v3.0
35 stars 2 forks source link

BAMs vs VCFs for pooled mitochondrial variants using Kofler or Karlsson #15

Closed MarinaSci closed 5 months ago

MarinaSci commented 1 year ago

Dear Lucas, I have been using your tool ‘Grenedalf’ for analysing pooled data, and I would like to first thank you for putting such a great tool together!

I work with pooled data (coming from shotgun metagenomics sequencing of faecal samples infected with parasitic eggs or pools of eggs directly) and I am interested in analysing both nuclear and mitochondrial data to understand genetic variation within those parasites. I work directly on stool extracts mostly, so any downstream analysis is based on allele frequencies not genotypes.

I have used both VCF and BAM files to calculate Fst and theta pi nucleotide diversity from a couple of samples, and I am getting different results depending on whether I use BAM or VCF files for the same number of positions to be considered (n=229 mitochondrial variants, after filtering for missingness, quality etc). You have explained previously how you recommend to use BAM files as opposed to VCFs and that makes sense, so I will switch to using bam files moving forward.

First Q: For calculating Fst, when I use the ‘Karlsson’ method (as I am unsure of my pool size = I have hundreds of eggs in each environmental sample) with a VCF, then the number of SNPs that are considered for the analysis is n=93 but it is 95 if I am using the corresponding BAM files. However, the total number of SNPs in the VCF is 229 and they are all biallelic from looking at the VCF. When I switch to the ‘Kofler’ method, for the BAM files, the number of SNPs to be considered is 156 (!) but for the VCF is still 93. I am unsure why ~ 73 positions are considered ‘Not SNPs’ from ‘Grenedalf’ (during the Kofler/BAM method), though they appear to be SNPs in my VCF and why when I use the ‘Karlsson’ method, then the tool throws out some ‘non biallelic SNPs’ (n=61) as well as the ‘no SNPs’ (n=73). Are there any assumptions made by ‘Grenedalf’? Do the BAM files hold information about the reads (like indel positions etc even though they have been filtered out) that makes ‘Grenedalf’ to ignore them? The VCFs I have used to test these have been filtered for min/max alleles (2 for both, bcftools), and then the positions have been filtered for high quality (based on custom QUAL distributions and for 0.7 max-missingness). Does it have to do with the fact I am using mitogenome data (ploidy of 1?).

Command line for pi: grenedalf diversity --sam-path $i --filter-sample-min-count 2 --filter-sample-min-coverage 1 --filter-region-list $k --window-type chromosomes --pool-sizes 1000 --file-prefix ${nickname}_${name}_pi_ Command line for Fst: grenedalf fst --sam-path $i $j --method karlsson --filter-region-list 0.7_pools.recode.kept.sites --window-type chromososomes --file-prefix ${i}_${j}_

Second Q: Is there a way to calculate within sample pi diversity when you don't know the pool size?

Third Q: Is it possible to generate a collated table from when running Fst with BAMs? I have 88 bams and all the possible combos generate over 3,000 CSVs that have to be merged later...

Thank you for all the fantastic work and responsiveness! Marina

lczech commented 1 year ago

Dear Marina,

I have been using your tool ‘Grenedalf’ for analysing pooled data, and I would like to first thank you for putting such a great tool together!

Thank you :-)

You have explained previously how you recommend to use BAM files as opposed to VCFs and that makes sense, so I will switch to using bam files moving forward.

Sounds good!

First Q:

For calculating Fst, when I use the ‘Karlsson’ method (as I am unsure of my pool size = I have hundreds of eggs in each environmental sample)...

The Karlsson estimator is kind of the same as our Hudson (as explained in the equations supplement here), but with infinite pool size, and only taking biallelic SNPs into account. The pool size correction applied in the estimators is mostly relevant for small pool sizes (because that's where the sampling noise is high, and where the correction hence matters). But it has almost no effect for larger pools - on the order of 1/poolsize. In your case, with fecal samples, I think it's fair to assume that we have a very large pool size, meaning that the correction won't be needed anyway, so you can also use the other estimators, and just set the pool size to some large number.

...with a VCF, then the number of SNPs that are considered for the analysis is n=93 but it is 95 if I am using the corresponding BAM files. However, the total number of SNPs in the VCF is 229 and they are all biallelic from looking at the VCF.

Hard to tell without seeing the file, but are all those 229 variants actually SNPs, or could they be indels as well? As of now, grenedalf ignores positions that are not SNPs. Furthermore, it could also be that in all your populations, the variant is the same. Say, your reference genome as a C in a position, but all your samples have a T. In that case, as far as grenedalf is concerned, that's not a variant - all samples have the same base there. There might be other things happening as well that might cause that - if you want, you can share the file (or parts of it) here, and I can have a look.

When I switch to the ‘Kofler’ method, for the BAM files, the number of SNPs to be considered is 156 (!) but for the VCF is still 93. I am unsure why ~ 73 positions are considered ‘Not SNPs’ from ‘Grenedalf’ (during the Kofler/BAM method), though they appear to be SNPs in my VCF and why when I use the ‘Karlsson’ method, then the tool throws out some ‘non biallelic SNPs’ (n=61) as well as the ‘no SNPs’ (n=73).

Hm, I'm not sure that I can fully follow the numbers there. However, as teased above, the Karlsson estimator only consideres biallelic positions (which is why you see the number of "non biallelic SNPs" reported there), while all other estimators can also work with multiallelic SNPs (unless --filter-total-only-biallelic-snps is provided). Can you check your VCF and see if the ALT column has rows with multiple alternative alleles listed? Do the numbers make sense when you take this into account?

Are there any assumptions made by ‘Grenedalf’? Do the BAM files hold information about the reads (like indel positions etc even though they have been filtered out) that makes ‘Grenedalf’ to ignore them?

Grenedalf tries to make sense of the information that each format provides, as best as possible. BAM files contain the reads, mapped to the reference genome, so yes, they contain information about indels. However, indels are ignored by grenedalf - the read is still used, but only the positions in it that align with the reference (according to whatever mapping tool you used), and the bases in the read that were considered indels (by the mapping tool) are ignored. Does that make sense?

The VCFs I have used to test these have been filtered for min/max alleles (2 for both, bcftools), and then the positions have been filtered for high quality (based on custom QUAL distributions and for 0.7 max-missingness).

Well, as mentioned in https://github.com/lczech/grenedalf/issues/14, VCFs are highly processed, and follow the statistical assumptions of the variant caller you used, which is usually meant for diploid individuals, and not for pools of sequences. Hence, the filtering applied to the VCF might kinda help, but using callers for individuals on pools might introduce bias already, so not sure if this really solves the issue. Better to work with the BAM files, as you already do ;-)

Does it have to do with the fact I am using mitogenome data (ploidy of 1?).

Hm don't think so, that should not matter in your case. With ploidy 1, your pool of sequences basically can be interpreted in a way where each haplotype corresponds to some individual. With higher ploidy, that would not be possible, as each individual would have multiple haplotypes in their genome (unless they are all homozygous), so you would not be able to tell which haplotypes originally came from which individual. But statistically for the analyses in grenedalf, that should not matter - it just takes the reads as one big pool.

Ploidy does make a difference though for the pool size. From our equations document (latest version as of now):

The pool sequencing approach assumes the pool size to be the number of distinct haplotypes in the population that have been sampled. Hence, for instance for 100 diploid individuals, the number of haplotypes (and hence the pool size) is 200.

But with your "infinite" pool size, that isn't really relevant here.

Command line for pi: grenedalf diversity --sam-path $i --filter-sample-min-count 2 --filter-sample-min-coverage 1 --filter-region-list $k --window-type chromosomes --pool-sizes 1000 --file-prefix ${nickname}_${name}_pi_ Command line for Fst: grenedalf fst --sam-path $i $j --method karlsson --filter-region-list 0.7_pools.recode.kept.sites --window-type chromososomes --file-prefix ${i}_${j}_

Looks good, but --filter-sample-min-coverage 1 doesn't do anything. A sample with a coverage of 0 would be empty anyway, and hence not used in the statistic.

Second Q:

Second Q: Is there a way to calculate within sample pi diversity when you don't know the pool size?

Yes, as mentioned, in your case, with such big pool size, the correction is irrelevant, so just use a big number (1000000 or so). The corrections for coverage are still used then.

Third Q:

Third Q: Is it possible to generate a collated table from when running Fst with BAMs? I have 88 bams and all the possible combos generate over 3,000 CSVs that have to be merged later...

Sure. In your command above, you are calling the command with just one pair of samples. You can also give it all your samples at once (or for simplicity, just the directory with all of them, if they are in the same). Then, all pairwise fst will be computed and printed in one table. You can optionally also subset that with the --comparand, --second-comparand, and --comparand-list options, if you do not want the pairwise all-to-all comparison.

Hope that helps, and let me know if you have follow up questions!

Cheers Lucas

MarinaSci commented 1 year ago

Dear Lucas,

All crystal clear, thank you for explaining everything with such granularity!

Yes, as mentioned, in your case, with such big pool size, the correction is irrelevant, so just use a big number (1000000 or so). The corrections for coverage are still used then.

I have tested this between 1000 and 1000000 and did not see much of difference for my data; thanks for the suggestion!

I am attaching 2 VCF files for one of the species I am looking at (with the aforementioned 229 SNPs). The number of the SNPs is the same, but the number of individuals per file changes.

  1. ALUM_bcftools_mtDNA_HIGHQUAL_n500_FORMAT_AD_removeddups.recode_MAX_MISS_0.7_w_filtered_Indiv_copy.recode.vcf.gz - contains all individuals (those that come from single worms and pooled data)
  2. ALUM_bcftools_mtDNA_HIGHQUAL_n500_nodups_MAX_MISSING_0.7_pools.recode_copy.vcf.gz - contains only those samples that have come from 'pooled' data.

Both VCFs have been filtered for min/max alleles (=2) and max-missingness 0.7.

I am dealing with both individuals (single worms) and pooled data (faecal extracts with hundreds of eggs) and so far my plan is to split these samples and run pixy (or equivalent) for the individuals (and deal with genotypes) and use Grenedalf for the pooled data. Ideally, I would like to compare all at the end. So, I was wondering.. Would you advise against combining both types of datasets in the calculations (Fst or Pi) with Grenedalf? Since Grenedalf deals with BAMs I was wondering if it would be ok to combine both to make comparisons easier... ?

Thanks again for you time!

lczech commented 1 year ago

Hey Marina,

so, that whole topic of how to treat missing data is a bit tricky, see the discussions at https://github.com/lczech/grenedalf/issues/4, https://github.com/lczech/grenedalf/issues/5, and https://github.com/lczech/grenedalf/issues/7... I'm not quite sure yet what to make out of this, and want to implement a better treatment of that in grenedalf in the future.

For now, I think combining the datasets would make it easier for you, in terms of tools etc. Looking at the files you shared, it seems that there are only 90 instances of zero coverage across all samples and positions, right? So, 90 out of 229 x 17 = 3893, so you have ~2.3% missing data? I'm not sure how sensitive your analysis would be to that - maybe that is small enough that the effect is acceptable? But please check - I'm not a statistician, and might be wrong about that.

Sorry that I cannot say anything more than that, but hope that helps Lucas

lczech commented 5 months ago

Hi @MarinaSci,

finally following up on this. I just released grenedalf v0.5.0, which contains a lot of improvements that might help you with your analysis:

As for your pooled vs individual data comparison: I think you could do that, and for the individual data, just set a high pool size (1mio), which will in fact act as if there is no sampling noise from the pooling, and hence have the same effect as not using the pool size correction at all. Bit counter-intuitive, but I think that should work. If this is interesting for others, I could in theory also implement a special case for pool size 1 to deactivate the correction. But not sure if that would be more useful or more confusing...

Generally, I'd recommend to read through the new General usage part of the wiki. I hope that explains all that you need - if not, feel free to ask! Going to close this issue for now though, as I think the questions here have hopefully been answered now :-)

Cheers and so long Lucas