isaacovercast / easySFS

Effective selection of population size projection for construction of the site frequency spectrum. Convert VCF to dadi/fastsimcoal style SFS for demographic analysis
124 stars 23 forks source link

Sum of frequencies on projected DAF #41

Closed OliverPStuart closed 3 years ago

OliverPStuart commented 3 years ago

Hi again,

Why does the total number of observations (the sum of the frequency bins) in the projected DAF not correspond to the number of sites specified by the projection preview?

I've projected my data down to 50 samples, this is a dataset from mpileup which contains both variant and invariant sites. I assume this is probably not what the program was designed to deal with. At 50 samples running --preview tells me that at 50 samples 12810 sites will be retained, but the sum of the frequency spectrum output is 1715.

Am I missing something fundamental here, or is it an input problem?

Best, Ollie

isaacovercast commented 3 years ago

It's true that invariant sites are not typically contained in vcf files, and I haven't specifically tested vcfs with lots of invariant sites, but it should work fine, and the invariants should just end up contributing to the 0th bin.

I double-checked the accuracy of --preview and --proj and the preview and final sfs output counts are in agreement for my test data.

Can I see the full command line for both the --preview and --proj calls you're running? Can you show me the sfs output that you're talking about (with 1715 snps)? Can you show me a relevant snippet of the --preview results?

OliverPStuart commented 3 years ago

Sure thing.

Here is the --preview results saved as a two column table: easySFS_projection.txt

And here is the projected DAF: sanguinipes_DAFpop0.obs.txt

isaacovercast commented 3 years ago

Cool, thanks, Can i also see the full command line calls?

OliverPStuart commented 3 years ago
~/easySFS/easySFS.py \
-i Sanguinipes_AllSites_Filt2.vcf \
-p easySFS_popfile --preview -a

~/easySFS/easySFS.py \
-i Sanguinipes_AllSites_Filt2.vcf \
-p easySFS_popfile \
--proj 50 -o easySFSproj50 \
--unfolded
isaacovercast commented 3 years ago

Ok, well here's something, if you use the '-a' flag for the preview then it will include all snps in the vcf. Then when you call --proj without the -a argument it will only sample one snp per CHROM. This would explain the very large difference between segsite counts in the preview and proj modes. Include the -a flag in the proj call or drop it from the preview call to get results that agree.

As a side note, easySFS makes a strong assumption that the data come from RADSeq like loci, where each CHROM in the vcf is essentially a RAD locus. If you have reference mapped data, then you'll need to use the -a flag. One of these days I'll implement a --window_size argument to allow sampling snps from windows of fixed size in reference mapped data, but I just haven't done it yet.

OliverPStuart commented 3 years ago

Ah, thank you for spotting that and apologies for opening an issue without being thorough first.