Closed OliverPStuart closed 3 years ago
It's true that invariant sites are not typically contained in vcf files, and I haven't specifically tested vcfs with lots of invariant sites, but it should work fine, and the invariants should just end up contributing to the 0th bin.
I double-checked the accuracy of --preview and --proj and the preview and final sfs output counts are in agreement for my test data.
Can I see the full command line for both the --preview and --proj calls you're running? Can you show me the sfs output that you're talking about (with 1715 snps)? Can you show me a relevant snippet of the --preview results?
Sure thing.
Here is the --preview results saved as a two column table: easySFS_projection.txt
And here is the projected DAF: sanguinipes_DAFpop0.obs.txt
Cool, thanks, Can i also see the full command line calls?
~/easySFS/easySFS.py \
-i Sanguinipes_AllSites_Filt2.vcf \
-p easySFS_popfile --preview -a
~/easySFS/easySFS.py \
-i Sanguinipes_AllSites_Filt2.vcf \
-p easySFS_popfile \
--proj 50 -o easySFSproj50 \
--unfolded
Ok, well here's something, if you use the '-a' flag for the preview then it will include all snps in the vcf. Then when you call --proj without the -a
argument it will only sample one snp per CHROM. This would explain the very large difference between segsite counts in the preview and proj modes. Include the -a flag in the proj call or drop it from the preview call to get results that agree.
As a side note, easySFS makes a strong assumption that the data come from RADSeq like loci, where each CHROM in the vcf is essentially a RAD locus. If you have reference mapped data, then you'll need to use the -a flag. One of these days I'll implement a --window_size
argument to allow sampling snps from windows of fixed size in reference mapped data, but I just haven't done it yet.
Ah, thank you for spotting that and apologies for opening an issue without being thorough first.
Hi again,
Why does the total number of observations (the sum of the frequency bins) in the projected DAF not correspond to the number of sites specified by the projection preview?
I've projected my data down to 50 samples, this is a dataset from mpileup which contains both variant and invariant sites. I assume this is probably not what the program was designed to deal with. At 50 samples running
--preview
tells me that at 50 samples 12810 sites will be retained, but the sum of the frequency spectrum output is 1715.Am I missing something fundamental here, or is it an input problem?
Best, Ollie