isaacovercast / easySFS

Effective selection of population size projection for construction of the site frequency spectrum. Convert VCF to dadi/fastsimcoal style SFS for demographic analysis
129 stars 23 forks source link

Issue when there are a lot of missed data in VCF file #38

Closed noscode closed 3 years ago

noscode commented 3 years ago

Hi @isaacovercast,

Thank you very much for easySFS!

I have faced a problem when VCF file contains missed data. I had a VCF file with dots (.) instead of information about alleles (0|1) and build SFS of small size with easySFS. Also I build a full SFS of maximum size just in case. Then I decided to downsize my full spectrum to small size and compare it with those from easySFS. And they does not match. I am not sure what is going on but I think that it happens because of data dict contains frequencies without any information about missed data.

I have created a small example of that behavior and attached the notebook here.

Best regards, Ekaterina

noscode commented 3 years ago

I thought about it a little more and run more tests. As I understand when there is missing data and we build the whole spectrum then we loose all SNP's with missing data. When we build smaller SFS we downsize the full spectra and also add information about missing data if it has SNP's with at least the required (according to the size of SFS) number of individuals. I mean that if we have SNP with known 6 alleles in population and 4 missed alleles then it will not be taken for SFS of size 10 but it will be counted for SFS of size 6. I want to say that now I think that easySFS is doing the correct thing and it is okay that it is not equal to usual downsizing. Please tell me if I am right or not.

isaacovercast commented 3 years ago

@noscode You are very close. Lets assume we have a dataset with some fraction of missing data. The standard sfs can not be calculated with missing data, so if we retain all samples then we must remove all sites with any missing values before calculating the sfs. The "down-projection" strategy is a way of retaining more sites with missing data. This is accomplished by averaging over all possible combinations of allele frequencies at the given sampling level. In your example, if you have 10 (haploid) samples, with 4 missing sites you could down-project from 10 to 8 samples and easySFS will consider all possible combinations of allelic values from the six sampled individuals at the two unsampled (missing) sites. It seems like you basically have the right idea, i hope my explanation clarifies a bit more.

Good luck, -isaac