paula-tataru / polyDFE

predicting DFE and alpha from polymorphism data
GNU General Public License v3.0

About downsampling and incomplete SFS #5

Closed chenyangkang closed 3 years ago

chenyangkang commented 3 years ago

Hi there,

I can now run the program successfully, but I get a really high alpha estimate (over 0.99). I believe there's something wrong with my unfolded SFS.

My SFS file goes like this:

1 1 20
540 73 65 43 37 24 21 21 37 157 22 14 6 2 1 0 0 0 0 667755 3614 667755
355 64 39 29 20 11 19 22 40 249 30 6 3 5 5 4 0 0 0 222585 5084 222585

There weren't enough alleles observed at d = 17, 18 and 19, so I have to downsample. But I don't know how to match the downsampled SFS with the haplotype number (20 in this case). I put a zero wherever the allele count is zero. Is that right? I think this will bias the result hugely.

Any advice?

Thanks a lot, Yangkang

paula-tataru commented 3 years ago

Adding zeros to the SFS will bias the results, yes. You have to down-sample the data as described in the tutorial:

Alternatively, projection methods can be used to down-sample the SNP data to build a complete SFS with a reduced number of samples [9, 10].

  9. Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, Hubisz MJ, Fledel-Alon A, Tanenbaum DM, Civello D, White TJ, Sninsky JJ, Adams MD, Cargill M (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biology 3(6):e170.
 10. James JE, Piganeau G, Eyre-Walker A (2016) The rate of adaptive evolution in animal mitochondria. Molecular Ecology 25(1):67–78.

There doesn't seem to be a full consensus on how to do this "properly" besides those two references. The so-called hypergeometric projection is the standard approach.

You might also find some useful pointers on how to do this in Python here: https://speciationgenomics.github.io/easysfs/
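To make the hypergeometric projection concrete, here is a minimal Python sketch. It is not code from polyDFE or easySFS. It assumes each site is given as a hypothetical `(derived_count, called_haplotypes)` pair, and the function names `project_site` and `project_sfs` are made up for illustration. A site with `i` derived alleles among `n` called haplotypes contributes probability `C(i, j) * C(n - i, m - j) / C(n, m)` to count `j` in the projected sample of size `m`. Sites with fewer than `m` called haplotypes are skipped.

```python
from math import comb


def project_site(derived, n, m):
    """Hypergeometric probabilities of observing j = 0..m derived alleles
    when m haplotypes are drawn without replacement from n haplotypes,
    of which `derived` carry the derived allele."""
    if not 0 <= derived <= n:
        raise ValueError("derived count must lie between 0 and n")
    if m > n:
        raise ValueError("cannot project to a larger sample size")
    total = comb(n, m)
    # math.comb(a, b) returns 0 when b > a, so impossible counts get probability 0.
    return [comb(derived, j) * comb(n - derived, m - j) / total
            for j in range(m + 1)]


def project_sfs(sites, m):
    """Sum per-site projections into expected counts for j = 0..m.

    `sites` is an iterable of (derived_count, called_haplotypes) pairs
    (a hypothetical input layout, chosen for this sketch)."""
    sfs = [0.0] * (m + 1)
    for derived, n in sites:
        if n < m:
            continue  # too much missing data at this site to project to m
        for j, p in enumerate(project_site(derived, n, m)):
            sfs[j] += p
    return sfs


# Toy example: three sites with different amounts of missing data, projected to m = 16.
projected = project_sfs([(3, 20), (1, 18), (10, 17)], m=16)
polymorphic = projected[1:16]  # entries j = 1..15, the part an unfolded SFS records
```

The projected entries are expected counts and are generally not integers. Whether and how to round them before writing an input file is left open here. The sample size in the header line would then be the projected `m` rather than the original 20. Divergence counts and sequence lengths are not handled by this sketch.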