popgenmethods / smcpp

SMC++ infers population history from whole-genome sequence data.
GNU General Public License v3.0
149 stars 32 forks source link

estimate of Watterson theta #219

Open Devashish13 opened 1 year ago

Devashish13 commented 1 year ago

Hi I am working with the SMC++ and having issue in understanding the Watterson's estimate of theta.

By looking at the code,

theta = 2N0(mu) 0.0001 = 2N0(mu)

so estimate of theta is already fixed and you are estimating N0 based on the parameter mu provided by user and scaling time in units of 2N0, if my understanding is correct.

while looking into code you have also estimated theta_hat but i don't see it in the model output file.

the question is if i use mutation rate say 1.25e-8 for human populations and scale coalescent times and other parameters in 2*N0=8000, is it correct?

what i expect is that N0 should vary across populations and scaling should be done with respect to respective population.

Kindly help me in getting the intuition behind this.

thanks, Devashish

andreaswallberg commented 1 year ago

Hi @Devashish13 ! Did you get an answer to this question?

Devashish13 commented 1 year ago

Hi @andreaswallberg , I haven't received any response on this issue.

Do you have any opinion on this?

Thanks

albanieto commented 1 year ago

Hello @Devashish13 and @andreaswallberg !! I was (am) having the same issue as you. And here is what I did to solve it: As far as I understood, regarding this answer from @terhorst (https://github.com/popgenmethods/smcpp/issues/43), the results on the "y" vector in the json file correspond to log(N(t)/N0), which, by means, would be the log(\lambda) of the PSMC. However, I don't find that the "N0" provided in the .json file is the same used to retrieve the Nt in the plot and, I assume, neither the theta (so by means, the usage of N0 and theta wouldn't be equivalent to \N0 and \theta0 in PSMC). Moreover, "theta" and "N0" are in all the models the same, so they are not an estimation but an scaling parameter, as @terhorst also answered in this other question (https://github.com/popgenmethods/smcpp/issues/4), although this issue is quite old.

What I did instead to obtain theta was to use the command plot with the flag "--csv". This will provide a csv file with the exact dataframe used to produce the plot. With these values of N (in the "y" column, which is not the same as the "y" field in the json by the way) to obtain theta using the classic formula of "theta=4Nmu" (N is diploid).

Just in case you know anything more about this or any suggestion, It would really help me, because I still do not understand how does this conversion from the .json file to the .csv file is performed.

Here is the plot command I used:

smc++ plot plot.png --csv model.final.json

And here I attach the model.json.file and its corresponding csv used to plot by the smc++ command just in case they could help.

The json: { "alpha": 100, "hidden_states": { "G_1": [ 0.0, 0.01982738953745406, 0.03110264603135564, 0.06916095605427722, 0.11263119461076172, 0.15297110023780527, 0.21612627210487956, 0.290612217888383, 0.35062442429756435, 0.41722052251666, 0.49384622604542405, 0.5974064057777699, 0.7262184819837451, 0.8876875956953854, 1.0468770666701805, 1.3398472881048509, 2.3989643955177486, Infinity ] }, "model": { "N0": 5000.0, "class": "SMCModel", "knots": [ 0.01982738953745406, 0.06916095605427722, 0.15297110023780527, 0.290612217888383, 0.41722052251666, 0.5974064057777699, 0.8876875956953854, 1.3398472881048509 ], "pid": "G_1", "spline_class": "Piecewise", "y": [ 0.8454882609319204, 0.052497943237970614, 0.06380435312837801, 0.14666462262690239, -0.11259461961525338, -0.1628059637227178, 0.3820524047483612, -0.02186170583811593 ] }, "rho": 0.00999561556631319, "theta": 0.0001 }

The plot:

plot

The csv:

(https://github.com/popgenmethods/smcpp/files/11100467/plot.csv)