petrelharp / ftprime_ms

4 stars 2 forks source link

Check output without selection against msprime and theory #11

Closed ashander closed 6 years ago

ashander commented 7 years ago
  1. I had to scale the mutation and recombination rate to the same length as the tree from the sim
  2. Can reproduce by running the code below to make the tree in forward time
python ./benchmark-simuPOP.py --theta 10 --rho 10 --nsam 10 --pdel 0 --gc 50 \
      --outfile1 out.log.gz --seed 42 --popsize 100 --treefile tree.hdf5
ashander commented 7 years ago

Clearly the results don't match

ashander commented 7 years ago

But then, neither do the results from fwdpy11_arg_example ( 3ca8820 ). And our script and @molpopgen 's are producing tree sequences with different sequence lengths

molpopgen commented 7 years ago

But then, neither do the results from fwdpy11_arg_example ( 3ca8820 ). And our script and @molpopgen 's are producing tree sequences with different sequence lengths

So, this is disturbing. If you skip simulating mutations in msprime, and instead add them after the fact to the tree, are you able to get the correct distribution?

It seems we need to work through the scaling of things, at the very least. I'm guessing @jeromekelleher can help.

ashander commented 7 years ago

This is with adding them after the fact (in the same way I was doing with ftprime). I hacked your benchmark a little to do this:

https://github.com/ashander/fwdpy11_arg_example/blob/check-trees/benchmarking.py

On Sat, Nov 4, 2017 at 4:06 PM, Kevin R. Thornton notifications@github.com wrote:

But then, neither do the results from fwdpy11_arg_example ( 3ca8820 https://github.com/petrelharp/ftprime_ms/commit/3ca882045561003ab868aba083d3b384741f49be ). And our script and @molpopgen https://github.com/molpopgen 's are producing tree sequences with different sequence lengths

So, this is disturbing. If you skip simulating mutations in msprime, and instead add them after the fact to the tree, are you able to get the correct distribution?

It seems we need to work through the scaling of things, at the very least. I'm guessing @jeromekelleher https://github.com/jeromekelleher can help.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-341936413, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLONnvpzHSc7I_FT3bbB6csTUWKRn0ks5szO3kgaJpZM4QSGhC .

ashander commented 7 years ago

And yes, I'm sure Jerome can help :D. I'll make a little table of what we're doing in each case w/r/to mutation rate, recomb rate, etc and add that to this PR

On Sat, Nov 4, 2017 at 4:12 PM, Jaime Ashander jashander@ucdavis.edu wrote:

This is with adding them after the fact (in the same way I was doing with ftprime). I hacked your benchmark a little to do this:

https://github.com/ashander/fwdpy11_arg_example/blob/ check-trees/benchmarking.py

On Sat, Nov 4, 2017 at 4:06 PM, Kevin R. Thornton < notifications@github.com> wrote:

But then, neither do the results from fwdpy11_arg_example ( 3ca8820 https://github.com/petrelharp/ftprime_ms/commit/3ca882045561003ab868aba083d3b384741f49be ). And our script and @molpopgen https://github.com/molpopgen 's are producing tree sequences with different sequence lengths

So, this is disturbing. If you skip simulating mutations in msprime, and instead add them after the fact to the tree, are you able to get the correct distribution?

It seems we need to work through the scaling of things, at the very least. I'm guessing @jeromekelleher https://github.com/jeromekelleher can help.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-341936413, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLONnvpzHSc7I_FT3bbB6csTUWKRn0ks5szO3kgaJpZM4QSGhC .

molpopgen commented 7 years ago

This is with adding them after the fact (in the same way I was doing with ftprime). I hacked your benchmark a little to do this:

Ok. And are you getting the correct mean S in msprime? It should be theta*\sum_{i=1}^{n-1}\frac{1}{i}

ashander commented 7 years ago

On Sat, Nov 4, 2017 at 4:14 PM, Kevin R. Thornton notifications@github.com wrote:

This is with adding them after the fact (in the same way I was doing with ftprime). I hacked your benchmark a little to do this:

Ok. And are you getting the correct mean S in msprime? It should be theta*\sum_{i=1}^{n-1}\frac{1}{i}

What sets the $n$ here?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-341936775, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOGXLs1yq_AebXMiwSaymYRqAOLQlks5szO_ugaJpZM4QSGhC .

molpopgen commented 7 years ago

What sets the $n$ here?

Sample size. Formula also assumes n << 2N, where N is pop'n size.

ashander commented 7 years ago

It's off by about 10x. I have a feeling all this may relate to our old friendly scaling warning at the end of this section of the msprime api docs,

This parameterisation of recombination, mutation and migration rates is different to ms, which states these rates over the entire region and in coalescent time units. The motivation for this is to allow the user change the size of the simulated region without having to rescale the recombination and mutation rates, and to also allow users directly state times and rates in units of generations. However, the mspms command line application is fully ms compatible.

On Sat, Nov 4, 2017 at 4:24 PM, Kevin R. Thornton notifications@github.com wrote:

What sets the $n$ here?

Sample size. Formula also assumes n << 2N, where N is pop'n size.

Gotcha. Well regardless, it seems the mean S as I'm calculating it (which you can see on those figures is like ~2500) is not matching theta * sum(1/ i for i in range(1, 10) =~ 28.2. (That's for theta=10, which is the value being passed to our scripts, but in the check script I'm passing msprime a value of mutation_rate=theta/4. --- this clearly wouldn't match either.)

I have to check out soon but will follow up on this stuff a bit more tomorrow.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-341937221, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOIY9WdTRl70C3mk__kIKxbfu_tBuks5szPJAgaJpZM4QSGhC .

jeromekelleher commented 7 years ago

Ahh, good old popgen time scaling gymnastics!

OK, so part of the problem here is that we're not converting the rates into generations correctly for msprime. All time units and rates in msprime is measured in generations. This means multiple/dividing by 4Ne in various places, and this is what the Ne parameter to msprime.simulate is used for. I find it a lot less confusing to just plug in some Ne, sequence lengths and reasonable recombination/mutation rates rather than trying to remember what sensible values of the compound parameters like theta etc are (I accept this makes me a pathetic popgen failure!).

This is where I do the rate/time conversions in msprime, it might be helpful:

https://github.com/jeromekelleher/msprime/blob/master/lib/msprime.c#L3396

petrelharp commented 7 years ago

Just to record things:

... and atm I am forgetting if Ne is diploid or haploid popuation size. Diploid, I think?

ashander commented 7 years ago

Some additional detail to what @petrelharp just wrote down.

Units

parameter msprime ftprime fwdpy11_arg_example
time generations generation generations generations
genome base pairs base pairs Continuous interval [0,1) [3]
recombination rate 1/base 1/generation 1/ind 1/base * 1/generation [1] Poisson process with mean rho/4N [4]
mutation rate 1/base 1/generation 1/ind 1/base * 1/generation [2] Poisson process with mean theta/4N [4]

[0] msprime is per individual [1] I think the above is true regardless of if we use distance-based intensity= or uniform rates= parameterizations in simuPOP. The test runs were done with rates=.

[2] We use an SNPMutator and apply it once per generation. There, the rate "specifies the probability at which each allele mutates to another" so it's per allele, but we apply it to all available loci so it's per base-pair.

[3] This Hudson's "ms" here.

[4] The number of mutations (per gamets) and recombination breakpoints (per diploid) are Poisson processes with positions occurring along the unit interval. fwdpy11 relaxes this, allowing any range of positions, and variation in the processes, but that is all disallowed in fwdpy11_arg_example (but will be allowed when this stuff makes its way into fwdpy11)

Scalings

parameter msprime ftprime fwdpy11_arg_example
time generations ? ? generations
recombination rate per unit length per generation ? per region
mutation rate per unit length per generation ? per region
ashander commented 7 years ago

@molpopgen @petrelharp @jeromekelleher please feel free to edit my table in the above. (As Peter and I learned elsewhere, you can edit other people's comments.)

Although I'm not sure what happens if there are race conditions. I won't edit it further for this morning.

petrelharp commented 7 years ago

This is not a big deal, but it'll be less confusing if we always use the same name for the same thing, avoiding mutation_rate and using mutation_rate_per_bp and theta, for instance.

molpopgen commented 7 years ago

I did some testing using my benchmarking script. I made a copy called distS.py that prints out the length of the mutation table. (I've not added that to the repo--this is just a simple change for testing.

Then, generate seeds:

for i in $(seq 1 1 1000)
do
echo $RANDOM
done > SEEDS

And then run 1000 sims:

parallel --jobs 20 python distS.py --popsize 1000 -T 100 -R 100 --pdel 0 --nsam 10 -G 1000 --outfile1 foo --simlen 20 -S :::: SEEDS > distStheta100.txt 

It isn't done yet, but enough has been done to show that I'm getting very close to the correct E[S]. In R:

x=scan("distStheta100.txt");mean(x)
Read 498 items
[1] 285.0562
100*sum(1/(1:9))
282.8968

So, this shows that I had figured out the correct scaling already, and had forgotten. The time units in the sim are in generation, and the simplified tree is with respect to gametes, so I pass theta/4N as the mutation rate, which is the mutation rate per gamete, per generation.

However, I did notice something. If I only ran for 10N generations, I get too few segregating sites, which I why I do a simlen of 20 for this test. This is kind of a problem, as it implies that the handling of incompletely coalesced trees is not equivalent to simulating the neutral mutations themselves.

@jeromekelleher -- thoughts on this?

jeromekelleher commented 7 years ago

However, I did notice something. If I only ran for 10N generations, I get too few segregating sites, which I why I do a simlen of 20 for this test. This is kind of a problem, as it implies that the handling of incompletely coalesced trees is not equivalent to simulating the neutral mutations themselves.

Is the problem here that when you run for 10N generations then the samples are not fully coalesced and so you have trees with multiple roots? There's not much we can do then as we'll be missing some long root branches and we will get what you are seeing.

The only way to be sure I think is to check if there are trees with multiple roots after simplify. I meant to implement this but ran out of time. I can implement it though if it's important here, I don't think it's very hard to do.

jeromekelleher commented 7 years ago

Re the tables above @ashander.

It might be more helpful to think of rates in msprime as per unit of sequence length rather than base pairs (although it's basically the same thing). We can think of recombination and mutation events as happening in the continuous interval 0..L, where L is the sequence_length of the tree sequence. So, if we have a mutation rate of mu, then the expected number of mutation events for a single individual in a generation is L * mu.

I'm not sure if this makes it any clearer!

I've updated the bottom table to show this (but please do re-edit if it's not helpful).

molpopgen commented 7 years ago

Is the problem here that when you run for 10N generations then the samples are not fully coalesced and so you have trees with multiple roots? There's not much we can do then as we'll be missing some long root branches and we will get what you are seeing.

Yes, this is the issue.

The only way to be sure I think is to check if there are trees with multiple roots after simplify. I meant to implement this but ran out of time. I can implement it though if it's important here, I don't think it's very hard to do.

I think the easier thing in the long run will be to go back to the original idea of seeding a sim with a tree. I think I understand the API scaling well enough to test that out.

jeromekelleher commented 7 years ago

I think the easier thing in the long run will be to go back to the original idea of seeding a sim with a tree. I think I understand the API scaling well enough to test that out.

This will guarantee everything coalesces all right.

ashander commented 7 years ago

Note that for checks we should be sure to pass the correct length, Ne and recombination rate to msprime.simulate when setting up the initial tree https://github.com/petrelharp/ftprime_ms/blob/27ad3e983365d45619f9b531d550cd68a9c2355e/sims/benchmark-simuPOP.py#L219-L222

Our sims with ftprime will currently produce a sequence with length equal to the number of basepairs: https://github.com/petrelharp/ftprime_ms/blob/27ad3e983365d45619f9b531d550cd68a9c2355e/sims/benchmark-simuPOP.py#L155-L166 (I think in fwdpy this would be length = 1.0 and the rates scaled appropriately.)

On Mon, Nov 6, 2017 at 12:36 PM, Jerome Kelleher notifications@github.com wrote:

I think the easier thing in the long run will be to go back to the original idea of seeding a sim with a tree. I think I understand the API scaling well enough to test that out.

This will guarantee everything coalesces all right.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-342278371, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOJYXrhC-UePUF-lF-zpoMp4-Q3Duks5sz23igaJpZM4QSGhC .

molpopgen commented 7 years ago

Looks like I've got the scalings, etc., correct. I will look into updating my repo to allow an initial TreeSequence to get passed in.

ashander commented 7 years ago

Update: all is not yet well with getting the ftprime scaled correctly.

I've updated my check script (which runs a bunch of msprime sims and compares a single ftprime run). It was looking promising for values like rho=10, theta=10, N=100 but bumping up theta/rho I see ftprime well below the distribution from msrpime: 46e818a

I'm also running a check similar to @molpopgen 's (also with theta=rho=100) on our cluster.

ashander commented 7 years ago

Further update: :(

When mostly done with the check against expected $S = \theta \sum_i^{n-1} 1/i$ (where I'm counting variants like print(len([i for i in mutated_ts.variants()]))), it's clear we're overestimating with simuPOP +ftprime

for i in $(seq 1 1 100)
do
echo $RANDOM
done > seeds.txt
readarray SEEDS < seeds.txt

python3 ./benchmark-simuPOP.py --popsize 1000 --theta 100 --rho 100 --pdel 0 \
  --nsam 10 -G 1000 --outfile1 /dev/null --simlen 10 \
  --logfile log_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.log  \
  --seed ${SEEDS[${SLURM_ARRAY_TASK_ID}]} -n \
 > distStheta100_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.txt
$ cat distStheta100_232181_*txt  > tmp && \
     Rscript -e 'scan("tmp") -> x ; mean(x); 100 * sum(1/(1:9))'
Read 94 items
[1] 355.8723
[1] 282.8968
ashander commented 7 years ago

Click below to see diversity, sequence length, num trees and num mutations for a sample of the tree sequences generated.

rho=theta=100 ``` Mean pairwise diversity: 0.0004381280683017469 Sequence length: 249999.0 Number of trees: 336 Number of mutations: 382 All done! -- Mean pairwise diversity: 0.0003201486490156487 Sequence length: 249999.0 Number of trees: 300 Number of mutations: 345 All done! -- Mean pairwise diversity: 0.000360780390489983 Sequence length: 249999.0 Number of trees: 287 Number of mutations: 345 All done! -- Mean pairwise diversity: 0.0004461702057334545 Sequence length: 249999.0 Number of trees: 293 Number of mutations: 398 All done! -- Mean pairwise diversity: 0.00041821219916458614 Sequence length: 249999.0 Number of trees: 277 Number of mutations: 403 All done! -- Mean pairwise diversity: 0.00047200188800755204 Sequence length: 249999.0 Number of trees: 378 Number of mutations: 428 All done! -- Mean pairwise diversity: 0.00039183314627995355 Sequence length: 249999.0 Number of trees: 288 Number of mutations: 353 All done! -- Mean pairwise diversity: 0.0004200227327225098 Sequence length: 249999.0 Number of trees: 290 Number of mutations: 343 All done! -- Mean pairwise diversity: 0.00043019119444898834 Sequence length: 249999.0 Number of trees: 320 Number of mutations: 419 All done! ```
molpopgen commented 7 years ago

Further update: :(

When mostly done with the check against expected $S = \theta \sum_i^{n-1} 1/i$ (where I'm counting variants like print(len([i for i in mutated_ts.variants()]))), it's clear we're overestimating with simuPOP +ftprime

Edit after reading the above:

are you seeding with a tree from msprime? If so, if you run for 20N generations, do you get the right distribution. If so, then you're not getting the initial params right in msprime. If not, then there's a bug in ftprime somewhere.

molpopgen commented 7 years ago

Out of a sudden sense of paranoia, I'm doing tests now based on N=5e3, theta = rho = 1,000, and a sample size of 100. I'll post the results tomorrow.

ashander commented 7 years ago

This was

On Tue, Nov 7, 2017 at 5:55 PM, Kevin R. Thornton notifications@github.com wrote:

Out of a sudden sense of paranoia, I'm doing tests now based on N=5e3, theta = rho = 1,000, and a sample size of 100. I'll post the results tomorrow.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-342684649, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOEIqUikdeY_XlmhYKqktjqC-KMsLks5s0QoNgaJpZM4QSGhC .

molpopgen commented 7 years ago

I ran the N=5,000, n=100, theta=rho=1,000 case. It seems to mostly check out, but I think I need to look carefully at the code. I don't like that mean S is off a bit with 1,000 replicates:

> x=scan("distStheta1000_with_seeding.txt");mean(x)
Read 1000 items
[1] 5163.845
> y=scan("msprimeS.t1000.out");mean(y)
Read 1000 items
[1] 5177.03
> ks.test(x,y)

    Two-sample Kolmogorov-Smirnov test

data:  x and y
D = 0.053, p-value = 0.1205
alternative hypothesis: two-sided

Warning message:
In ks.test(x, y) : p-value will be approximate in the presence of ties

> quantile(x)
  0%  25%  50%  75% 100% 
4547 5046 5165 5283 5673 
> quantile(y)
  0%  25%  50%  75% 100% 
4690 5067 5173 5288 5692
petrelharp commented 7 years ago

re: Jaime's check above --

$ cat distStheta100232181txt > tmp && \ Rscript -e 'scan("tmp") -> x ; mean(x); 100 sum(1/(1:9))' Read 94 items [1] 355.8723 [1] 282.8968

The comparison should be to 100 * sum(1/(1:19)) = 354.774 because there are 10 diploid samples. So that looks good.

petrelharp commented 7 years ago

It looks like the discrepancies above are due to diploid samples != haploid samples? Kevin, you are working with haploids, no?

molpopgen commented 7 years ago

It looks like the discrepancies above are due to diploid samples != haploid samples? Kevin, you are working with haploids, no?

I am pulling out haploid samples, by which I mean I am sampling n things without replacement from a list of 2N gametes.

ashander commented 7 years ago

Awesome thanks for going though the script and expectations with me Peter.

This also explains differences (e.g. in the figures in this PR) between our trees and sims with msprime. The factor of 2 is inside the number of samples!

On Wed, Nov 8, 2017 at 11:31 AM Peter Ralph notifications@github.com wrote:

It looks like the discrepancies above are due to diploid samples != haploid samples? Kevin, you are working with haploids, no?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-342932591, or mute the thread https://github.com/notifications/unsubscribe-auth/AAfLOJSx-qG57n1mDYDpbl9z0OiWdlWYks5s0gF5gaJpZM4QSGhC .

-- -Jaime

petrelharp commented 7 years ago

I am pulling out haploid samples, by which I mean I am sampling n things without replacement from a list of 2N gametes.

... and Jaime is pulling diploid samples.

molpopgen commented 7 years ago

'

... and Jaime is pulling diploid samples.

Interesting.

I am also finding that I need to set Ne = 2N in msprime.simulate. Doing so makes my distribution of S match up better.

ashander commented 7 years ago

I am also finding that I need to set Ne = 2N in msprime.simulate. Doing so makes my distribution of S match up better.

but in msprime.simulate, Ne= is supposed to be diploid, sample_size= however, just says individuals. because the coalescent is haploid I guess that means haploid but maybe @jeromekelleher can shed some light

petrelharp commented 7 years ago

... and Jaime is pulling diploid samples.

Interesting.

This is because simuPOP is diploid (also dioecious, btw).

I am also finding that I need to set Ne = 2N in msprime.simulate. Doing so makes my distribution of S match up better.

msprime is haploid, so you should definitely set Ne to be whatever your total haploid population size is.

ashander commented 7 years ago

msprime is haploid, so you should definitely set Ne to be whatever your total haploid population size is.

As mentioned above, I think sample_size is haploid but Ne definitely is diploid in msprime.simulate

petrelharp commented 7 years ago

Ne being diploid agrees with a quick check:

>>> ts = msprime.simulate(100, Ne=10, recombination_rate=10)
>>> sum([t.total_branch_length * (t.interval[1] - t.interval[0]) for t in ts.trees()])
206.74305981988397
>>> 2 * 10 * sum([1/k for k in range(1,100)])
103.54755035279241
>>> 2 * (2 * 10) * sum([1/k for k in range(1,100)])
207.09510070558483
molpopgen commented 7 years ago

Ne being diploid agrees with a quick check:

Perhaps I have my recombination rate off by a factor of 2, then.

molpopgen commented 7 years ago

Ne being diploid agrees with a quick check:

My code to generate the initial TreeSequence is here:

https://github.com/molpopgen/fwdpy11_arg_example/blob/seed_with_simulation/fwdpy11_arg_example/evolve_arg.py#L51

As mentioned above, the K-S tests "pass", but my mean S is off by about 10, which I know from experience is too big of a discrepancy for 1,000 replicates and the K-S test results are misleading.

If I change the Ne to 2N, however, the mean S is dead on. If I don't divide the recombination rate by 2 and leave Ne = N, the discrepancy returns.

jeromekelleher commented 7 years ago

I'm not sure I can be of much help here guys. Unfortunately, I took an 'ms is always right' approach to this sort of stuff, and just followed what Hudson did in terms of scalings. I really wish everyone had just said 2Ne instead of 4Ne, but I guess we're stuck with it now. I know msprime outputs the same distributions as ms, via a truckload of qqplots like this one:

coalescent_t

molpopgen commented 7 years ago

I am going to check Hudson and Kaplan's Rmin, which is a sure-fire way to tell if your recombination stuff is buggy. To get it from a TreeSequence:

import msprime
import libsequence.msprime
import libsequence.summstats

N = 5000.0
rate = 1000.0

for i in range(1000):
    t = msprime.simulate(100,recombination_rate=rate/(4.*N),mutation_rate=rate/(4.*N),Ne=N)
    S = t.get_num_mutations()
    d = libsequence.msprime.make_SimData(t)
    ad = libsequence.summstats.PolySIM(d)
    minrec = ad.rm()
    print(S, minrec)
ashander commented 7 years ago

Cool. Although based on Peter's earlier correction it does still seem like we have the correct total tree length vs theory ...

based on number of trees alone I think there's still something strange with recombination in my output (vs msprime). I am cleaning up my testing scripts. I'll also add Rmin as a data point and report back

molpopgen commented 7 years ago

Cool. Although based on Peter's earlier correction it does still seem like we have the correct total tree length vs theory ...

based on number of trees alone I think there's still something strange with recombination in my output (vs msprime) as well. I am cleaning up my testing scripts. I'll also add Rmin as a data point and report back

My test will take about one day. I should also compare to regular fwdpp output.

ashander commented 7 years ago

Update: I'm trying to nail down this issue at the ftprime level first, see https://github.com/ashander/ftprime/pull/54

petrelharp commented 6 years ago

why do you divide recomb rate by 2 at

https://github.com/molpopgen/fwdpy11_arg_example/blob/seed_with_simulation/fwdpy11_arg_example/evolve_arg.py#L53 @molpopgen?

On Wed, Nov 8, 2017 at 1:51 PM, ashander notifications@github.com wrote:

Update: I'm trying to nail down this issue at the ftprime level first, see ashander/ftprime#54 https://github.com/ashander/ftprime/pull/54

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-342972449, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26Yyfp3Rijk-YEwqCi0T5vqjW6S7mks5s0iJfgaJpZM4QSGhC .

molpopgen commented 6 years ago

Per diploid -> per haploid/gamete.

On Thu, Nov 9, 2017, 9:23 AM Peter Ralph notifications@github.com wrote:

why do you divide recomb rate by 2 at

https://github.com/molpopgen/fwdpy11_arg_example/blob/seed_with_simulation/fwdpy11_arg_example/evolve_arg.py#L53 @molpopgen?

On Wed, Nov 8, 2017 at 1:51 PM, ashander notifications@github.com wrote:

Update: I'm trying to nail down this issue at the ftprime level first, see ashander/ftprime#54 https://github.com/ashander/ftprime/pull/54

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-342972449 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AA_26Yyfp3Rijk-YEwqCi0T5vqjW6S7mks5s0iJfgaJpZM4QSGhC

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/11#issuecomment-343224891, or mute the thread https://github.com/notifications/unsubscribe-auth/AGHnH88duE5DWLn4mP6uJHMvDB7Hvg5zks5s0zLZgaJpZM4QSGhC .

--

Kevin Thornton

Associate Professor

Ecology and Evolutionary Biology

UC Irvine

http://www.molpopgen.org

http://github.com/ThorntonLab

http://github.com/molpopgen

molpopgen commented 6 years ago

Below is the output from comparing my forward sim seeded with a TreeSequence to the output from msprime. The agreement is very good, and so I'm declaring it "working". The parameters are N=5,000 with theta = rho = 1,000, with a sample size of 100 chromosomes. '

> n=commandArgs(trailing=T)
> print(n)
[1] "distStheta1000_with_seeding_scaling1.txt"
[2] "msprimeS.t1000.out"                      
> x=read.table(n[1])
> y=read.table(n[2])
> print("Dist of S")
[1] "Dist of S"
> print(paste(mean(x$V1),mean(y$V1)))
[1] "5171.509 5177.936"
> print(ks.test(x$V1,y$V1))

    Two-sample Kolmogorov-Smirnov test

data:  x$V1 and y$V1
D = 0.04, p-value = 0.4005
alternative hypothesis: two-sided

Warning message:
In ks.test(x$V1, y$V1) :
  p-value will be approximate in the presence of ties
> print(quantile(x$V1))
     0%     25%     50%     75%    100% 
4671.00 5070.00 5177.00 5276.25 5650.00 
> print(quantile(y$V1))
     0%     25%     50%     75%    100% 
4721.00 5065.00 5177.00 5287.75 5630.00 
> print("Dist of Rmin")
[1] "Dist of Rmin"
> print(paste(mean(x$V2),mean(y$V2)))
[1] "498.595 497.246"
> print(ks.test(x$V2,y$V2))

    Two-sample Kolmogorov-Smirnov test

data:  x$V2 and y$V2
D = 0.037, p-value = 0.5004
alternative hypothesis: two-sided

Warning message:
In ks.test(x$V2, y$V2) :
  p-value will be approximate in the presence of ties
> print(quantile(x$V2))
    0%    25%    50%    75%   100% 
428.00 484.00 498.00 511.25 586.00 
> print(quantile(y$V2))
  0%  25%  50%  75% 100% 
 429  483  497  511  569 
> 
> 
petrelharp commented 6 years ago

Per diploid -> per haploid/gamete.

On Thu, Nov 9, 2017, 9:23 AM Peter Ralph notifications@github.com wrote:

why do you divide recomb rate by 2 at

Ah ha. I think we do not make this correction, but should. do you agree, @ashander?

molpopgen commented 6 years ago

Ah ha. I think we do not make this correction, but should. do you agree, @ashander?

This is a subtle point about the parameters in these models and what we are trying to do, which is to generate 2N gametes as if they come from a WF pop'n of size N. Forwards in time, r determines the number of breakpoints per diploid, and we're trying to approximate a population of size 2N via a sample of size 2N (the genealogy of the former and the latter are not the same [1]). So, we need to split each of our 2N haploids at rate r/2.

[1] NB this has implications for the applicability of what we're doing. If one seeds with a TreeSequence and then evolves forwards in time for a short amount of time, the site frequency spectrum will be a bit off, as msprime is not allowing the multiple mergers that should be in the genealogy of an entire population.