Closed jeromekelleher closed 6 years ago
Sorry for all the noise in the diff: my vim setup automatically removes trailing whitespace...
I've implemented the figure as discussed over at #20, and put in various other concrete benchmark figures. I think it's important to give solid concrete numbers to emphasise that this stuff really is pretty fast.
I'm not sure what happened to the Algorithm S bits --- I had a bit of an adventure rebasing this on master. Let me know if I need to change this stuff back.
[Edited: updated PR to remove unwanted edits]
Here's the final(ish) figure for reference
One last plug for adding another few lines to (A) -- right now, I don't know how much it might matter to simplify right down to the sample we are interested in at the end of a simulation, as opposed to simplifying down to the entire final population. We could simplify down to samples of size 10 and 100, having three lines, for instance.
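The extra lines would come from timing simplify for several sample sizes against the same simulated history. A minimal sketch of such a timing harness is below; note that `simplify_down_to` is a hypothetical stand-in (a plain edge filter) for the real tskit simplify call, and the synthetic `edges` list stands in for one large simulation replicate:

```python
import random
import time

def simplify_down_to(edges, sample):
    # Hypothetical stand-in for the real simplify call: just keep
    # edges whose child is in the sample, so there is work to time.
    return [e for e in edges if e[0] in sample]

def time_simplify(edges, population, sample_size):
    """Time one simplify call for a random sample of the given size."""
    sample = set(random.sample(population, sample_size))
    start = time.perf_counter()
    simplify_down_to(edges, sample)
    return time.perf_counter() - start

# Synthetic (child, parent) pairs standing in for one replicate's edges.
population = list(range(1000))
edges = [(random.choice(population), random.choice(population))
         for _ in range(10_000)]

# The three proposed lines in panel (A): k = 10, 100, and N.
for k in (10, 100, len(population)):
    t = time_simplify(edges, population, k)
    print(f"sample size {k}: {t:.6f} s")
```

Each line in the figure would then be this loop repeated across a range of population sizes N.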
Partly, this is probably just me trying to make the figure more interesting. =)
But, if you think it's better as-is, lmk, I'll merge.
I thought it was pretty clear from (B) though? Using a sample of size 100 will take a bit under half the time, and size 10 will take a bit under half that time again. We'll just get three parallel lines, which is only very slightly less dull than what we have! (I agree the figure is boring, but only because it's such an excellent fit to our expectations.)
What do you think @molpopgen and @ashander?
It's simple to add a few lines and I'm happy to follow the consensus here.
If it's simple, could you do it? I don't think that's what's going to happen, but will be happy to be wrong.
OK, doing it now.
Cool I'm interested to see it too
Well, looks like I was wrong here!
Basically, input sample size has very little effect for this type of input. The departure from linearity for k=10 is just a stochastic effect of the way I've constructed the test. Using sample size k=N made things nice and predictable, but this is thrown off when k is so small. We'll get rid of this in the final figure, but I thought I'd share this much before putting more work in.
What do you think @petrelharp? Is this what you'd like to show, or is there some other formulation that would be more useful?
Either way, I think I'd prefer to merge this much first and revisit this figure a bit later when I have a bit more time to look at the benchmark.
I'm not understanding how this was generated. There was a single simulated replicate, correct? That means simplifying down to the most recent 2N samples should be a single point, not a line?
As I suspected! This is different from the coalescent-generated history because the time is dominated by all the irrelevant edges, I think. I'd vote for including these lines in the plot except for that weird stochastic thing. So, I guess I vote for either (a) omit the n=10 case or (b) average over several n=10 cases.
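Option (b) amounts to repeating the n=10 measurement over several random samples and averaging. A sketch of that structure, where `run_one_timing` is a hypothetical placeholder for a single noisy simplify benchmark (base cost plus uniform noise, chosen just to illustrate the averaging):

```python
import random
import statistics

def run_one_timing(n, seed):
    # Hypothetical placeholder: one noisy timing measurement for a
    # random sample of size n (real code would time simplify here).
    rng = random.Random(seed)
    return n * 1e-4 + rng.uniform(0.0, 5e-5)  # base cost plus noise

def averaged_timing(n, replicates=20):
    """Average over several random samples to smooth stochastic noise."""
    times = [run_one_timing(n, seed) for seed in range(replicates)]
    return statistics.mean(times)

single = run_one_timing(10, seed=0)
averaged = averaged_timing(10)
print(f"single run: {single:.6f} s, averaged: {averaged:.6f} s")
```

With 20 replicates the averaged point sits much closer to the underlying trend than any single draw, which is all the figure needs.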
Merging, anyhow.
We're just using the single replicate as a large set of edges, and then timing how long it takes to simplify subsets of this set of edges. Whether the samples have fully coalesced or not is irrelevant to simplify.
Thanks for merging @petrelharp, I'll have another pass at this figure at some point. Probably averaging over several random samples for each N is the best thing to do. Interesting to think about why the log time component is so weak in this case, there might be room for a bit of optimisation...
Adds benchmark code for