petrelharp / ftprime_ms


Simplify perf #21

Closed jeromekelleher closed 6 years ago

jeromekelleher commented 6 years ago

Adds benchmark code for

  1. Simplify performance with respect to the input sample size. The data is included here so we can make a plot later. I've plotted the output below: pretty compelling evidence for a logarithmic dependency on the input sample size. (See the timing sketch after this list.)
  2. Comparing the calculation of pi against a numpy implementation (a sketch of such a comparison appears below the figure).
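For context, here is a minimal sketch of what the timing loop for item 1 might look like. This is only an illustration under assumed parameters, not the actual benchmark script added in this PR:

```python
import time

import msprime
import numpy as np

# Simulate once to get a large tree sequence to simplify from.
# (All parameters here are illustrative.)
ts = msprime.simulate(
    sample_size=10**5, Ne=10**4, length=10**7,
    recombination_rate=1e-8, random_seed=42)
all_samples = np.array(ts.samples())

# Time simplify() over a range of input (sub)sample sizes.
for n in [10, 100, 1000, 10**4, 10**5]:
    subset = np.random.choice(all_samples, size=n, replace=False)
    before = time.perf_counter()
    ts.simplify(samples=[int(u) for u in subset])
    print(n, time.perf_counter() - before)
```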

[figure: simplify_subsample_perf]
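For item 2, a rough sketch of the pi comparison might look like the following. I'm assuming the TreeSequence `pairwise_diversity` method and computing the numpy version straight from the genotypes; the code actually committed here may differ:

```python
import msprime
import numpy as np

def pi_numpy(ts):
    # One row per variant site, one column per sample; 0/1 genotypes.
    G = np.array([var.genotypes for var in ts.variants()])
    n = ts.num_samples
    derived = G.sum(axis=1)
    # Sum of per-site heterozygosity: 2 * j * (n - j) / (n * (n - 1)).
    return np.sum(2 * derived * (n - derived) / (n * (n - 1)))

ts = msprime.simulate(
    sample_size=1000, Ne=10**4, length=10**6,
    recombination_rate=1e-8, mutation_rate=1e-8, random_seed=1)
print(pi_numpy(ts), ts.pairwise_diversity())
```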

jeromekelleher commented 6 years ago

Sorry for all the noise in the diff: my vim setup automatically removes trailing whitespace...

jeromekelleher commented 6 years ago

I've implemented the figure as discussed over at #20, and put in various other concrete benchmark figures. I think it's important to be concrete and give nice solid figures to emphasise that this stuff really is pretty fast.

I'm not sure what happened to the Algorithm S bits --- I had a bit of an adventure rebasing this on master. Let me know if I need to change this stuff back. [Edited: updated PR to remove unwanted edits]

Here's the final(ish) figure for reference

[figure: simplify_benchmarks]

petrelharp commented 6 years ago

One last plug for adding another few lines to (A) -- right now, I don't know how much it might matter to simplify right down to the sample we are interested in at the end of a simulation, as opposed to simplifying down to the entire final population. We could simplify down to samples of size 10 and 100, having three lines, for instance.

Partly, this is probably just me trying to make the figure more interesting. =)

But, if you think it's better as-is, lmk, I'll merge.

jeromekelleher commented 6 years ago

One last plug for adding another few lines to (A) -- right now, I don't know how much it might matter to simplify right down to the sample we are interested in at the end of a simulation, as opposed to simplifying down to the entire final population. We could simplify down to samples of size 10 and 100, having three lines, for instance.

I thought it was pretty clear from (B) though? Using a sample of size 100 will take a bit under half the time, and a sample of size 10 a bit under half that again. We'll just get three parallel lines, which is only very slightly less dull than what we have! (I agree the figure is boring, but only because it's such an excellent fit to our expectations.)

What do you think @molpopgen and @ashander?

It's simple to add a few lines and I'm happy to follow the consensus here.

petrelharp commented 6 years ago

One last plug for adding another few lines to (A) -- right now, I don't know how much it might matter to simplify right down to the sample we are interested in at the end of a simulation, as opposed to simplifying down to the entire final population. We could simplify down to samples of size 10 and 100, having three lines, for instance.

I thought it was pretty clear from (B) though? Using a sample of size 100 will take a bit under half the time, and a sample of size 10 a bit under half that again. We'll just get three parallel lines, which is only very slightly less dull than what we have! (I agree the figure is boring, but only because it's such an excellent fit to our expectations.)

What do you think @molpopgen and @ashander?

It's simple to add a few lines and I'm happy to follow the consensus here.

If it's simple, could you do it? I don't think that's what's going to happen, but will be happy to be wrong.

jeromekelleher commented 6 years ago

If it's simple, could you do it? I don't think that's what's going to happen, but will be happy to be wrong.

OK, doing it now.

ashander commented 6 years ago

Cool, I'm interested to see it too.


jeromekelleher commented 6 years ago

Well, looks like I was wrong here!

[figure: simplify_num_edges]

Basically, input sample size has very little effect for this type of input. The off-linear behaviour for k=10 is just some stochastic effect from the way I've constructed the test. Having sample size k=N made sure things were nice and predictable, but this is thrown off by having k so small. We'll get rid of this in the final figure, but I thought I'd share this much before putting more work in.

What do you think @petrelharp? Is this what you'd like to show, or is there some other formulation that would be more useful?

jeromekelleher commented 6 years ago

Either way, I think I'd prefer to merge this much first and revisit this figure a bit later when I have a bit more time to look at the benchmark.

molpopgen commented 6 years ago

Basically, input sample size has very little effect for this type of input. The off-linear behaviour for k=10 is just some stochastic effect from the way I've constructed the test. Having sample size k=N made sure things were nice and predictable, but this is thrown off by having k so small. We'll get rid of this in the final figure, but I thought I'd share this much before putting more work in.

I'm not understanding how this was generated. There was a single simulated replicate, correct? That means simplifying down to the most recent 2N samples should be a single point, not a line?

petrelharp commented 6 years ago

As I suspected! This is different from the coalescent-generated history because the time is dominated by all the irrelevant edges, I think. I'd vote for including these lines in the plot except for that weird stochastic thing. So, I guess I vote for either (a) omit the n=10 case or (b) average over several n=10 cases.

Merging, anyhow.

jeromekelleher commented 6 years ago

I'm not understanding how this was generated. There was a single simulated replicate, correct? That means simplifying down to the most recent 2N samples should be a single point, not a line?

We're just using the single replicate as a large set of edges, and then timing how long it takes to simplify subsets of this set of edges. Whether the samples have fully coalesced or not is irrelevant to simplify.

Thanks for merging @petrelharp, I'll have another pass at this figure at some point. Probably averaging over several random samples for each N is the best thing to do. It's interesting to think about why the log time component is so weak in this case; there might be room for a bit of optimisation...
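If it helps, here is a minimal sketch of the averaging idea (the function and parameter names are hypothetical, not the benchmark code in this repo):

```python
import time

import numpy as np

def mean_simplify_time(ts, k, num_reps=10, seed=None):
    # Average the time taken to simplify ts down to num_reps
    # different random subsamples of size k.
    rng = np.random.RandomState(seed)
    samples = np.array(ts.samples())
    times = []
    for _ in range(num_reps):
        subset = rng.choice(samples, size=k, replace=False)
        before = time.perf_counter()
        ts.simplify(samples=[int(u) for u in subset])
        times.append(time.perf_counter() - before)
    return np.mean(times)
```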