petrelharp / ftprime_ms

4 stars 2 forks source link

First pass at script to compute space complexity. #18

Closed jeromekelleher closed 6 years ago

jeromekelleher commented 6 years ago

Here's a first pass at a script to look at the space complexity of using algorithm W. The idea is that we take a very simple case, where we just simplify after every generation, and take our space complexity as the number of edges in the table.

@petrelharp, can you have a quick look through the algorithm and see if you agree it's correct please? (I'm quite pleased with how elegantly it worked out, but I'm worried it's a bit too good to be true!)

Anyway, some initial plots showing the space complexity as computed (absurdly slowly) by this script:

10 100 1000

So, I'm willing to bet there's a nice closed form for this...

petrelharp commented 6 years ago

Oh, great. I just wrote a section on this, not very well, over at #17. I will investigate.

jeromekelleher commented 6 years ago

Here's a couple of updated plots. The mean is the heavy black line.

For N=100, it looks like we're hitting equilibrium at around 33N.

10 100

petrelharp commented 6 years ago

Looks good to me! I will see about writing down the recursions.

jeromekelleher commented 6 years ago

Awesome! I'm excited about this, can't wait to see what it turns out the be equivalent to!

petrelharp commented 6 years ago

Well, I still haven't lighted on a way to write down recursions in any vaguely reasonable way.

I'm surprised, though, that there's more edges above than the rough calculation suggests:

def naive(n):
   a = 2 * (n-1)  # edges in initial tree
   b = 2 * n * sum(1/k for k in range(1,n)) # total length of the tree
   return a+b

>>> [naive(n) for n in (10,100,1000)]
[74.57936507936508, 1233.4755035279243, 16966.941721100688]
jeromekelleher commented 6 years ago

So, naive(n) is basically Hudson's formula for the number of recombination events, right? And this is roughly rho log N? It must be something like this, doesn't it? The equilibrium should be at k N log N, where k will contain the Euler-Mascharoni constant, the number of edges per recombination event and probably a 2 somewhere.

petrelharp commented 6 years ago

Hm, let's see: the argument is: number of edges is 2 N - 2 for the initial tree; then one new one for each recomb event. Recomb occurs with density 1 per unit length per generation, so mean number occurring on tree of total length L across distance x is xL; so number of edges should be 2 N - 2 + L, where L is now the mean total tree length, since total chromosome length is 1. Total tree length in haploid WF with N tips is sum(k * m[k] for k in 2:N), where m[k] is mean time spent having k tips; since prob that at least two of k tips coalesce in the previous generation is choose(k,2)/N = k (k-1)/2N, the mean time is like 2N/k (k-1) and so k * m[k] = 2N / (k-1). This is exactly right for the continuous-time coalescent; and I've calculated this numerically for the discrete-time WF model up to N=200 (in wf_expectations.R) and the right answer there is not very different. So, this predicts the number of edges would be

  2 * (N - 1 + N * sum(1/k for k in range(1,N))

This is approximately 2 N (1 + log(N) + gamma) where gamma is the EM constant.

I do think that the equilibrium number of edges is growing like N log N, but it looks like the calculation above is off by a factor of a bit less than 3.

jeromekelleher commented 6 years ago

This sounds right to me, except you assume 1 edge per recombination, which is wrong I think. In the edgesets case, we have at most 3 edgesets per recombination (see figure 5 in the msprime paper). I need to think about it again to figure out what the right number is now that we're using edges. I'll have a look at it in the next couple of hours.

petrelharp commented 6 years ago

Oh, right! This is true for edges also. There's that factor of 3. And, it's slightly less than 3 because of (a) recombinations involving the root, and (b) bubbles. So, multiply naive by 3 and we have a good upper bound.

petrelharp commented 6 years ago

Incidentally, the form of the rise is calculable, also: time spent with k tips is 2N/k(k-1) = 2N(1/(k-1) - 1/k) and so the time to go from N to n tips is 2N(1/(n-1) - 1/(N-1)); so at time T we expect to have around n(T) roots, where (replacing N-1 with N)

1/(n(T)-1) = T/2N + 1/N
n(T) = 1 + 1/(T/2N + 1/N)
        = 1 + 2N/(T+2)

The total tree length over this time is 2*N*sum(1/k for k in range(n+1,N)), or approximately 2 log(N/n), or roughly

2 * N *log(N/(1+2*N/(T+2)))

so the upper bound is 2 * (N-1) + 6 * N *log(N/(1+2*N/(T+2)))

petrelharp commented 6 years ago

Well, that looks nice. Er, doesn't quite look like a upper bound, though. See changes over at https://github.com/petrelharp/ftprime_ms/tree/alg-w-space 50 40 30

petrelharp commented 6 years ago

Looks like WF tree length only exceeds coal tree length by about 1.5%, at least in expectation, so that doesn't explain the discrepancy.

That calculation I did above did all sorts of swapping 1/E[X] for E[1/X], though, so we shouldn't be suprised it's off a tad.

jeromekelleher commented 6 years ago

Amazing! This is beautiful @petrelharp, such a great result!

I'm a bit surprised also that it's not an upper bound, but I think this is more than good enough for our purposes. I'll have a think and a look at it and see if I can think of anything though.

jeromekelleher commented 6 years ago

So, the expected number of edges per tree transition is not at most 3 here, it's something slightly bigger. I've done a quick hack here to compute the prediction using the mean number of edges computed from the actual replicates. All very dodgy of course, but I think it's clearly an improvement. I'll look more closely at the trees to see if I can figure out what's happening.

The red lines are when we plug in the number of edges from simulations, and blue where we assume it's 3.

50 100 200

petrelharp commented 6 years ago

oh, very nice.

On Thu, Nov 16, 2017 at 11:06 AM, Jerome Kelleher notifications@github.com wrote:

So, the expected number of edges per tree transition is not at most 3 here, it's something slightly bigger. I've done a quick hack here to compute the prediction using the mean number of edges computed from the actual replicates. All very dodgy of course, but I think it's clearly an improvement. I'll look more closely at the trees to see if I can figure out what's happening.

[image: 50] https://user-images.githubusercontent.com/2664569/32910342-e3dab630-cb00-11e7-9cd2-f995a8771a7c.png [image: 100] https://user-images.githubusercontent.com/2664569/32910344-e3f418e6-cb00-11e7-90a7-fceeb3b56433.png [image: 200] https://user-images.githubusercontent.com/2664569/32910345-e4100056-cb00-11e7-8ade-2aa2349a9a2b.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/18#issuecomment-345029849, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26WaVcpW42aDJOAzynnilwgM4ySVIks5s3IfMgaJpZM4QfhVX .

jeromekelleher commented 6 years ago

Actually, the maximum number of edges for an SPR is 4, and so we have a nice comfortable upper bound:

50 100

I've left in the two other lines, but the green one is the one we're interested in.

Here's an example of an SPR with 4 edges in and out:

       29        
  ┏━━━━┻━━━━┓    
  ┃         16   
  ┃      ┏━━┻━┓  
  14     13   ┃  
┏━┻┓   ┏━┻━┓  ┃  
┃  9   8   10 ┃  
┃ ┏┻┓ ┏┻┓ ┏┻┓ ┃  
3 4 6 1 2 5 7 0  

OUT = 4
{left=0.371, right=0.421, parent=29, child=16}
{left=0.371, right=0.421, parent=29, child=14}
{left=0.103, right=0.421, parent=16, child=13}
{left=0.371, right=0.421, parent=16, child=0}
IN = 4
{left=0.421, right=0.492, parent=21, child=13}
{left=0.421, right=0.492, parent=21, child=14}
{left=0.421, right=0.465, parent=29, child=0}
{left=0.421, right=0.465, parent=29, child=21}
    29           
┏━━━┻━━━┓        
┃       21       
┃    ┏━━┻━━━┓    
┃    13     14   
┃  ┏━┻━┓  ┏━┻┓   
┃  8   10 ┃  9   
┃ ┏┻┓ ┏┻┓ ┃ ┏┻┓  
0 1 2 5 7 3 4 6  

Here the subtree rooted at 13 got grafted into the edge joining 14 to 29. We see the same thing with msprime coalescent simulations, so I think this is just the way it is with the edge representation. I think we can just say it's 4 when we use edges instead of three with (binary) edgesets.

jeromekelleher commented 6 years ago

OK, that's the final plot for tonight. I think we should put (something like) this into the paper. I think it's a super-cool result (but I get a bit over-excited about space complexity results!)

50

petrelharp commented 6 years ago

will do

On Thu, Nov 16, 2017 at 1:03 PM, Jerome Kelleher notifications@github.com wrote:

OK, that's the final plot for tonight. I think we should put (something like) this into the paper. I think it's a super-cool result (but I get a bit over-excited about space complexity results!)

[image: 50] https://user-images.githubusercontent.com/2664569/32915578-471b4ff6-cb11-11e7-8736-d2342ec5434f.png

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/pull/18#issuecomment-345061552, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26QhG2r6xayi09p9MR7kjliaA4jDfks5s3KMPgaJpZM4QfhVX .

petrelharp commented 6 years ago

Done (on master). Figure could be smaller, probably, but good enough for now.