petrelharp / ftprime_ms


First pass at set-theoretical simplify. #28

Closed jeromekelleher closed 6 years ago

jeromekelleher commented 6 years ago

Here's some initial doodling for a set-theory version of simplify @petrelharp. The idea is that we can greatly simplify things if we look at the genetic intervals literally as sets (say we limit to a discrete genome). Most of the complexity of our algorithm comes from dealing with linked lists of segments.

The basic ideas are in here I think, but it's not fully there yet. I think what needs to be done is to make the elements of A dictionaries rather than sets, and it'll follow fairly easily from there. I'll make another pass on Monday and try to finish it, but I thought I'd share this much in case you'd like to take a look.
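To make the "intervals as literal sets" idea concrete, here is a minimal sketch (illustrative only, not code from this PR): with a discrete genome of `L` loci, the ancestral material carried by a node is just a Python set of locus indices, so finding overlapping material reduces to ordinary set intersection and union, with no linked lists of segments.

```python
# Hypothetical illustration: with a discrete genome, ancestral material
# is a plain set of locus indices, so overlap detection is just set
# intersection (names and values here are made up for the example).
L = 10  # number of loci, assumed for illustration

child_a = set(range(0, 6))   # carries loci 0..5
child_b = set(range(4, 10))  # carries loci 4..9

# Loci inherited through two or more children coalesce at the parent;
# these are exactly the loci in the intersection.
overlap = child_a & child_b
assert overlap == {4, 5}

# The material the parent transmits at all is the union.
transmitted = child_a | child_b
assert transmitted == set(range(L))
```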

petrelharp commented 6 years ago

Hm, that's fun.

jeromekelleher commented 6 years ago

Obviously it's hideously inefficient, but that's not the point. I think it's at the right level of abstraction for the paper. I'll finish it up ASAP.

petrelharp commented 6 years ago

So, you think to replace the current algorithm write-up with this one? That sounds good. We could put the current writeup in the msprime docs somewhere if it seems useful.

jeromekelleher commented 6 years ago

I've updated this a bit @petrelharp, and it works now (note: I haven't bothered with internal samples). It's now explicitly working locus-by-locus, which makes it a lot easier to understand what's going on I think. It can probably be made a bit simpler, but I thought I'd run it by you before going any further with it. Do you think this is the best approach for the paper?

I don't want to throw away what you've already done, but I'm struggling to see how anyone will use it as it is. At least with this dumbo-version (even if it is very inefficient) we should be able to write it out concisely, and describe the purpose of every step without writing too much. Getting people to understand the algorithm is much more important than getting them to understand our implementation, I think.

petrelharp commented 6 years ago

I'm in favor of writing down a simpler version. But I'm not a fan of the "discrete vector" implementation, I think, where everyone is a vector of length L. It does make it quite clear, but it's starting to get far enough from the actual representation I'm worried it will confuse people. Why did you make that switch?

jeromekelleher commented 6 years ago

> I'm in favor of writing down a simpler version. But I'm not a fan of the "discrete vector" implementation, I think, where everyone is a vector of length L. It does make it quite clear, but it's starting to get far enough from the actual representation I'm worried it will confuse people. Why did you make that switch?

Fair enough, it's true it's a long way from the real implementation. I made the switch because the majority of our complexity comes from fiddling with overlapping segments. It drastically simplifies things if you can use ordinary set operations on the intervals carrying the ancestral material. I can think about it again if you think it's important we stick with intervals rather than L loci.

I think we need to be able to write the full algorithm down in roughly this much Python, or I'm not sure there's much point in including it in the main text.

petrelharp commented 6 years ago

I think we could do it relatively simply after defining a few helper functions that operate on "sets" that are encoded as ordered lists of (left, right) intervals?
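One such helper might look like this (a sketch of the kind of function meant here, not code from this PR): intersecting two "sets" encoded as sorted, non-overlapping half-open `(left, right)` interval lists.

```python
def intersect(a, b):
    """Intersect two sets encoded as sorted, non-overlapping lists of
    half-open (left, right) intervals. Illustrative sketch only."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        left = max(a[i][0], b[j][0])
        right = min(a[i][1], b[j][1])
        if left < right:
            out.append((left, right))
        # Advance whichever current interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

print(intersect([(0, 5), (8, 10)], [(3, 9)]))  # → [(3, 5), (8, 9)]
```

Union and set difference admit the same two-pointer structure, so a handful of these helpers would cover everything the algorithm needs.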

When I did this before (the first implementation) I ended up finding it easier to think of everything as step functions rather than lists of labeled intervals, btw.
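One way to read the step-function suggestion (my interpretation, not code from this PR): encode a union of half-open intervals as the sorted list of breakpoints where membership toggles between 0 and 1, so membership queries become a binary search over breakpoints.

```python
import bisect

# Hypothetical sketch: a union of half-open intervals as a 0/1 step
# function, stored as the sorted breakpoints where membership toggles.
# [(0, 5), (8, 10)] becomes [0, 5, 8, 10].
def to_steps(intervals):
    steps = []
    for left, right in intervals:
        steps.extend([left, right])
    return steps

def contains(steps, x):
    # x is inside iff an odd number of breakpoints lie at or before it.
    return bisect.bisect_right(steps, x) % 2 == 1

steps = to_steps([(0, 5), (8, 10)])
assert contains(steps, 3)
assert not contains(steps, 6)
assert contains(steps, 8)
```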

jeromekelleher commented 6 years ago

OK, I'll have a go at implementing it using an existing interval arithmetic library. Might do the trick.

jeromekelleher commented 6 years ago

I've had another go using an interval tree library, and it's a bit simpler now and works for the general interval case (no discrete loci anymore). Apart from some variable renaming, I think this is pretty close now.

One nice realisation from going through this process is that we don't actually need to snip out the ancestral material for parents: all we need to know is how it transfers. This might make simplifier_remove_ancestry a bit simpler in the real implementation.

What do you think @petrelharp, do you agree this is a good basis for the algorithm in the paper? Feel free to make a pass at the code BTW, it'd be good to get your input!

petrelharp commented 6 years ago

Yes, this is much nicer than what is in the paper. I have read through it carefully, and don't have any improvements. Translating the intervaltree operations to words/notation will take a bit of thinking.

And, good point about not needing to actually remove ancestry in remove_ancestry. It certainly feels cleaner to remove it, but I don't think we actually use the fact that we do (unless it's used in figuring out the number of roots at the end?). Removing them would in principle also allow lower memory usage if we didn't keep the whole output ts in memory, but we're not doing that, either.

jeromekelleher commented 6 years ago

> And, good point about not needing to actually remove ancestry in remove_ancestry. It certainly feels cleaner to remove it, but I don't think we actually use the fact that we do (unless it's used in figuring out the number of roots at the end?). Removing them would in principle also allow lower memory usage if we didn't keep the whole output ts in memory, but we're not doing that, either.

I don't think we're using the number of roots at all now, so I think we should be able to get rid of the 'snipping out' bit without affecting other stuff. Might actually speed things up a bit, as we're spending most of our time in this function (in the forward simulation case).