petrelharp / ftprime_ms

4 stars 2 forks source link

we need a title #13

Closed petrelharp closed 6 years ago

petrelharp commented 6 years ago

Here are some ideas. Bad ideas allowed, since maybe they will lead to good ones.

How to speed up forwards-time simulations by recording absolutely everything

Tree sequences and forwards-time simuation: writing it all down saves you time later

Efficient forwards-time population genetics simulation

Efficient forwards-time simulation with tree sequences

Speed up your population genetics simulations with this one wierd trick

Faster fowards-time population genetics simulations by recording pedigrees

...? @molpopgen @ashander @jeromekelleher

jeromekelleher commented 6 years ago

All of these are excellent, especially the bad ones @petrelharp! I'm guessing there's no point in putting 'tree sequence' in the title since nobody knows what this is. I like:

Any other ideas @molpopgen @ashander ?

petrelharp commented 6 years ago

I'm not saying we need to put 'tree sequence' in, but I think there's a good argument for it: the term is pretty intuitive, and putting it in the title will make more people take notice of it. So then the title reads, effectively, "Efficient forwards-time population genetics simulation using tree sequence (which are defined and explained in this paper)".

But, saying 'pedigrees' describes more effectively what we're actually doing.

molpopgen commented 6 years ago

We need to show what is new-ish in this paper, meaning we need to mention the pedigree aspect. However, that idea is also in the "anafits" paper. The real novelty here is that the approach is largely independent of the simulation engine's inner workings, given that one is ok with working in a Python world. The challenge is to convey this in the title in a way that doesn't alienate non-programmers. Some initial attempts follow:

jeromekelleher commented 6 years ago

We need to show what is new-ish in this paper, meaning we need to mention the pedigree aspect. However, that idea is also in the "anafits" paper. The real novelty here is that the approach is largely independent of the simulation engine's inner workings, given that one is ok with working in a Python world. The challenge is to convey this in the title in a way that doesn't alienate non-programmers. Some initial attempts follow:

I agree that the general framework is new and important, but I'm not sure it's the right thing to emphasise in the title. As you point out, it sounds a bit "programmery", and might put off more theoretically minded people (I'm imaging Nick Barton or Brian Charlesworth reading the title here). We do actually do something pretty new in that we can store the entire history of a forwards-time simulation in a very small amount of space, which I don't think was true for anafits (am I right here @molpopgen?). So, how about

Modulo actually proving this (which I'm sure is true and fairly easy), this is perhaps the most important aspect. The efficiency stems from the fact that we can store the history in a very small amount of space, and then throw mutations on this concise representation afterwards.

molpopgen commented 6 years ago

I agree that the general framework is new and important, but I'm not sure it's the right thing to emphasise in the title. As you point out, it sounds a bit "programmery", and might put off more theoretically minded people (I'm imaging Nick Barton or Brian Charlesworth reading the title here). We do actually do something pretty new in that we can store the entire history of a forwards-time simulation in a very small amount of space, which I don't think was true for anafits (am I right here @molpopgen?). So, how about

The major claims of ana-fits were order of magnitude speed improvement and a massive reduction in memory used. Both are true, to the extent that the code works. (It is very fragile, and increasing the "per-bp" recombination rate causes it to exit with errors, etc. It is also not possible to compile the current github master.)

In msprime terms, ana-fits keeps track of a single table of mutation/recombination/coalescent events. Their Figure 1 shows the general idea. The end result is a very compact graph structure that is written to a binary file. (However, the tools to work with that file segfault on my machine...)

The paper is:

Aberer, Andre J., and Alexandros Stamatakis. 2013. “Rapid Forward-in-Time Simulation at the Chromosome and Genome Level.” BMC Bioinformatics 14 (July): 216.

Representing the history of forwards-time simulations in logarithmic space Modulo actually proving this (which I'm sure is true and fairly easy), this is perhaps the most important aspect. The efficiency stems from the fact that we can store the history in a very small amount of space, and then throw mutations on this concise representation afterwards.

I'm fine with this as a title, if it is correct. My tests are showing O(10x) differences in RAM b/w the C++ and simuPOP-based implementations, though, implying that even if the msprime side is logarithmic, some sims take ~100Gb RAM in practice.

I think we are seeing different things as novel. I see the parts about the storage, etc., as doing what ana-fits does, but in a code base that actually works across a wide range of parameters and will be maintained over time. For me, the exciting thing is the generality.

ashander commented 6 years ago

Trying to catch up and contribute here. What about this?

"Speeding up ~two~ commonly-used fowards-time population genetics simulations by recording pedigrees"

More comments

starting from from the bottom.

generality and logarithmic space

For me, the exciting thing is the generality.

I agree with Kevin it'd be nice to convey that our algorithm is general. My understanding of what this means: it can be hooked up to any (?) forward sim. No only that but we provide implementations that hook up to two widely used forward sims.

On earlier ideas

Comments above each prior suggestion (all of which I like, more or less):

I like this one the best but it's not getting at that generality...

Faster fowards-time population genetics simulations by recording pedigrees

These are a little clunky compared to the one just above. I think "more" causes me to wonder 'than what?' in a way that's annoying, whereas "faster" does not, for whatever reason.

Recording genealogies makes forward-time population genetic simulation more efficient More efficient forwards-time simulation by recording pedigrees

Kevin's suggestions get at the generality but I'm not sure they're as intriguing as the one I like best. Maybe due to more words?

A general framework for improving the efficiency of forward-time simulations via pedigree recording A flexible method for recording pedigrees greatly improves the performance of multiple forward-time simulation engines.

I like how concise this is. I'm not sure every reader we'd like to see this (e.g. theory people who don't do lots of sims, or people interested in questions that this tool allows us to ask) will appreciate the novelty from it on a TOC

Representing the history of forwards-time simulations in logarithmic space

jeromekelleher commented 6 years ago

I'm fine with this as a title, if it is correct. My tests are showing O(10x) differences in RAM b/w the C++ and simuPOP-based implementations, though, implying that even if the msprime side is logarithmic, some sims take ~100Gb RAM in practice.

Well this is only the case because we run several generations before simplify, right? In principle, we could run simplify after every generation, giving O(N) for the current generation and O(rho log N) for the previous history. We will want to run k generations to make things a bit more efficient, so we need O(kN + rho log N). When N is very large, then there's no avoiding kN being big, but it's a conscious time/space tradeoff. We could set k=1 and keep our space requirement to O(N) at the cost of spending more time in simplify.

I'll read the ana-fits paper and see if we can do a theoretical comparison with it.

I'm not strongly pushing for the title about BTW, I'm just trying to crystallise what the key theoretical gains we've made by doing things this way. Of course the generality is an excellent property.

molpopgen commented 6 years ago

Well this is only the case because we run several generations before simplify, right? In principle, we could run simplify after every generation, giving O(N) for the current generation and O(rho log N) for the previous history. We will want to run k generations to make things a bit more efficient, so we need O(kN + rho log N). When N is very large, then there's no avoiding kN being big, but it's a conscious time/space tradeoff. We could set k=1 and keep our space requirement to O(N).

It is possibly also due to storing things in lists vs numpy arrays in ftprime, but @ashander can comment more on that. The container choice by itself can have a huge effect here.

ashander commented 6 years ago

It is possibly also due to storing things in lists vs numpy arrays in ftprime, but @ashander https://github.com/ashander can comment more on that. The container choice by itself can have a huge effect here.

I think this is a big part of it, which I'm planning to eliminate ( https://github.com/ashander/ftprime/issues/41 ) but haven't yet.

It's bubbling closer to the top of my list as our own use of this tool is requiring quite a bit of memory for each run and reducing that would speed things up for us. If it makes a big difference for the framing I can prioritize it.

molpopgen commented 6 years ago

It's bubbling closer to the top of my list as our own use of this tool is requiring quite a bit of memory for each run and reducing that would speed things up for us. If it makes a big difference for the framing I can prioritize it.

Not as important for the framing of the problem, IMO. More an issue of maximizing the efficiency of things in the simuPOP world.

petrelharp commented 6 years ago

This thread is veering away from title ideas. I think our title should say what the important things about our paper are. It doesn't matter if anafits does even the same thing, because we do it better in lots of ways. I also don't think we need to compare to anafits if it is fragile, but we should definitely cite & discuss it.

petrelharp commented 6 years ago

Perhaps we could work in these folks somewhere? https://en.wikipedia.org/wiki/Moirai ... too bad we don't need to name a software package - it'd be a nice analogy.

edit: Although, we could name our method - that's a thing people do. And, it'd give people a thing to grab on to when citing it. ("We present an algorithm, called "moirai", that...")

petrelharp commented 6 years ago

Collecting suggestions:

petrelharp commented 6 years ago

I think we want to say these ingredients:

... and have it not sound like we only do engineering to speed stuff up - but point out we also store all of history in the process.

Er, let's see:

molpopgen commented 6 years ago

Speeding up two commonly-used fowards-time population genetics simulations by recording pedigrees

Pedantic point: fwdpp/fwdpy11 are not widely-used. There are only three forward simulations that can reasonably claim to have been used: sfs_code, slim/slim2, and simuPOP. As for the rest, the majority of citations have nothing to do with actual use.

ashander commented 6 years ago

"Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation."

I like this

As for the rest, the majority of citations have nothing to do with actual

use.

Is there any common pattern in what (non uses) people cite fwdpp paper(s) for?

molpopgen commented 6 years ago

Is there any common pattern in what (non uses) people cite fwdpp paper(s) for?

Most papers describing forward sims say, "I've written a forward simulator. Others exists (references)." Some performance comparisons may be made under neutrality. Some are not replicable. Many of the projects are then never developed further. Maybe 1/2 of fwdpp's citations are use, the rest seem to be either other simulators or reviews. IIRC, all of ana-fits citations are not related to actually using it.

molpopgen commented 6 years ago

Perhaps we could work in these folks somewhere? https://en.wikipedia.org/wiki/Moirai ... too bad we don't need to name a software package - it'd be a nice analogy.

edit: Although, we could name our method - that's a thing people do. And, it'd give people a thing to grab on to when citing it. ("We present an algorithm, called "moirai", that...")

This is (mostly) taken:

Putnam, Patrick P., Philip A. Wilsey, and Ge Zhang. 2015. “Clotho: Addressing the Scalability of Forward Time Population Genetic Simulation.” BMC Bioinformatics 16 (June): 191.

Software here.

petrelharp commented 6 years ago

meh, just as well.

jeromekelleher commented 6 years ago

I like "Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation." Shall we go with this? We can change it later, but we should probably write the paper first!

petrelharp commented 6 years ago

Doing that will require discussing the 'fates', which currently I'm feeling meh about. But, I'll put it in! We can revisit.

On Wed, Nov 15, 2017 at 1:07 AM, Jerome Kelleher notifications@github.com wrote:

I like "Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation." Shall we go with this? We can change it later, but we should probably write the paper first!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/issues/13#issuecomment-344529839, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26eUfEZkxiLxn0cok97Z5KEk3HC7Aks5s2qnvgaJpZM4QatCL .

molpopgen commented 6 years ago

I like "Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation." Shall we go with this? We can change it later, but we should probably write the paper first!

See above--the idea has already been used for a forward sim.

jeromekelleher commented 6 years ago

OK, so drop the prefix? "Efficient pedigree recording for fast population genetics simulation." Seems to tick the boxes.