Closed petrelharp closed 6 years ago
All of these are excellent, especially the bad ones @petrelharp! I'm guessing there's no point in putting 'tree sequence' in the title since nobody knows what this is. I like:
Efficient forwards-time population genetics simulation
Faster fowards-time population genetics simulations by recording pedigrees
Some ideas:
Recording genealogies makes forward-time population genetic simulation more efficient
More efficient forwards-time simulation by recording pedigrees
Any other ideas @molpopgen @ashander ?
I'm not saying we need to put 'tree sequence' in, but I think there's a good argument for it: the term is pretty intuitive, and putting it in the title will make more people take notice of it. So then the title reads, effectively, "Efficient forwards-time population genetics simulation using tree sequence (which are defined and explained in this paper)".
But, saying 'pedigrees' describes more effectively what we're actually doing.
We need to show what is new-ish in this paper, meaning we need to mention the pedigree aspect. However, that idea is also in the "anafits" paper. The real novelty here is that the approach is largely independent of the simulation engine's inner workings, given that one is ok with working in a Python world. The challenge is to convey this in the title in a way that doesn't alienate non-programmers. Some initial attempts follow:
We need to show what is new-ish in this paper, meaning we need to mention the pedigree aspect. However, that idea is also in the "anafits" paper. The real novelty here is that the approach is largely independent of the simulation engine's inner workings, given that one is ok with working in a Python world. The challenge is to convey this in the title in a way that doesn't alienate non-programmers. Some initial attempts follow:
I agree that the general framework is new and important, but I'm not sure it's the right thing to emphasise in the title. As you point out, it sounds a bit "programmery", and might put off more theoretically minded people (I'm imaging Nick Barton or Brian Charlesworth reading the title here). We do actually do something pretty new in that we can store the entire history of a forwards-time simulation in a very small amount of space, which I don't think was true for anafits (am I right here @molpopgen?). So, how about
Modulo actually proving this (which I'm sure is true and fairly easy), this is perhaps the most important aspect. The efficiency stems from the fact that we can store the history in a very small amount of space, and then throw mutations on this concise representation afterwards.
I agree that the general framework is new and important, but I'm not sure it's the right thing to emphasise in the title. As you point out, it sounds a bit "programmery", and might put off more theoretically minded people (I'm imaging Nick Barton or Brian Charlesworth reading the title here). We do actually do something pretty new in that we can store the entire history of a forwards-time simulation in a very small amount of space, which I don't think was true for anafits (am I right here @molpopgen?). So, how about
The major claims of ana-fits were order of magnitude speed improvement and a massive reduction in memory used. Both are true, to the extent that the code works. (It is very fragile, and increasing the "per-bp" recombination rate causes it to exit with errors, etc. It is also not possible to compile the current github master.)
In msprime terms, ana-fits keeps track of a single table of mutation/recombination/coalescent events. Their Figure 1 shows the general idea. The end result is a very compact graph structure that is written to a binary file. (However, the tools to work with that file segfault on my machine...)
The paper is:
Aberer, Andre J., and Alexandros Stamatakis. 2013. “Rapid Forward-in-Time Simulation at the Chromosome and Genome Level.” BMC Bioinformatics 14 (July): 216.
Representing the history of forwards-time simulations in logarithmic space Modulo actually proving this (which I'm sure is true and fairly easy), this is perhaps the most important aspect. The efficiency stems from the fact that we can store the history in a very small amount of space, and then throw mutations on this concise representation afterwards.
I'm fine with this as a title, if it is correct. My tests are showing O(10x) differences in RAM b/w the C++ and simuPOP-based implementations, though, implying that even if the msprime side is logarithmic, some sims take ~100Gb RAM in practice.
I think we are seeing different things as novel. I see the parts about the storage, etc., as doing what ana-fits does, but in a code base that actually works across a wide range of parameters and will be maintained over time. For me, the exciting thing is the generality.
Trying to catch up and contribute here. What about this?
"Speeding up ~two~ commonly-used fowards-time population genetics simulations by recording pedigrees"
starting from from the bottom.
For me, the exciting thing is the generality.
I agree with Kevin it'd be nice to convey that our algorithm is general. My understanding of what this means: it can be hooked up to any (?) forward sim. No only that but we provide implementations that hook up to two widely used forward sims.
Comments above each prior suggestion (all of which I like, more or less):
I like this one the best but it's not getting at that generality...
Faster fowards-time population genetics simulations by recording pedigrees
These are a little clunky compared to the one just above. I think "more" causes me to wonder 'than what?' in a way that's annoying, whereas "faster" does not, for whatever reason.
Recording genealogies makes forward-time population genetic simulation more efficient More efficient forwards-time simulation by recording pedigrees
Kevin's suggestions get at the generality but I'm not sure they're as intriguing as the one I like best. Maybe due to more words?
A general framework for improving the efficiency of forward-time simulations via pedigree recording A flexible method for recording pedigrees greatly improves the performance of multiple forward-time simulation engines.
I like how concise this is. I'm not sure every reader we'd like to see this (e.g. theory people who don't do lots of sims, or people interested in questions that this tool allows us to ask) will appreciate the novelty from it on a TOC
Representing the history of forwards-time simulations in logarithmic space
I'm fine with this as a title, if it is correct. My tests are showing O(10x) differences in RAM b/w the C++ and simuPOP-based implementations, though, implying that even if the msprime side is logarithmic, some sims take ~100Gb RAM in practice.
Well this is only the case because we run several generations before simplify, right? In principle, we could run simplify after every generation, giving O(N) for the current generation and O(rho log N) for the previous history. We will want to run k generations to make things a bit more efficient, so we need O(kN + rho log N). When N is very large, then there's no avoiding kN being big, but it's a conscious time/space tradeoff. We could set k=1 and keep our space requirement to O(N) at the cost of spending more time in simplify.
I'll read the ana-fits paper and see if we can do a theoretical comparison with it.
I'm not strongly pushing for the title about BTW, I'm just trying to crystallise what the key theoretical gains we've made by doing things this way. Of course the generality is an excellent property.
Well this is only the case because we run several generations before simplify, right? In principle, we could run simplify after every generation, giving O(N) for the current generation and O(rho log N) for the previous history. We will want to run k generations to make things a bit more efficient, so we need O(kN + rho log N). When N is very large, then there's no avoiding kN being big, but it's a conscious time/space tradeoff. We could set k=1 and keep our space requirement to O(N).
It is possibly also due to storing things in lists vs numpy arrays in ftprime, but @ashander can comment more on that. The container choice by itself can have a huge effect here.
It is possibly also due to storing things in lists vs numpy arrays in ftprime, but @ashander https://github.com/ashander can comment more on that. The container choice by itself can have a huge effect here.
I think this is a big part of it, which I'm planning to eliminate ( https://github.com/ashander/ftprime/issues/41 ) but haven't yet.
It's bubbling closer to the top of my list as our own use of this tool is requiring quite a bit of memory for each run and reducing that would speed things up for us. If it makes a big difference for the framing I can prioritize it.
It's bubbling closer to the top of my list as our own use of this tool is requiring quite a bit of memory for each run and reducing that would speed things up for us. If it makes a big difference for the framing I can prioritize it.
Not as important for the framing of the problem, IMO. More an issue of maximizing the efficiency of things in the simuPOP world.
This thread is veering away from title ideas. I think our title should say what the important things about our paper are. It doesn't matter if anafits does even the same thing, because we do it better in lots of ways. I also don't think we need to compare to anafits if it is fragile, but we should definitely cite & discuss it.
Perhaps we could work in these folks somewhere? https://en.wikipedia.org/wiki/Moirai ... too bad we don't need to name a software package - it'd be a nice analogy.
edit: Although, we could name our method - that's a thing people do. And, it'd give people a thing to grab on to when citing it. ("We present an algorithm, called "moirai", that...")
Collecting suggestions:
I think we want to say these ingredients:
... and have it not sound like we only do engineering to speed stuff up - but point out we also store all of history in the process.
Er, let's see:
Speeding up two commonly-used fowards-time population genetics simulations by recording pedigrees
Pedantic point: fwdpp/fwdpy11 are not widely-used. There are only three forward simulations that can reasonably claim to have been used: sfs_code, slim/slim2, and simuPOP. As for the rest, the majority of citations have nothing to do with actual use.
"Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation."
I like this
As for the rest, the majority of citations have nothing to do with actual
use.
Is there any common pattern in what (non uses) people cite fwdpp paper(s) for?
Is there any common pattern in what (non uses) people cite fwdpp paper(s) for?
Most papers describing forward sims say, "I've written a forward simulator. Others exists (references)." Some performance comparisons may be made under neutrality. Some are not replicable. Many of the projects are then never developed further. Maybe 1/2 of fwdpp's citations are use, the rest seem to be either other simulators or reviews. IIRC, all of ana-fits citations are not related to actually using it.
Perhaps we could work in these folks somewhere? https://en.wikipedia.org/wiki/Moirai ... too bad we don't need to name a software package - it'd be a nice analogy.
edit: Although, we could name our method - that's a thing people do. And, it'd give people a thing to grab on to when citing it. ("We present an algorithm, called "moirai", that...")
This is (mostly) taken:
Putnam, Patrick P., Philip A. Wilsey, and Ge Zhang. 2015. “Clotho: Addressing the Scalability of Forward Time Population Genetic Simulation.” BMC Bioinformatics 16 (June): 191.
Software here.
meh, just as well.
I like "Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation." Shall we go with this? We can change it later, but we should probably write the paper first!
Doing that will require discussing the 'fates', which currently I'm feeling meh about. But, I'll put it in! We can revisit.
On Wed, Nov 15, 2017 at 1:07 AM, Jerome Kelleher notifications@github.com wrote:
I like "Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation." Shall we go with this? We can change it later, but we should probably write the paper first!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/petrelharp/ftprime_ms/issues/13#issuecomment-344529839, or mute the thread https://github.com/notifications/unsubscribe-auth/AA_26eUfEZkxiLxn0cok97Z5KEk3HC7Aks5s2qnvgaJpZM4QatCL .
I like "Weaving the fabric of fate: efficient pedigree recording for fast population genetics simulation." Shall we go with this? We can change it later, but we should probably write the paper first!
See above--the idea has already been used for a forward sim.
OK, so drop the prefix? "Efficient pedigree recording for fast population genetics simulation." Seems to tick the boxes.
Here are some ideas. Bad ideas allowed, since maybe they will lead to good ones.
How to speed up forwards-time simulations by recording absolutely everything
Tree sequences and forwards-time simuation: writing it all down saves you time later
Efficient forwards-time population genetics simulation
Efficient forwards-time simulation with tree sequences
Speed up your population genetics simulations with this one wierd trick
Faster fowards-time population genetics simulations by recording pedigrees
...? @molpopgen @ashander @jeromekelleher