tskit-dev / pyslim

Tools for dealing with tree sequences coming to and from SLiM.
MIT License
27 stars 23 forks source link

read new section on time units #284

Closed petrelharp closed 2 years ago

petrelharp commented 2 years ago

You can read the new section I wrote at https://tskit.dev/pyslim/docs/latest/time_units.html

The last section ("Converting to SLiM time and back") is not new (although I made edits); however, the stuff before that is new and should have a wide-ish audience. It'd be nice if someone had a read of it to see if you agree and if it makes sense. @bhaller?

bhaller commented 2 years ago

Hey. Catching up on things. First of all, in reading your doc I realized that the doc for SLiM 4 had not been updated correctly to reflect the new policy (which you state correctly in what you wrote) that the defaunt time unit is now "ticks" in all cases. I've just rewritten the doc as follows:

The timeUnit parameter controls the time unit stated in the tree sequence when it is saved (which can be accessed through tskit APIs); it has no effect on the running simulation whatsoever. The default value, NULL, means that a time unit of "ticks" will be used for all model types. (In SLiM 3.7 / 3.7.1, NULL implied a time unit of "generations" for WF models, but "ticks" for nonWF models; given the new multispecies timescale parameters in SLiM 4, a default of "ticks" makes sense in all cases since now even in WF models one tick might not equal one biological generation.) It may be helpful to set timeUnit to "generations" explicitly when modeling non-overlapping generations in which one tick equals one generation, to tell tskit that the time unit does in fact represent biological generations; doing so may avoid warnings from tskit or msprime regarding the time unit, in cases such as recapitation where the simulation timescale is important.

I'll probably rip a new PDF of the manual to reflect this change. But anyhow, you write that it is always "ticks" so this is just a doc problem on my end, and I don't think affects anything you have done/written.

Edits:

Overall, LGTM! Whew!

But I think this review was very helpful for leading to the two ideas above: that SLiM could keep track of parent ages, and that there could be an easy way to rescale a given tree sequence from ticks to generations, given a biological generation time (perhaps automatically derived from SLiM's proposed automatic recording of parent ages!) as a scaling factor. I think if we do those two thing this whole story becomes much, much simpler for users.

I also think that a pass over the SLiM manual's chapter 17, with an eye towards adding mention of this and other real-world issues, and adding cross-links into the pyslim documentation, would be a Very Good Idea. I would suggest that you and I ought to do this kind of separately-but-together: do it one subsection at a time, reading and thinking separately and then having a shared-screen discussion and revision of the manual for that one subsection, then moving on to the next. If we do one subsection at a time, we can spread this work over a longer period of time and make the project less monolithic. It feels important to do, though; SLiM users need to know about these sorts of issues, and need to go to the pyslim doc to read about them, and the existing general statement to "Go read the pyslim doc, it's important" is presumably ignored/forgotten by most users.

petrelharp commented 2 years ago

"Since msprime (by default) runs a continuous-time simulation, there is no harm in adjusting time units in msprime this way, but it is confusing." – so, is there a way to make all this less confusing? Can the user rescale the time used by their tree sequence, in Python, so that then they can do mutation overlay, recapitation, etc., without having these confusing timescale issues because their tree sequence is now in whatever time unit is "natural" for tskit? That seems like it would be the best; maybe that's a feature request for tskit, or for pyslim? Basically a "rescale to units of generations" call, which would be passed a rescaling factor T, and would both rescale all times in the tree sequence, and change the time unit to "generations".

This is tempting, but it's not at all clear how we'd actually implement this. Maybe just having a method to rescale all the times (so one could convert to generations) would make things a lot easier? I'm not totally convinced it would, because 'generations' is more abstract ('years' is better), and the notion of generation time is slippery (depends on context; the correct notion of generation time for effective population size is actually slightly different than for mutation rate...).

bhaller commented 2 years ago

This is tempting, but it's not at all clear how we'd actually implement this. Maybe just having a method to rescale all the times (so one could convert to generations) would make things a lot easier? I'm not totally convinced it would, because 'generations' is more abstract ('years' is better), and the notion of generation time is slippery (depends on context; the correct notion of generation time for effective population size is actually slightly different than for mutation rate...).

Hmm. Well, I certainly don't understand all the ins and outs of this issue. But your doc suggests ways to rescale various parameters passed in to msprime to compensate for the difference between the tree sequence's timescale and msprime's intrinsic timescale. All I'm proposing is to provide a method that does that rescaling up front, rather than for every call to msprime, fixing the timescale of the tree sequence once and for all, so that those rescaling issues are put to rest and the user can stop worrying about them from then on. There are doubtless difficult issues to definitively pin down with respect to this – but those same issues would exist equally for rescaling the parameters passed to msprime, no? So that doesn't seem like an argument against providing such a method – the same objection applies equally to doing the scaling with each call to msprime, no?

petrelharp commented 2 years ago

Hm - perhaps msprime methods could have a generation_time argument; or tree sequences could have a generation_time property. Seems like there's gotchas with either proposal. It's definitely an msprime proposal, though, not a pyslim one.

petrelharp commented 2 years ago

Should this be something that SLiM tracks automatically, as it now tracks lifetime reproductive output automatically?

That could be quite useful, yes! We should do some testing though first, to double check that we're measuring generation time appropriately.

petrelharp commented 2 years ago

"birth happens between the “early()” and “late()” stages" – I would get rid of the quotes around these, here and elsewhere; unnecessary and just makes it harder to read

I think we need something that makes it super clear we're talking about "the early() stage" instead of "an early period of time", if you know what I mean. But, maybe italics + () will do it?

petrelharp commented 2 years ago

I also think that a pass over the SLiM manual's chapter 17, with an eye towards adding mention of this and other real-world issues, and adding cross-links into the pyslim documentation, would be a Very Good Idea.

Yep; this is rather unfortunately a complex topic. (mostly thanks to interfacing with coalescent theory land, really)

bhaller commented 2 years ago

Hm - perhaps msprime methods could have a generation_time argument; or tree sequences could have a generation_time property. Seems like there's gotchas with either proposal. It's definitely an msprime proposal, though, not a pyslim one.

Hmm. I feel like shifting the timescale from SLiM's conception of time (ticks) to msprime's (generations) is very natural as a pyslim feature. I'm surprised you're not on board with this, it seems like a slam dunk to me. Just rescale all the times in the tree sequence by the factor provided by the user, change the time unit tag, and the timescale issue is thereby resolved, almost completely painlessly.

bhaller commented 2 years ago

"birth happens between the “early()” and “late()” stages" – I would get rid of the quotes around these, here and elsewhere; unnecessary and just makes it harder to read

I think we need something that makes it super clear we're talking about "the early() stage" instead of "an early period of time", if you know what I mean. But, maybe italics + () will do it?

Even without the italics I think the parentheses are pretty unambiguous. Anybody who has used SLiM at all knows what that syntactic convention means.

petrelharp commented 2 years ago

I read the content for the WF tab, and then when I switched to the nonWF tab I expected the text below the table to change to talk about nonWF stuff;

Er, it certainly should go away? Here's what I see when I click on "nonWF"? Screenshot from 2022-08-15 06-40-44

petrelharp commented 2 years ago

Hm - perhaps msprime methods could have a generation_time argument; or tree sequences could have a generation_time property. Seems like there's gotchas with either proposal. It's definitely an msprime proposal, though, not a pyslim one.

Hmm. I feel like shifting the timescale from SLiM's conception of time (ticks) to msprime's (generations) is very natural as a pyslim feature. I'm surprised you're not on board with this, it seems like a slam dunk to me. Just rescale all the times in the tree sequence by the factor provided by the user, change the time unit tag, and the timescale issue is thereby resolved, almost completely painlessly.

Good point; I guess I am not totally on board because I would rather work in units of years than generations - it's more concrete. But, maybe it is the right thing to do? One bother is we'd not be able to shift the origination time attribute of mutations (since this is an int).

bhaller commented 2 years ago

I read the content for the WF tab, and then when I switched to the nonWF tab I expected the text below the table to change to talk about nonWF stuff;

Er, it certainly should go away? Here's what I see when I click on "nonWF"?

Yes, I'm talking about the text below the table:

Constrictor 1

I was expecting that discussion of WF models to change to something about nonWF models. If it doesn't, then maybe there could be a more obvious delineation between what portion of the page the tabs control, versus what portion they don't. Anyhow, not a big deal, but it confused me when I was reading.

bhaller commented 2 years ago

Hm - perhaps msprime methods could have a generation_time argument; or tree sequences could have a generation_time property. Seems like there's gotchas with either proposal. It's definitely an msprime proposal, though, not a pyslim one.

Hmm. I feel like shifting the timescale from SLiM's conception of time (ticks) to msprime's (generations) is very natural as a pyslim feature. I'm surprised you're not on board with this, it seems like a slam dunk to me. Just rescale all the times in the tree sequence by the factor provided by the user, change the time unit tag, and the timescale issue is thereby resolved, almost completely painlessly.

Good point; I guess I am not totally on board because I would rather work in units of years than generations - it's more concrete. But, maybe it is the right thing to do? One bother is we'd not be able to shift the origination time attribute of mutations (since this is an int).

Well, even if you want to work in years, the tree sequence from SLiM might be on some other timescale, depending on what a "tick" is, so you might still want to rescale it so you can think in terms of years.

But not being able to rescale the mutation origin times properly is a big problem. Maybe that's the nail in the coffin for this idea. Unfortunate. Maybe if we shift to keeping mutation metadata in a separate place (as I've proposed elsewhere, so that it is recorded just a single time), we should change the origin time to a float, even though from SLiM it is always an integer. Hmm.

petrelharp commented 2 years ago

I moved the stuff after the table to before the table (and rewrote it, as it was somewhat out of date and I'm not sure the code was quite correct).