palaeoware / trevosim

TREvoSim - The [Tr]ee [Evo]lutionary [Sim]ulator program
GNU General Public License v3.0

Rate heterogeneity #41

Open ms609 opened 1 month ago

ms609 commented 1 month ago

This one's unrelated to the review – the more I play with TREvoSim, the more use cases I'm seeing for it! One thing I'd like to do is to simulate some datasets where different characters have different (known) rates. How straightforward would that be to implement – or is there a way to emulate this in the current software?

RussellGarwood commented 1 month ago

This is a great idea, and I must admit it is also one I have given some thought to before. I would be very happy indeed to add it, but there are a few different ways it could be achieved and I would welcome your thoughts on what would align most closely with what you have in mind (there is currently no way of coercing the software into doing this that I can think of).

In terms of implementation, broadly, it would be relatively straightforward: currently the software deals with the number of mutations across all sites in the genome, and calculates the number of mutations per genome per replication from the user-defined probability. To add rate heterogeneity, all I would need to do is change that to a per-character, per-replication basis. There are a few details in how to implement that, though, which could have an impact. Off the top of my head:

Any and all thoughts welcome!

ms609 commented 1 month ago

It sounds like this distinction touches on the interesting and almost philosophical question of what we mean by rate.

Under a "mutating organism" model, if a mutation from 0→1 keeps happening but always makes the organism too unfit to reproduce, then as far as the observed genome is concerned, the observed rate of accumulated changes will be very small (even if the instantaneous mutation rate is high).

Under the "mutating mask" model, conversely, if the rate of mutation is high relative to the rate of change of the mask, then we would expect all taxa to keep pace with the mask itself (and hence we may infer a low rate, as most taxa are the same for the character).

I don't see a non-homogeneous rate in the masks as unrealistic though – is this not akin to certain aspects of an environment changing more rapidly or frequently than others?


I think the most powerful and flexible option is to allow a user to specify a rate for each site, and this is what I would aim to do (by procedurally generating an appropriate settings / rates file). However, I can see that this could be cumbersome to set up using the GUI. I could imagine a GUI option allowing the user to select some distribution of rates. The Mk model usually assumes that rates between characters follow a gamma distribution, but a Gaussian or truncated uniform might also be viable.
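To make the "procedurally generated rates file" idea concrete, here is a minimal sketch of how such a file might be produced. Everything in it is an assumption for illustration: the function names, the two-column CSV layout, and the `site` / `mutation_probability` headers are hypothetical and not a format TREvoSim defines. It draws a mean-1 gamma multiplier for each site, scales a base probability by it, and caps the result at 1.

```python
import csv
import io
import random


def gamma_site_probabilities(n_sites, base_probability, shape, seed=None):
    """Return one mutation probability per site (hypothetical helper).

    Each site gets base_probability times a gamma-distributed multiplier.
    With scale = 1 / shape the multiplier has mean 1, so the average rate
    across sites stays close to base_probability.
    """
    rng = random.Random(seed)
    probabilities = []
    for _ in range(n_sites):
        multiplier = rng.gammavariate(shape, 1.0 / shape)
        # A probability cannot exceed 1, so cap the scaled value.
        probabilities.append(min(1.0, base_probability * multiplier))
    return probabilities


def write_rates_csv(probabilities, handle):
    """Write the probabilities as CSV rows (assumed layout, not TREvoSim's)."""
    writer = csv.writer(handle)
    writer.writerow(["site", "mutation_probability"])
    for site, probability in enumerate(probabilities):
        writer.writerow([site, f"{probability:.6f}"])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_rates_csv(gamma_site_probabilities(8, 0.05, shape=0.5, seed=1), buffer)
    print(buffer.getvalue())
```

A small shape value gives strongly unequal rates (most sites slow, a few fast), while a large one approaches a single shared rate. Swapping the `gammavariate` call for a Gaussian or uniform draw would cover the other distributions mentioned.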

-- ... individuals to pass the probabilities of mutation per site down to their descendants.

This could allow some very interesting explorations! It's beyond what I'd have in mind myself though. There's another interesting parallel in the heritable "tempo" parameter in Budd & Mann's latest model: https://www.biorxiv.org/content/10.1101/2024.02.01.578373v1

RussellGarwood commented 1 month ago

-- I don't see a non-homogeneous rate in the masks as unrealistic though – is this not akin to certain aspects of an environment changing more rapidly or frequently than others?

I suppose it depends really on how directed vs. neutral one's view of molecular evolution is!

I can happily implement something to reflect this discussion over the summer once all changes required by the review are done - whilst the code is still in my head. To that end, would I be correct in concluding that perhaps - as an initial approach - the most attractive option would be to allow you to load a CSV that dictates mutation probabilities for each site in the mask? As an initial effort, I'd assume the same probability across masks for a given site (i.e. your file tells me bit zero has a 10% chance of mutation, this probability is assessed separately and applied as required to each mask at that bit, before moving on to bit 1).
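As a rough illustration of that loop (a sketch only, not TREvoSim code), the per-site pass over the masks might look like the following. The CSV column name `mutation_probability`, the function names, and the representation of masks as lists of 0/1 integers are all assumptions made for the example.

```python
import csv
import io
import random


def read_site_probabilities(handle):
    """Read one probability per site from a CSV (assumed column name)."""
    reader = csv.DictReader(handle)
    return [float(row["mutation_probability"]) for row in reader]


def mutate_masks(masks, site_probabilities, rng):
    """Flip mask bits in place, site by site.

    For each site, the same probability is assessed separately for every
    mask before moving on to the next site.
    """
    for site, probability in enumerate(site_probabilities):
        for mask in masks:
            if rng.random() < probability:
                mask[site] ^= 1  # toggle 0 <-> 1
    return masks


if __name__ == "__main__":
    rates_file = io.StringIO("site,mutation_probability\n0,0.1\n1,0.5\n2,0.9\n")
    probabilities = read_site_probabilities(rates_file)
    masks = [[0, 0, 0], [1, 1, 1]]
    print(mutate_masks(masks, probabilities, random.Random(7)))
```

Because each mask gets its own random draw, two masks can diverge at a fast site even though they share the same probability there.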

I've also long been planning on adding a record of how many mutations per site there are in the genome so stasis through lack of change can be differentiated from stasis through saturation. Would this be of utility for the stuff you have in mind as well?
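A per-site counter of this kind could be sketched as below. This is a toy for a single genome, with hypothetical function names and labels; it ignores inheritance and selection, which the real software would need to handle. The point is only that a tally lets a site that never mutated be told apart from one that mutated and returned to its starting state.

```python
import random


def mutate_with_counts(genome, site_probabilities, counts, rng):
    """Flip each bit with its own probability and tally the flips per site."""
    for site, probability in enumerate(site_probabilities):
        if rng.random() < probability:
            genome[site] ^= 1
            counts[site] += 1


def classify_sites(initial, final, counts):
    """Label each site by comparing start state, end state and flip count."""
    labels = []
    for start, end, flips in zip(initial, final, counts):
        if start != end:
            labels.append("changed")
        elif flips == 0:
            labels.append("static: never mutated")
        else:
            labels.append("static: mutated back")
    return labels


if __name__ == "__main__":
    rng = random.Random(0)
    initial = [0] * 6
    genome = list(initial)
    counts = [0] * 6
    probabilities = [0.0, 0.01, 0.05, 0.2, 0.5, 0.9]
    for _ in range(100):
        mutate_with_counts(genome, probabilities, counts, rng)
    for site, label in enumerate(classify_sites(initial, genome, counts)):
        print(site, counts[site], label)
```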

ms609 commented 1 month ago

That sounds grand, thanks! The mutations per site data would be very useful too.