palaeoware / trevosim

TREvoSim - The [Tr]ee [Evo]lutionary [Sim]ulator program
GNU General Public License v3.0

Speciation delineation #59

Open ms609 opened 2 weeks ago

ms609 commented 2 weeks ago

I'm looking over the revised manuscript. The new tree figure is a great improvement and is much easier to follow, particularly alongside the new description of the Algorithm and Concepts in the documentation pages.

One thing that's making me slightly nervous is that the comparison point for defining a new species is a function of the last new species to be defined. I think that this is likely to cause "bursts" (or at least pairs) of speciation events. An example of this issue is the rise of species 2 in your figure:

image

Almost by definition, as soon as the first speciation event occurs, the next individual of Species zero to mutate will become a new species; indeed, it's possible to see a speciation arise when an individual mutates to become more similar to its ancestral genome. As an example, some of the four lineages of Sp2 ringed in the green box are "latent species": the origin of Sp3 means that as soon as they mutate, in any direction, they will qualify as a new species; and which lineage is identified as the ancestor of the new species is only a function of which happens to mutate first - not which is most similar/different to the new species. Unless the region of genome space becomes vacated, we thus expect trees to continue to generate "different" species that are very similar to species zero, simply as a result of drift away from the initial genome.

Under an anagenetic framework, I would expect to see something like the top tree here: one species/lineage doesn't change much, and occasionally side-populations become different enough to denote separate species:

image

but in TREvoSim, we would see many species defined, with origination points occurring more or less in pairs, and with the one static species exhibiting five different identities:

image

I can see that the alternative of comparing each individual to the ancestral state of the species is also problematic – you'd end up with a plethora of species each time a lineage stepped over the 4-difference threshold. The approach that would feel most consistent with standard taxonomic practice would be to give an individual a new species name when it cannot be accommodated in any existing species definition, i.e. when it is at least 4 different from every previously defined species. I don't think this would completely eliminate the non-independent nature of speciation events, but I think it would reduce the tendency of one speciation event to trigger another – and thus the tendency for the speciation patterns observed to emphasize 'how we define species' rather than 'how evolution produces variation', which I think is the more interesting quantity.
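To make the contrast concrete, here is a minimal sketch of the two rules as they are described in this comment. It is not TREvoSim's code: the binary genomes, the Hamming-distance measure, the `>=` comparison, and the function names `new_species_vs_last` and `new_species_vs_all` are all assumptions made for illustration.

```python
from typing import List, Sequence

Genome = Sequence[int]


def hamming(a: Genome, b: Genome) -> int:
    """Count the positions at which two equal-length genomes differ."""
    if len(a) != len(b):
        raise ValueError("genomes must be the same length")
    return sum(x != y for x, y in zip(a, b))


def new_species_vs_last(mutant: Genome, last_defined: Genome,
                        threshold: int = 4) -> bool:
    """Assumed reading of the current rule: compare the mutant only
    with the most recently defined species."""
    return hamming(mutant, last_defined) >= threshold


def new_species_vs_all(mutant: Genome, defined: List[Genome],
                       threshold: int = 4) -> bool:
    """Suggested rule: name a new species only when the mutant is at
    least `threshold` away from every previously defined species."""
    return all(hamming(mutant, g) >= threshold for g in defined)


# Hypothetical example: species 0, and a species 1 that is 4 sites away.
sp0 = [0, 0, 0, 0, 0, 0, 0, 0]
sp1 = [1, 1, 1, 1, 0, 0, 0, 0]

# A species-0 individual mutates at a single, unrelated site.
mutant = [0, 0, 0, 0, 0, 0, 0, 1]

print(new_species_vs_last(mutant, sp1))        # True: 5 sites from sp1
print(new_species_vs_all(mutant, [sp0, sp1]))  # False: only 1 site from sp0
```

Under these assumptions, the single-site mutant of species zero is the kind of "latent species" described above: it qualifies as new under the first rule purely because species 1 was the last one defined, and does not under the second.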

I'm not sure how well I've explained this – do these reservations make sense?

RussellGarwood commented 3 days ago

As ever, @ms609, I am hugely grateful for your thoughts, and I have spent a good amount of time since your post thinking about this (indeed, the speciation aspect of TREvoSim is something I've considered long and hard over the years as I have developed the software). I've put my thoughts below on:

  1. How likely bursts are to arise through the mechanism you highlight of comparison to the last speciation
  2. How realistic or otherwise this is in light of the empirical data we might want to compare to (~anagenesis vs cladogenesis)
  3. Other options, any or all of which I would be keen to implement - though which is most useful may well depend on the questions the software is likely to be used to answer

Potential impact

Your analysis of the potential problems here is sound, but how much they are likely to affect TREvoSim and its data depends on what happens in a given run. The issue is likely to be more prominent while species zero is still in the simulation - because of its initially large population, species zero differs from the average species once the simulation has run for a while. In contrast, when settling on this approach, my mindset was that (outside the edge effect of species zero, and under the majority of simulation parameters) we likely have different evolving lineages with a low enough genomic diversity / high enough species turnover that the situation you highlight is not going to be too prevalent - i.e. we don't have "latent species" hanging around too long, and when we do, they are unlikely to be selected for duplication as they are probably unfit. Obviously, how true that is depends on playing field size, species difference, and the strength of the coupling between fitness and likelihood of selection for duplication. However, that mindset is based on checking node distribution in trees, where - under the majority of settings I have tried - the nodes are fairly evenly spread, and there isn't much in the way of a saltation dynamic.

~anagenesis vs cladogenesis

How TREvoSim deals with anagenetic as opposed to cladogenetic change is partly based on practical considerations from when I started coding! Then, we were thinking about simulating palaeontological-like data with phylogenetic questions in mind. In that light - in my field at least - if we find one fossil in horizon A and another in horizon B, separated by a million years, and they have any difference in morphology, they would be included as species irrespective of the process by which they differentiated, and this approach worked. As time has passed, it has become clearer that, as well as simulating data, the software might be of utility in answering more ecological/evolutionary questions where we don't want space as a confounding factor (REvoSim is otherwise a better tool as it has a strict biological species concept and is much faster). In that light, the considerations you highlight will have more impact!

Future directions

I would be keen to incorporate a few options for how to define species on the basis of the question one is asking. For instance, the current approach will allow TREvoSim data that matches a bunch of empirical datasets where coding choices dictate that the disparity between terminals is pretty samey. I also find your idea of comparing to all previous speciations in a lineage very attractive, as it would - as you highlight - allow a bit more variation in the simulated data, and associated evolutionary questions about variation. There is a third option, which is to employ a Mayr-inspired reproductive isolation type concept. I originally shied away from this for two reasons:

  1. Computational challenges introduced by having a playing field large enough to allow this, associated with the exhaustive pair-wise comparison required every iteration to identify new species (indeed, this may well completely preclude the approach!)
  2. These are not sexually reproducing populations (cf. REvoSim, where we do this when they are), and thus it is not as obviously appropriate in this case.
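The computational concern in the first reason can be illustrated with a rough sketch of what an exhaustive pairwise pass over a playing field involves. This is an assumption-laden illustration rather than anything from TREvoSim or REvoSim: the genomes are treated as binary lists compared by Hamming distance, and `pairwise_distances` and `comparison_count` are hypothetical names.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Genome = Sequence[int]


def hamming(a: Genome, b: Genome) -> int:
    """Count the positions at which two equal-length genomes differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(field: List[Genome]) -> Dict[Tuple[int, int], int]:
    """Distance between every unordered pair of individuals in the field."""
    return {
        (i, j): hamming(field[i], field[j])
        for i, j in combinations(range(len(field)), 2)
    }


def comparison_count(n: int) -> int:
    """Number of unordered pairs among n individuals: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical playing field of four individuals.
field = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
distances = pairwise_distances(field)
print(len(distances), comparison_count(len(field)))
```

Because the number of pairs grows quadratically, a field of 1,000 individuals would need 499,500 comparisons per iteration under this naive scheme, against 4,950 for a field of 100.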

Nevertheless, I can see that for some ecological-type questions, this may still be attractive (e.g. if we wanted to think about questions at the intersection of morphological evolution and anagenetic vs. cladogenetic change). I would welcome any thoughts you have on this third option!

As it stands, I would like to keep the current approach for v3.0.0 as I have a few ongoing projects that use it, and I would propose adding new speciation modes as options in separate releases.

(I suppose there is a fourth option, which is to compare mutants to modal genomes, but I suspect that might become very complicated, very quickly.)
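For reference, one simple reading of a "modal genome" is the per-site most common value across a species' members. The sketch below assumes that reading, along with list-based genomes and an arbitrary tie-break (smallest value wins); `modal_genome` is a hypothetical name, not an existing TREvoSim function.

```python
from collections import Counter
from typing import List, Sequence

Genome = Sequence[int]


def modal_genome(members: List[Genome]) -> List[int]:
    """Per-site most common value across a species' members.

    Ties are broken in favour of the smallest value, which is an
    arbitrary choice made only to keep the result deterministic.
    """
    if not members:
        raise ValueError("species has no members")
    modal = []
    for site_values in zip(*members):
        counts = Counter(site_values)
        # Highest count first; smallest value wins among equal counts.
        modal.append(min(counts, key=lambda v: (-counts[v], v)))
    return modal


# Hypothetical species with three members.
members = [[0, 1, 1], [0, 0, 1], [1, 1, 0]]
print(modal_genome(members))  # [0, 1, 1]
```

Even in this minimal form, the modal genome shifts whenever membership changes, so it would need recomputing as individuals are added or removed, which hints at the complexity mentioned above.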