liamrevell / phytools

GNU General Public License v3.0
207 stars 56 forks source link

pbtree behaviour: taxa-stop criterion not working as expected #80

Closed josephwb closed 3 years ago

josephwb commented 3 years ago

pbtree does not seem to be honouring the 'taxa-stop' criterion. For a given set of parameters, the function produces a tree with far fewer terminals than specified. Here is a plot simulating trees with geiger (sim.bd.tree) and phytools (n=100, b=0.1, d=0.05, nsim=10000):

sim_bd_trees [The plot for TreeSim sim.bd.taxa is identical to that of geiger, and is omitted here for clarity]

Out of 10000 simulations, 2525 had fewer than 100 terminals. This is presumably due to entire trees that stochastically go extinct. The behaviour can be reproduced here (R v3.6.3, phytools_0.7-70):

set.seed(123);
pxx <- pbtree(b=0.1, d=0.05, n=100);

I see that when conditioning on n and t, that there is an option method="rejection" that repeatedly simulates trees until a tree is produced that satisfies the specified conditions. Does this exist for conditioning on n alone? Geiger has this option: if specifying extinct=FALSE trees where all lineages are extinct are not returned, and I think TreeSim just does this automatically.

It is definitely desirable for some questions to return entirely extinct trees (e.g., the 25.5% extinction probability of the parameters above is interesting in itself), but not if one wants a viable tree with which to to do something.

At the moment we can enforce our own conditional wrapper that checks that the 'taxa-stop' criterion has been honoured, but it seems very straightforward to add the option to pbtree itself.

liamrevell commented 3 years ago

This is correct. It's a feature, not a bug. ;)

I decided that a taxon-stopping criterion (i.e., stop when the number of lineages gets to N or the tree goes extinct) should be treated differently than conditional sampling (i.e., sample stochastic trees conditioning on N extant lineages at t, given b & d).

If the user wants to simulate trees with a high extinction rate & a large number of taxa, N, they should know that most of these trees go extinct before getting to N.

liamrevell commented 3 years ago

Ooh. I should also mention that if you condition on b, d, t, & N, you can also set method="direct" and it will be a bit faster.

Lastly, extant.only=TRUE does not work when conditioning on t & N. This is a known bug, not a feature. Haha.

josephwb commented 3 years ago

Thanks for the feedback. This makes sense, and sim.bdtree from geiger works this way by default (i.e., extinct=TRUE). However, this was not clear from the documentation of pbtree; we only noticed it when downstream analyses were producing strange results. Perhaps add a note that trees where all lineages have gone extinct are returned.

I guess I would recommend adding this (the option to only return trees that have not gone extinct) as an option (very easy to implement using just Ntip()) but keeping the default setting as it currently works (like geiger). Up to you. Feel free to close this issue if you disagree.

liamrevell commented 3 years ago

I added a note to the documentation of this.