jeetsukumaran / DendroPy

A Python library for phylogenetic scripting, simulation, data processing and manipulation.
https://pypi.org/project/DendroPy/.
BSD 3-Clause "New" or "Revised" License
210 stars 61 forks source link

RNG seed doesn't determine result of `treesim.birth_death_tree` #108

Closed benjaminwilson closed 5 years ago

benjaminwilson commented 5 years ago

Multiple calls to treesim.birth_death_tree passing instances of random.Random with the same seed yield multiple distinct trees. Here is code to reproduce:

import dendropy
import platform

print('dendropy:', dendropy.__version__)
print('python:', platform.python_version())

from dendropy.simulate import treesim
import random

def generate(**kwargs):
    t = treesim.birth_death_tree(birth_rate=1.0,
                                 death_rate=0.5,
                                 num_total_tips=6,
                                 is_retain_extinct_tips=True, **kwargs)
    return t.as_string('newick')

num_trials = 1000

trees = set([generate(rng=random.Random(1)) for _ in range(num_trials)])
print('number of distinct trees generated:', len(trees), '(when passing rng)')

trees = set([generate() for _ in range(num_trials)])
print('number of distinct trees generated:', len(trees), '(when not passing rng)')

Which yields:

dendropy: 4.4.0
python: 3.5.2
number of distinct trees generated: 30 (when passing rng)
number of distinct trees generated: 1000 (when not passing rng)

The output shows that passing the RNG does make a difference ... but not enough of one :)

This is likely related to #88 (perhaps the example there just needs to be run multiple times to produce the issue).

jeetsukumaran commented 5 years ago

Hi Benjamin,

Thank you for bringing this to our attention. The issue is due to the fact that, while running the processing, set objects are used to track the pool of extinct and extant tip nodes. As you know, order is not maintained by these containers. Nodes are popped off these as required, but this means that while the process itself is identical under the same random seed as advertised, the specific nodes that get selected for, e.g., birth events vary due to the container semantics.

The solution to this is to adopt a list. But this is going to have a performance hit in a couple of places.

jeetsukumaran commented 5 years ago

Ok, maybe give a06f95e2 a spin. I've changed the node pool containers to lists, and, with the order invariance, the RNG seed dependent determinism now holds. Not sure what the performance penalties are at this point, though.

benjaminwilson commented 5 years ago

Thanks for the quick fix! I can confirm that this resolves the indeterminancy. A quick benchmark seems to indicate no performance penalty:

%timeit t = treesim.birth_death_tree(birth_rate=1.0, death_rate=0.5, num_total_tips=100, is_retain_extinct_tips=True)

Before: 9.6 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

At a06f95e:

9.34 ms ± 51.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

jeetsukumaran commented 5 years ago

Great. I've merged this into the development-master branch.