naturalis / supersmart

Self-Updating Platform for the Estimation of Rates of Speciation, Migration And Relationships of Taxa
MIT License

remove all random elements #56

Open hettling opened 9 years ago

hettling commented 9 years ago

The unit test 'smrt-pipeline.t' shows that there are still some random elements in the pipeline, which makes it difficult to reproduce results exactly across multiple pipeline runs. This becomes apparent especially with small trees for taxa with little data. So far we set random seeds for the tools that we use (ExaML, RAxML, treePL...), but some randomness remains, especially at the beginning of the pipeline:

For the race conditions in parallel mode, I don't know if this is easy to fix, but as a start we could try running the whole pipeline on one core to see if we can exactly reproduce results.

rvosa commented 9 years ago

How do we tag this - is this a show-stopping issue for milestone v1.0?

hettling commented 9 years ago

I don't think it's show-stopping; the pipeline does produce good trees. I think of it as 'nice to have this sorted out at some point', but the simulations are more important at the moment.

hettling commented 9 years ago

Getting rid of randomness in smrt align is quite expensive, since sequences are retrieved sequentially from the database. I made a note of this issue in commit ef7601b

rvosa commented 9 years ago

As far as complete reproducibility is concerned, we also have randomness when resolving trees (currently you cannot set the seed for the random generator that Bio::Phylo uses) and when bootstrapping (same thing). Of course, in principle this can be set in Perl using srand EXPR, which needs to be done very early on (in a BEGIN block?) so that all subsequent calls to rand use that seed.
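To illustrate the seeding idea, here is a minimal sketch in Python rather than the pipeline's Perl. It is not Bio::Phylo code: `resolve_polytomy` and `run` are hypothetical helpers that stand in for any step that draws random numbers. It only shows that seeding the global generator once, before the first draw, makes every later draw repeatable.

```python
import random


def resolve_polytomy(children):
    """Randomly join children pairwise until the node is binary.

    Hypothetical stand-in for a random tree-resolution step; it uses the
    module-level generator, like Perl code calling rand().
    """
    nodes = list(children)
    while len(nodes) > 2:
        # Pick two distinct positions; pop the larger index first so the
        # smaller index is still valid afterwards.
        i, j = sorted(random.sample(range(len(nodes)), 2))
        right = nodes.pop(j)
        left = nodes.pop(i)
        nodes.append((left, right))
    return tuple(nodes)


def run(seed):
    """Seed once, up front, then run the randomised step."""
    random.seed(seed)  # analogous in spirit to Perl's srand(EXPR)
    return resolve_polytomy(["A", "B", "C", "D", "E"])


if __name__ == "__main__":
    print(run(42) == run(42))  # prints True: same seed, same resolution
```

The seed has to be set before anything else touches the generator; a draw made earlier (for example at module load time) would shift the sequence and break repeatability.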

hettling commented 9 years ago

This works now for the backbone: everything up to smrt bbinfer gives reproducible results (except for rotations in the Newick tree).
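Rotations only change the order in which children are written, not the topology, so two Newick strings can be compared after sorting the children of every node. The sketch below is a hedged Python illustration, not part of the pipeline: `canonical_newick` is a hypothetical helper and assumes simple Newick without quoted labels or bracketed comments.

```python
def _parse(s, i):
    """Parse one subtree starting at s[i]; return (node, next_index).

    A node is (children, label), where label keeps any ':length' suffix.
    """
    children = []
    if s[i] == "(":
        i += 1
        while True:
            child, i = _parse(s, i)
            children.append(child)
            if s[i] == ",":
                i += 1
            elif s[i] == ")":
                i += 1
                break
            else:
                raise ValueError("unexpected character at position %d" % i)
    # Everything up to the next structural character is the node's label.
    j = i
    while j < len(s) and s[j] not in ",();":
        j += 1
    return (children, s[i:j]), j


def _render(node):
    """Write a node back out with its children in sorted order."""
    children, label = node
    if not children:
        return label
    parts = sorted(_render(child) for child in children)
    return "(" + ",".join(parts) + ")" + label


def canonical_newick(newick):
    """Return a rotation-independent form of a simple Newick string."""
    s = "".join(newick.split())  # drop all whitespace
    node, i = _parse(s, 0)
    if s[i:] != ";":
        raise ValueError("expected ';' at end of tree")
    return _render(node) + ";"


if __name__ == "__main__":
    a = canonical_newick("((B:1,A:2):0.5,C:3);")
    b = canonical_newick("(C:3,(A:2,B:1):0.5);")
    print(a == b)  # prints True: the trees differ only by rotation
```

Comparing the canonical strings (or their checksums) treats rotated trees as identical while still detecting changes in topology, labels, or branch lengths.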

rvosa commented 9 years ago

Excellent, that means we can meaningfully do a parameter exploration at least for the backbone data mining.

hettling commented 9 years ago

smrt bbdecompose now seems to be reproducible: I computed checksums over the output directories for a few runs and they are the same. Not tested on many datasets yet.
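One way to checksum a whole directory is sketched below in Python. This is an assumption about how such a check could be done, not the procedure actually used here; `directory_digest` is a hypothetical helper. It hashes relative file names and contents in sorted order, so the result does not depend on the order in which the file system lists files.

```python
import hashlib
from pathlib import Path


def directory_digest(root):
    """Return one SHA-256 hex digest covering every file under root."""
    root = Path(root)
    # Sort by relative POSIX path so the traversal order is deterministic.
    files = sorted(
        (path.relative_to(root).as_posix(), path)
        for path in root.rglob("*")
        if path.is_file()
    )
    digest = hashlib.sha256()
    for rel, path in files:
        # Hash the name as well as the content, separated by NUL bytes,
        # so renamed or moved files change the digest.
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Two runs can then be compared with `directory_digest(run1_dir) == directory_digest(run2_dir)`, where `run1_dir` and `run2_dir` are placeholder names for the two working directories. Files that legitimately differ between runs, such as logs with timestamps, would need to be excluded first.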