richelbilderbeek / pirouette

R package that estimates the error BEAST2 makes from a given phylogeny
GNU General Public License v3.0
3 stars 2 forks source link

create_std_pir_params is slow #411

Closed Giappo closed 4 years ago

Giappo commented 4 years ago

Running create_std_pir_params can take years. I have given a look into it. The bottleneck is here:

twinning_params <- create_twinning_params( sim_twin_tree_fun = get_sim_bd_twin_tree_fun(), sim_twal_fun = get_sim_twal_same_n_muts_fun( mutation_rate = mutation_rate, max_n_tries = 10000 ), twin_evidence_filename = twin_evidence_filename )

pir_params <- create_pir_params( alignment_params = alignment_params, experiments = experiments, twinning_params = twinning_params, evidence_filename = evidence_filename )

The bottleneck disappears as max_n_tries is set to something smaller, like 1e2 or 1e3. This happens because, apparently, it's hard to get the same number of mutations with this parameter setting.

We gotta check if it's ok to reduce the number or, in alternative, if it's better to change other parameters in order to require less tries.

Giappo commented 4 years ago

There is something that escapes my comprehension, I will explain.

The goal of this function is to create pir_params and NOT to run a pir_run. This means that there is no true_phylogeny, as pir_run requires a true_phylogeny as well as pir_params, but as INDEPENDENT inputs. This means that the number max_n_tries should make no difference, because its meaning is to try n times to replicate the same numbers of mutations in the twin_alignment. But to produce a twin alignment you need a twin_phylogeny as well as a true_alignment which, in turn, requires a true_phylogeny, which we CANNOT have because we are not yet running pir_run.

TL;DR pirouette is not running, we are just building pir_params. The time grows as max_n_tries increases but there should be no tries going on because there is no phylogeny, only pir_params. Yet something is running max_n_tries times. I suspect is a test and it should not do it so many times. I will investigate further.

Giappo commented 4 years ago

The problem is in check_pir_params

Giappo commented 4 years ago

And also in check_twinning_params

Giappo commented 4 years ago

Digging in -> check_sim_twal_fun

Giappo commented 4 years ago

I switched to a simpler test phylogeny and a different root sequence (of higher entropy) in check_sim_twal_fun. Now it seems to go faster but there is one thing that must be kept in mind: this test depends on random variables, as the twin alignment is randomly generated. I thought of adding a set.seed, but we do not want to impose a specific pseudorandomness when checking variables, so I preferred to simplify the test.

richelbilderbeek commented 4 years ago

I merged your Pull Request, well done :+1:

What is the result of the code profile? I know, these are hard to interpret :man_scientist:

richelbilderbeek commented 4 years ago

I will take over from here :+1:

richelbilderbeek commented 4 years ago

Running:

library(pirouette)
filename <- tempfile()
utils::Rprof(filename)
create_std_pir_paramses(n = 1)
utils::Rprof(NULL)
print(utils::summaryRprof(filename))

Results in

$by.self
                                                               total.pct
"pirouette::check_experiments"                                     91.64
"pirouette::check_experiments_candidates_have_same_mcmcs"          74.29
"check_mcmc_values"                                                58.45
"beautier::check_treelog"                                          24.02
"check_treelog_values"                                             23.58
"check_screenlog_values"                                           23.40
"check_tracelog_values"                                            22.95
"is_in_range"                                                      12.54
"are_equal_treelogs"                                               11.21

$by.total
                                                               total.time total.pct self.time
"create_std_pir_paramses"                                           22.48    100.00      0.00
"pirouette::check_experiments"                                      20.60     91.64      0.02
"pirouette::check_pir_params_data_types"                            17.96     79.89      0.00
"pirouette::check_pir_params"                                       17.96     79.89      0.00
"pirouette::check_experiments_candidates_have_same_mcmcs"           16.70     74.29      0.02
"beautier::are_equal_mcmcs"                                         16.62     73.93      0.00
"create_std_pir_params"                                             14.88     66.19      0.00
"beautier::check_mcmc"                                              13.16     58.54      0.00
"check_mcmc_values"                                                 13.14     58.45      0.04

$sample.interval
[1] 0.02

$sampling.time
[1] 22.48

Number 1 time waster: check_experiments_candidates_have_same_mcmcs

richelbilderbeek commented 4 years ago

I couldn't speed up check_experiments_candidates_have_same_mcmcs easily [*], I think beautier::check_mcmc is the sweet spot.

richelbilderbeek commented 4 years ago

Created a repo to let Travis check: https://github.com/richelbilderbeek/pirouette_profile

richelbilderbeek commented 4 years ago

A repo to let Travis check beautier's speed: https://github.com/richelbilderbeek/beautier_profile

richelbilderbeek commented 4 years ago

It's fine as it is :+1: