Closed Giappo closed 4 years ago
There is something that escapes my comprehension, I will explain.
The goal of this function is to create pir_params and NOT to run a pir_run. This means that there is no true_phylogeny, as pir_run requires a true_phylogeny as well as pir_params, but as INDEPENDENT inputs. This means that the number max_n_tries should make no difference, because its meaning is to try n times to replicate the same numbers of mutations in the twin_alignment. But to produce a twin alignment you need a twin_phylogeny as well as a true_alignment which, in turn, requires a true_phylogeny, which we CANNOT have because we are not yet running pir_run.
TL;DR pirouette is not running, we are just building pir_params. The time grows as max_n_tries increases but there should be no tries going on because there is no phylogeny, only pir_params. Yet something is running max_n_tries times. I suspect is a test and it should not do it so many times. I will investigate further.
The problem is in check_pir_params
And also in check_twinning_params
Digging in -> check_sim_twal_fun
I switched to a simpler test phylogeny and a different root sequence (of higher entropy) in check_sim_twal_fun. Now it seems to go faster but there is one thing that must be kept in mind: this test depends on random variables, as the twin alignment is randomly generated. I thought of adding a set.seed, but we do not want to impose a specific pseudorandomness when checking variables, so I preferred to simplify the test.
I merged your Pull Request, well done :+1:
What is the result of the code profile? I know, these are hard to interpret :man_scientist:
I will take over from here :+1:
Running:
library(pirouette)
filename <- tempfile()
utils::Rprof(filename)
create_std_pir_paramses(n = 1)
utils::Rprof(NULL)
print(utils::summaryRprof(filename))
Results in
$by.self
total.pct
"pirouette::check_experiments" 91.64
"pirouette::check_experiments_candidates_have_same_mcmcs" 74.29
"check_mcmc_values" 58.45
"beautier::check_treelog" 24.02
"check_treelog_values" 23.58
"check_screenlog_values" 23.40
"check_tracelog_values" 22.95
"is_in_range" 12.54
"are_equal_treelogs" 11.21
$by.total
total.time total.pct self.time
"create_std_pir_paramses" 22.48 100.00 0.00
"pirouette::check_experiments" 20.60 91.64 0.02
"pirouette::check_pir_params_data_types" 17.96 79.89 0.00
"pirouette::check_pir_params" 17.96 79.89 0.00
"pirouette::check_experiments_candidates_have_same_mcmcs" 16.70 74.29 0.02
"beautier::are_equal_mcmcs" 16.62 73.93 0.00
"create_std_pir_params" 14.88 66.19 0.00
"beautier::check_mcmc" 13.16 58.54 0.00
"check_mcmc_values" 13.14 58.45 0.04
$sample.interval
[1] 0.02
$sampling.time
[1] 22.48
Number 1 time waster: check_experiments_candidates_have_same_mcmcs
I couldn't speed up check_experiments_candidates_have_same_mcmcs
easily [*], I think beautier::check_mcmc
is the sweet spot.
experiment_1
by experiments[[i]]
would reduce long copies, but that was a misconceptionCreated a repo to let Travis check: https://github.com/richelbilderbeek/pirouette_profile
A repo to let Travis check beautier's speed: https://github.com/richelbilderbeek/beautier_profile
It's fine as it is :+1:
Running create_std_pir_params can take years. I have given a look into it. The bottleneck is here:
twinning_params <- create_twinning_params( sim_twin_tree_fun = get_sim_bd_twin_tree_fun(), sim_twal_fun = get_sim_twal_same_n_muts_fun( mutation_rate = mutation_rate, max_n_tries = 10000 ), twin_evidence_filename = twin_evidence_filename )
pir_params <- create_pir_params( alignment_params = alignment_params, experiments = experiments, twinning_params = twinning_params, evidence_filename = evidence_filename )
The bottleneck disappears as max_n_tries is set to something smaller, like 1e2 or 1e3. This happens because, apparently, it's hard to get the same number of mutations with this parameter setting.
We gotta check if it's ok to reduce the number or, in alternative, if it's better to change other parameters in order to require less tries.