jean997 / cause

R package for CAUSE
https://jean997.github.io/cause/

long runtime #9

Closed changd15 closed 3 years ago

changd15 commented 4 years ago

Hi, when running CAUSE, the total process (parameter estimation + fitting the model) takes about 1.5 hours on average. From the documentation, it seems like CAUSE should run much quicker. I am not getting any errors and CAUSE is producing the expected output, though. Do you know of ways to speed up the parameter estimation and model fitting steps? Thanks!
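For reference, a minimal sketch of the two steps being timed, following the workflow in the package vignette. Here `X`, `varlist`, and `top_vars` are placeholders for the harmonized summary-statistics object, the random SNP set used for parameter estimation, and the pruned top variants; check the package documentation for the exact signatures.

```r
library(cause)

# Step 1: nuisance parameter estimation on ~1 million random SNPs
# (discussed below as the longest step)
system.time(
  params <- est_cause_params(X, varlist)
)

# Step 2: fit the CAUSE model using the estimated parameters
system.time(
  res <- cause(X = X, variants = top_vars, param_ests = params)
)
```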

jean997 commented 4 years ago

How many SNPs are you estimating parameters with? That is the longest step.

changd15 commented 4 years ago

I am using 1 million random SNPs.

jean997 commented 4 years ago

Hmm. 1.5 hours is a bit long, but not outside the range I have seen in my own analyses (in a few instances it took 6 hours, though those are quite unusual). Run time varies from application to application depending on how many components the algorithm tries to put in the parameter distribution.

The code should be able to take advantage of multiple cores, so you could try running it with more. There is also a parameter in est_cause_params called max_candidates that caps the number of components in the bivariate mixture distribution at max_candidates^2. You could check how many components you are getting by looking at the number of unique elements of params_obj$mix_grid$S1 and params_obj$mix_grid$S2, and then cap the maximum at something less than this. I wouldn't recommend going too low, though, since that could lead to a very coarse grid that doesn't represent the data well.
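For example, a quick way to check the quantities mentioned above and, if needed, re-run with a tighter cap. `params_obj` is the object returned by est_cause_params, `X` and `varlist` are placeholders as before, and the value 8 for max_candidates is purely illustrative:

```r
# Number of distinct values in each margin of the bivariate mixture grid
length(unique(params_obj$mix_grid$S1))
length(unique(params_obj$mix_grid$S2))

# Total number of components actually retained in the grid
nrow(params_obj$mix_grid)

# If these are large, cap the candidate grid and re-estimate.
# max_candidates bounds the grid at max_candidates^2 components;
# avoid setting it too low, or the grid may become too coarse.
params_capped <- est_cause_params(X, varlist, max_candidates = 8)
```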

I would be curious to know how many components are in a model that is taking this long. Would you be willing to share the two numbers I described above?

changd15 commented 4 years ago

Thanks for the response! I looked into my params objects: the number of unique elements in S1 is 6, while the number of unique elements in S2 is 3. In your experience, are these numbers particularly large?

jean997 commented 4 years ago

That seems totally reasonable to me. I took a look at the parameter estimates for some of the analyses in the paper to get an idea of ranges for you. Here are some summaries over 297 analyses.

For unique elements of S1 and S2, I saw a median of 7 and a range of 3 to 12. The total number of rows in mix_grid had a median of 16 and a range of 6 to 28 (we don't end up needing all possible pairs of S1 and S2 in the final estimate).

Time: My runs had a median elapsed time of 12.9 minutes with a range of 6.8 to 227.7 minutes, but only 9 runs took longer than an hour, so there is a bit of a long tail here (first and third quartiles: 10.7 and 16.4 minutes). The user time in my analyses is usually 3 to 4 times larger than the elapsed time because I gave my jobs 4 cores and the code is able to run a lot of the calculations in parallel. The user time had a median of 40 minutes with first and third quartiles of 31 and 53 minutes.

So the best way to reduce run time is to give the job more cores. I hope this helps!
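As a sketch of what that could look like, assuming the parallel sections honor the standard mc.cores option used by parallel::mclapply (this is an assumption about the parallel backend, not something confirmed by the thread; check how your installation parallelizes):

```r
library(parallel)

# Tell mclapply-based code how many workers to use; match this to the
# number of cores actually allocated to your job on the cluster.
options(mc.cores = 4)

# Placeholders as in the earlier sketches
params <- est_cause_params(X, varlist)
```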