opensafely / covid-vaccine-effectiveness-research


Improve run-time and RAM efficiency #35

Open wjchulme opened 3 years ago

wjchulme commented 3 years ago

The killer script in this study is models_msm_dose1.

This script does quite a few things, but essentially it fits 9 logistic regression models on a dataset with cohort_size × follow-up days rows. For the over 80s followed up for 3 months, that's about 90 million rows. Each of the 9 regression models is fit in parallel using the parglm function (currently asking for 8 cores), but the script itself runs sequentially, so the next model only begins once the previous model has finished.
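For context, a minimal sketch of the kind of parglm call involved; the formula, dataset name and variable names here are illustrative, not the actual model specification:

```r
library(parglm)

# Illustrative only: the real script fits 9 such models on a person-time
# dataset with one row per patient per day of follow-up (~90 million rows
# for the over 80s).
fit <- parglm(
  outcome ~ vaccine_status + age + sex + region,   # hypothetical formula
  data    = persontime_data,                       # hypothetical dataset name
  family  = binomial(link = "logit"),
  control = parglm.control(nthreads = 8)           # each model gets 8 cores
)
```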

I've trimmed as much fat as I can from the input datasets, for example by converting column types and removing unused variables.

The script itself frees up RAM by deleting large objects that are no longer needed.
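A rough sketch of the kind of trimming and clean-up described above, with hypothetical column and object names:

```r
library(dplyr)

# Keep only the columns the models need, stored as compactly as possible
# (all column names here are illustrative).
data_trimmed <- data_raw %>%
  select(patient_id, tstart, tstop, outcome, vaccine_status, age, sex, region) %>%
  mutate(
    outcome = as.logical(outcome),  # double -> logical
    age     = as.integer(age),      # double -> integer
    sex     = factor(sex),          # character -> factor
    region  = factor(region)
  )

# Release the memory held by the untrimmed dataset once it is no longer needed.
rm(data_raw)
gc()
```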

The model-fitting is the most time- and memory-consuming process. Some savings could potentially be made, for example by converting some data manipulation operations to data.table, which is fast for large data frames, but the resulting data frame will still occupy the same amount of space.
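For example, something along these lines (a sketch only; the actual data manipulation steps aren't shown in this issue):

```r
library(data.table)

# Convert in place rather than copying, then add a follow-up-day index using
# data.table's by-reference syntax (column names are illustrative).
setDT(data_trimmed)
data_trimmed[, day := seq_len(.N), by = patient_id]
```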

The biggest issue is that this script needs to be run multiple times for different cohorts (eg over 80s, 70-79s), outcomes (positive tests, admissions, deaths), vaccine brands (any, Pfizer, AZ), and clinical subgroups (lots!). So if it takes 4 hours to run the script once, it will take days to run across all combinations (though some are lower priority than others).
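(Counting only the combinations listed here, that's 2 cohorts × 3 outcomes × 3 brands = 18 runs, before the clinical subgroups are factored in.)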

Example run times

For a dataset of 11,629,272 rows, based on 141,498 patients (that's 20% of study-eligible 80+ year olds), we currently have:

When stratified by sex (ie repeat script separately for men and women), then we have 6,749,986 (female) and 4,879,286 (male) rows, based on 82,088 and 59,410 patients:

So it looks like run time scales nearly perfectly linearly, and memory use scales a bit less than linearly.

If we ran on a dataset of 1,000,000 patients (eg all 80+ year olds), I'd estimate that running once would be:

So that's a few days for specific brands and different outcomes.

Possible options to improve run time

sebbacon commented 3 years ago

> run time = ~20 hours
> max RAM usage = ~90 GB
> largest model object memory = ~50 GB
>
> So that's a few days for specific brands and different outcomes

How many combinations, exactly?

So the top limiting factor is (as we suspected) RAM - currently we'd probably only be able to run 2 or 3 in parallel.

Please liaise with @evansd about running your test again once he's deployed our new monitoring code - then we can get more accurate data about the memory / CPU use.

> Do the model-fitting in Stata and see if there are significant performance improvements. Stata automatically parallelises logistic regression

How much effort would it be to (roughly) test the speed-up we'd get?

evansd commented 3 years ago

@wjchulme The monitoring code is up and running now so we shouldn't need to do anything special to record stats for your jobs.

wjchulme commented 3 years ago

> How much effort would it be to (roughly) test the speed-up we'd get?

Completely re-writing the model_msm_dose1.R script would be quite a lot of effort. But a head-to-head comparison on one or two of the larger models shouldn't take more than a few hours.

There's also this R package for calling Stata within R, but I suspect that's a Docker nightmare.
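For what it's worth, a rough sketch of what that head-to-head test could look like via an R-to-Stata bridge; the RStata package is one option (not necessarily the package referred to above), and the Stata path, version and model terms are all assumptions:

```r
library(RStata)

# Point RStata at a local Stata installation (path and version are assumptions).
options(RStata.StataPath    = "/usr/local/stata16/stata-mp")
options(RStata.StataVersion = 16)

# Fit one of the larger models in Stata and time it; Stata-MP parallelises the
# logit estimation itself. The model terms here are illustrative.
system.time(
  stata("logit outcome i.vaccine_status age i.sex i.region",
        data.in = persontime_data)
)
```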

wjchulme commented 3 years ago

@evansd I ran 3 big jobs last night, was it up and running then?

The models actions are in this request: http://jobs.opensafely.org/job-requests/1225/

evansd commented 3 years ago

Sadly not

wjchulme commented 3 years ago

I can re-run now? I needed to tweak something anyway

evansd commented 3 years ago

Yep, go for it :+1:

wjchulme commented 3 years ago

@evansd Those 3 RAM-intensive actions have now finished (+ 3 others) http://jobs.opensafely.org/job-requests/1235/

sebbacon commented 3 years ago

So the two most practical things we can do:

  1. Estimate the total run time. Say there was enough memory to run up to 4 jobs simultaneously, how long would the whole thing take? If it's a week, is that too long? What is too long? It may be we can live with that, in which case no further action required.
  2. Investigate parallelisation within R, on the grounds that this might be able to share memory among several processes. Then we could run a load more regressions at once (see the sketch below).
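On the second point, a minimal sketch of what forked parallelism in R could look like (names are illustrative, and this would need testing against how parglm's own threading behaves inside forked workers):

```r
library(parallel)

# On Linux, mclapply() forks the parent R process, so the large person-time
# dataset is shared copy-on-write between workers instead of being copied
# up front. model_formulas is an illustrative list of the 9 model formulas.
fits <- mclapply(
  model_formulas,
  function(f) glm(f, data = persontime_data, family = binomial()),
  mc.cores = 3   # e.g. 3 regressions at once, all reading the same data
)
```

Whether this actually stays within the RAM budget depends on how much each worker modifies the shared data, since any modified pages get copied per process.
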
evansd commented 3 years ago

Just to confirm that peak memory usage is, as predicted, 50GB. Here's a not-at-all-readable chart of memory usage over time for each of these jobs in case it's useful to get a sense of the shape:

[Chart: memory usage over time for each job]