wjchulme opened this issue 3 years ago
run time = ~20 hours
max RAM usage = ~90GB
largest model object memory = ~50GB
How many combinations, exactly?

So the top limiting factor is (as we suspected) RAM - currently we'd probably only be able to run 2 or 3 in parallel.
Please liaise with @evansd about running your test again once he's deployed our new monitoring code - then we can get more accurate data about the memory/CPU use.
Do the model-fitting in Stata and see if there are significant performance improvements. Stata automatically parallelises logistic regression.
How much effort would it be to (roughly) test the speed-up we'd get?
@wjchulme The monitoring code is up and running now so we shouldn't need to do anything special to record stats for your jobs.
> How much effort would it be to (roughly) test the speed-up we'd get?
Completely re-writing the `model_msm_dose1.R` script would be quite a lot of effort. But a head-to-head comparison on one or two of the larger models shouldn't take more than a few hours.
There's also this R package for calling Stata within R, but I suspect that's a Docker nightmare.
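A head-to-head comparison only needs a small timing harness around each run. Here's a minimal sketch in Python for illustration; the `fit_in_r` and `fit_in_stata` callables are hypothetical stand-ins, since the thread doesn't specify how the Stata run would actually be invoked:

```python
import time

def benchmark(fn, *args, repeats=1):
    """Return the best wall-clock time (seconds) over `repeats` calls to fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)  # e.g. a hypothetical fit_in_r or fit_in_stata wrapper
        best = min(best, time.perf_counter() - start)
    return best
```

Taking the best of a few repeats (rather than the mean) reduces noise from caching and other jobs on the machine, though for 4-hour fits a single run each is probably all that's practical.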
@evansd I ran 3 big jobs last night, was it up and running then?
The models/actions in this request: http://jobs.opensafely.org/job-requests/1225/
Sadly not
I can re-run now? I needed to tweak something anyway
Yep, go for it :+1:
@evansd Those 3 RAM-intensive actions have now finished (+ 3 others) http://jobs.opensafely.org/job-requests/1235/
So the two most practical things we can do:
Just to confirm that peak memory usage is, as predicted, 50GB. Here's a not-at-all-readable chart of memory usage over time for each of these jobs in case it's useful to get a sense of the shape:
The killer script in this study is `models_msm_dose1`. This script does quite a few things, but essentially it fits 9 logistic regression models on a dataset with `cohort_size * follow-up days` rows. For the over-80s followed up for 3 months, that's about 90 million rows. Each of the 9 regression models is fit in parallel using the `parglm` function (currently asking for 8 cores), but the script itself is run sequentially, so the next model only begins once the previous model has finished.

I've trimmed as much fat as I can from the input datasets, for example with column type conversion and removing unused variables.
The script itself frees up RAM by deleting large objects that are no longer needed.
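The script is R (the idiom there is `rm()` followed by `gc()`), but the same delete-then-collect pattern can be sketched in Python for illustration:

```python
# Illustrative analogue of freeing RAM mid-script: drop references to large
# intermediates once only a small summary is needed, then force a collection
# so peak memory stays closer to the size of the objects still in use.
import gc

big = list(range(1_000_000))   # stand-in for a large intermediate object
summary = len(big)             # keep only the small result you need
del big                        # drop the reference to the big object...
gc.collect()                   # ...and reclaim its memory promptly
print(summary)                 # → 1000000
```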
The model-fitting is the most time- and memory-consuming process. Some savings could potentially be made, for example by converting some data manipulation operations to `data.table`, which is fast for large data frames, but the resulting data frame will still occupy the same amount of space.

The biggest issue is that this script needs to be run multiple times for different cohorts (eg over-80s, 70-79s), outcomes (positive tests, admissions, deaths), vaccine brands (any, Pfizer, AZ), and clinical subgroups (lots!). So if it takes 4 hours to run the script once, it will take days to run across all combinations (though some are lower priority than others).
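A back-of-envelope count of the sweep, using the example factor levels above (the real priority list may differ, and clinical subgroups would multiply this further):

```python
from itertools import product

cohorts = ["over 80s", "70-79s"]
outcomes = ["positive test", "admission", "death"]
brands = ["any", "Pfizer", "AZ"]

runs = list(product(cohorts, outcomes, brands))
print(len(runs))           # → 18 script runs before clinical subgroups
print(len(runs) * 4 / 24)  # → 3.0 days at 4 hours per run
```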
Example run times
For a dataset of 11,629,272 rows, based on 141,498 patients (that's 20% of study-eligible 80+ year olds), we currently have:
When stratified by sex (ie repeating the script separately for men and women), we have 6,749,986 (female) and 4,879,286 (male) rows, based on 82,088 and 59,410 patients:
So it looks like run time scales nearly perfectly linearly, and memory use scales a bit less than linearly.
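A quick consistency check on the figures above: the sex strata partition the full dataset exactly, so under linear run-time scaling the two stratified runs should take about as long in total as the single combined run:

```python
# Row counts from the comment above; the strata sum exactly to the full dataset.
total_rows = 11_629_272
female_rows, male_rows = 6_749_986, 4_879_286

assert female_rows + male_rows == total_rows
print(round(female_rows / total_rows, 2))  # → 0.58 (female share of the work)
```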
If we ran on a dataset of 1,000,000 patients (eg all 80 year olds), I'd estimate that running once would be:
So that's a few days for specific brands and different outcomes.
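The extrapolation from the 20% sample to ~1,000,000 patients can be sketched as follows, assuming run time scales linearly with patient count; memory scales sub-linearly per the observation above, so the linear figure for RAM would be an upper bound:

```python
sample_patients = 141_498     # 20% sample of study-eligible 80+ year olds
target_patients = 1_000_000

scale = target_patients / sample_patients
print(round(scale, 1))        # → 7.1 (times the current workload)
print(round(scale * 20))      # → 141 hours per run, if the ~20 h run scales linearly
```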
Possible options to improve run time
(`parglm` doesn't support this)