seattleflu / incidence-mapper

R interface to database, map model training, and model data API Server
MIT License

GEOID x time models need too much memory #21

Open famulare opened 5 years ago

famulare commented 5 years ago

Documenting for future reference.

Issue: 9276b734543daa6529025904f556900303e4f3ad

Fitting King County at the census-tract level for 9 months and six pathogens requires at least 120 GB of memory. I knew the memory footprint would be an issue, but I didn't have a good way to estimate it, so now we know.

The short-term approach is to fit each pathogen one at a time instead of all pathogens simultaneously. This doesn't allow borrowing across pathogens to estimate the latent field hyperparameters, but whether that borrowing is a good idea is an open modeling question, and benchmarking against simulated data is a longer-term to-do.
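A minimal sketch of the one-pathogen-at-a-time loop, written in Python purely for illustration (the repository itself is R). The row layout, the pathogen and tract labels, and the `fit_each_pathogen` and `mean_count` names are all hypothetical; `mean_count` only stands in for the real model fit.

```python
import gc
from collections import defaultdict


def fit_each_pathogen(rows, fit_model):
    """Fit one model per pathogen, keeping only a small summary of each fit.

    rows: iterable of (pathogen, geoid, time, count) tuples (hypothetical layout)
    fit_model: callable taking one pathogen's rows and returning a summary
    """
    by_pathogen = defaultdict(list)
    for row in rows:
        by_pathogen[row[0]].append(row)

    summaries = {}
    for pathogen, subset in sorted(by_pathogen.items()):
        # Only this pathogen's model is live while it is being fit.
        summaries[pathogen] = fit_model(subset)
        # Encourage release of fit intermediates before the next fit.
        gc.collect()
    return summaries


def mean_count(subset):
    # Stand-in for the real model fit: average count per row.
    return sum(r[3] for r in subset) / len(subset)


# Made-up toy data, not from the study.
rows = [
    ("flu_a", "tract_1", 1, 4),
    ("flu_a", "tract_2", 1, 2),
    ("rsv", "tract_1", 1, 1),
]
print(fit_each_pathogen(rows, mean_count))
```

The trade-off is the one described above: each fit sees only its own pathogen's rows, so nothing is shared across pathogens when estimating hyperparameters.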


It turns out the problem was greatly exacerbated by a major bug in appendSmoothModel, fixed here: fdeeb1d81310b9e1c89eb0154b07584c4cd24aef

The issue of the appropriateness of borrowing power across pathogens still stands.

famulare commented 5 years ago

GEOID x time for a single pathogen over 9 months now peaks at roughly 12 GB during the linear combinations "compilation step".
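One way to get a peak-memory figure like this without watching the process by hand is to run the fit as a child process and read the peak resident set size afterwards. A rough sketch in Python, assuming a POSIX system; `peak_child_rss_kib` is a hypothetical helper name, and the command passed in would be whatever launches the model run.

```python
import resource
import subprocess
import sys


def peak_child_rss_kib(cmd):
    """Run cmd to completion and return the peak resident set size over all
    child processes this process has waited on so far.

    On Linux ru_maxrss is reported in KiB; on macOS it is in bytes.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

For example, `peak_child_rss_kib([sys.executable, "-c", "x = b'x' * (64 * 1024**2)"])` runs a child that fills 64 MiB and returns a value reflecting at least that much. Because the counter is a running maximum over all waited-for children, it is most meaningful when each measurement is taken from a fresh parent process.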

famulare commented 5 years ago

For a badly conditioned model, I'm seeing memory overflow: 50 GB in RAM and 50 GB in swap. The swap contents might just duplicate what is in RAM, but either way it's bad. Granted, badly conditioned models are not useful, so I should find some way to kill those automatically and report it out...
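One possible way to kill runaway fits automatically is to launch each fit as a child process with a hard cap on its address space, then report the failure instead of letting it spill into swap. The sketch below is in Python and assumes a POSIX system (it relies on `resource.setrlimit` and `preexec_fn`); `run_with_memory_cap` is a hypothetical name, and the cap applies to virtual address space (`RLIMIT_AS`), which is a coarser measure than resident memory.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_bytes, timeout_s=None):
    """Run cmd in a child process whose address space is capped at max_bytes.

    Returns (ok, report). ok is True when the child exits cleanly, in which
    case report is its stdout; otherwise report is a short failure message.
    """
    def cap_memory():
        # Runs in the child just before exec (POSIX only).
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = max_bytes
        else:
            soft = min(max_bytes, hard)
        # Lower only the soft limit; raising the hard limit needs privileges.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    try:
        proc = subprocess.run(
            cmd,
            preexec_fn=cap_memory,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "killed: exceeded time limit"

    if proc.returncode != 0:
        lines = proc.stderr.strip().splitlines()
        last = lines[-1] if lines else "no stderr"
        return False, f"failed (exit {proc.returncode}): {last}"
    return True, proc.stdout.strip()
```

With a 1 GiB cap, a child that tries to allocate 4 GiB fails at the allocation and comes back as `(False, "failed (exit 1): MemoryError")`, while a small job returns `(True, <its stdout>)`. The optional `timeout_s` covers fits that stay under the cap but never converge.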

tsibley commented 5 years ago

It's possible to spin up cloud machines provisioned with very large amounts of RAM. I don't know if relying on those makes sense long-term, but it might be a stopgap for now to avoid dealing with memory usage.

famulare commented 5 years ago

True, that's an option, but not urgent. Given the first version of the incidence model and the expected amount of data, I'm not that excited about the scientific validity at the census-tract scale. On the simulated data, the current version of the models is very noisy, and maps based on the best fit look over-fit. This isn't at all surprising, and things are better behaved as we aggregate up (also expected). There are many paths forward to better-behaved models, and I'm optimistic we'll get better performance from them.