stan-dev / pystan

PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

Deleting model fits from cache in a simulation setting (where each model is given a different seed) #361

Open coenvdm opened 2 years ago

coenvdm commented 2 years ago

Describe the problem with the documentation

I am working with Stan models in a simulation study, using PyStan, where I fit the same model multiple times with different values for random_seed. I noticed that after fitting, the fit is saved to my cache folder under httpstan/4.4.2/models/"model_name"/fits/"fit_name".

The problem I run into is that my disk gets cluttered with these files, which I don't need. I have tried clearing the folders containing these files manually, but since I am using parallelization, I cannot simply delete entire folders on the fly.

Is there a way to delete fit files after I retrieve the posterior samples that I want, or to keep Stan from saving these files in the first place? I tried the delete_fit function from httpstan.cache, which requires you to specify an identifier for the model (e.g. model_name), which is easy to obtain, and an identifier for the fit (e.g. fit_name), which I am not sure how to obtain (there is a calculate_fit_name function in httpstan.fits, but I cannot get it to work). The documentation on how to use these functions (calculate_fit_name and delete_fit) is not clear to me.

Suggest a potential alternative/fix

Could you provide an example of how to delete model fits from the cache (in a setting where a new model is fitted within each iteration of a for loop)?
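For concreteness, something along these lines (a sketch only: the cache layout follows the path noted above, and the posterior.model_name attribute, fit.to_frame(), and the default ~/.cache location are assumptions, not documented API for this purpose):

import glob
import os

import httpstan
import stan

program_code = "..."  # Stan program (placeholder)

# Default user cache root on Linux; XDG_CACHE_HOME overrides it when set.
cache_root = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))

for seed in range(100):
    data = {...}  # data for this iteration (placeholder)
    posterior = stan.build(program_code, data=data, random_seed=seed)
    fit = posterior.sample(num_chains=4, num_samples=1000)
    draws = fit.to_frame()  # keep the samples you need in memory first

    # posterior.model_name looks like "models/<hash>", so the fit files live
    # under <cache_root>/httpstan/<version>/models/<hash>/fits/.
    fits_dir = os.path.join(
        cache_root, "httpstan", httpstan.__version__,
        *posterior.model_name.split("/"), "fits",
    )
    # Caveat: if parallel workers share the same compiled model, this can
    # race with a fit another worker is still writing; delete with care.
    for path in glob.glob(os.path.join(fits_dir, "*")):
        os.remove(path)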

ahartikainen commented 2 years ago

Tbh, I don't think we have a way to access the identifier, which means there is no good way to do this.

I don't remember if there is any way to turn off the caching.

Have you tried CmdStanPy or do you need logp / grad?

coenvdm commented 2 years ago

I haven't tried CmdStanPy. Would you expect it to have a fix for this problem? Sorry, I am not sure what logp / grad means; could you explain?

riddell-stan commented 2 years ago

There's probably a fix for this. If you have access to a larger (ephemeral) disk, you can set your user cache directory so it uses this disk. I think the environment variable is XDG_CACHE_HOME.
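For example (a sketch; the path is a placeholder, and it assumes httpstan resolves the cache directory from the environment at run time):

import os

# Redirect the httpstan cache to a larger (ephemeral) disk.
# Set this before importing stan so the cache path resolves under it.
os.environ["XDG_CACHE_HOME"] = "/scratch/stan-cache"  # placeholder path

import stan  # noqa: E402  (import after the env var is set)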

You could also create some kind of cron job or run another helper script in the background that deletes things.
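A sketch of such a helper, assuming the cache layout noted in the issue; the age threshold is there so it does not delete a fit that another worker is still writing or reading:

import os
import time

import httpstan

CACHE_ROOT = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
MODELS_DIR = os.path.join(CACHE_ROOT, "httpstan", httpstan.__version__, "models")
MAX_AGE = 15 * 60  # seconds; choose to comfortably exceed one fit's runtime


def sweep_old_fits() -> None:
    """Delete cached fit files older than MAX_AGE seconds."""
    if not os.path.isdir(MODELS_DIR):
        return
    now = time.time()
    for model_id in os.listdir(MODELS_DIR):
        fits_dir = os.path.join(MODELS_DIR, model_id, "fits")
        if not os.path.isdir(fits_dir):
            continue
        for name in os.listdir(fits_dir):
            path = os.path.join(fits_dir, name)
            if now - os.path.getmtime(path) > MAX_AGE:
                os.remove(path)


if __name__ == "__main__":
    while True:  # run in the background alongside the simulation
        sweep_old_fits()
        time.sleep(60)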

eboileau commented 2 years ago

Hi, I'm also having this problem, in fact without even changing the seed. I am testing with a simple regression model: in one script I build it with random data and pickle it; in another script I load the pickled model, build it again with new data, and sample from it, e.g.

import stan
import pickle
import numpy as np  # presumably used when constructing `data` (elided here)

data = {...}  # model data for this run (elided)
model = pickle.load(open("model.pkl", "rb"))  # model built and pickled by the first script
posterior = stan.build(model.program_code, data=data, random_seed=101)
fit = posterior.sample(num_chains=4, num_samples=1000, num_warmup=500, num_thin=1)

Each time I run this script (assuming pystan finds the model in the cache; if the cache has been cleaned, it has to build the model from scratch again), pystan actually writes num_chains files to the cache (under fits, one for each chain)... so you can imagine how quickly hundreds of files accumulate...

Having an option NOT to cache the fits, i.e. to keep Stan from saving these files, would be great...
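In the meantime, one workaround (a sketch, assuming httpstan honors XDG_CACHE_HOME as mentioned above) is to give each run its own throwaway cache and remove it afterwards:

import os
import shutil
import tempfile

# Per-run throwaway cache; must be set before importing stan.
scratch = tempfile.mkdtemp(prefix="httpstan-cache-")
os.environ["XDG_CACHE_HOME"] = scratch

import stan  # noqa: E402

# ... build and sample as in the script above ...

shutil.rmtree(scratch, ignore_errors=True)  # drops every cached model and fit

The obvious cost is that the model is recompiled on every run, since the compiled model is evicted along with the fits.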

riddell-stan commented 2 years ago

There is a change I would welcome here: evict/delete old cached fits if the cache grows beyond a certain limit. In short, intelligently manage the cache.

It's difficult to come up with a robust caching policy. For this reason, we haven't made adding this feature a priority.

eboileau commented 2 years ago

@riddell-stan Thanks for your quick reply.

I don't know how difficult that would be, but an ideal solution would be to have some option on sample, e.g.

fit = posterior.sample(..., cache=False)

but, as you mention, any improvement is obviously welcome.