coenvdm opened 2 years ago
Tbh, I don't think we have a way to access the identifier, which means there is no good way to do this.
I don't remember if there is any way to turn off the caching.
Have you tried CmdStanPy, or do you need logp / grad?
I haven't tried CmdStanPy. Would you expect that to have a fix for this problem? Sorry, I am not sure what logp / grad is, could you explain?
There's probably a fix for this. If you have access to a larger (ephemeral) disk, you can set your user cache directory so it uses this disk. I think the environment variable is XDG_CACHE_HOME.
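For example, something like this before importing stan (just a sketch; /mnt/scratch is a placeholder path, and this assumes httpstan resolves its cache directory via the XDG convention on Linux):

import os

# Redirect the user cache to a larger ephemeral disk. This must happen
# before importing stan so that httpstan picks the directory up.
os.environ["XDG_CACHE_HOME"] = "/mnt/scratch/.cache"  # placeholder path

import stan  # cached models/fits should now land under /mnt/scratch/.cache/httpstan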
You could also set up a cron job or run a helper script in the background that deletes old fit files, e.g.:
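(A sketch only; it assumes the default Linux cache location and the httpstan/<version>/models/<model>/fits layout described later in this thread.)

import time
from pathlib import Path

MAX_AGE = 60 * 60  # delete cached fit files older than one hour
cache_root = Path.home() / ".cache" / "httpstan"

for fit_file in cache_root.glob("*/models/*/fits/*"):
    if time.time() - fit_file.stat().st_mtime > MAX_AGE:
        fit_file.unlink(missing_ok=True)  # tolerate concurrent deletions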
Hi, I'm also having this problem, in fact without even changing the seed. I am testing with a simple regression model: it is built and pickled with random data in one script, then in another script I load the model, build it again with new data, and sample from it, e.g.
import pickle

import numpy as np
import stan

data = {...}  # new data for this run

# Load the pickled model and rebuild it with the new data.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

posterior = stan.build(model.program_code, data=data, random_seed=101)
fit = posterior.sample(num_chains=4, num_samples=1000, num_warmup=500, num_thin=1)
Each time I run this script (assuming pystan finds the model in the cache; if the cache has been cleaned, it has to build again from scratch), pystan actually writes num_chains files to the cache (under fits, one for each chain), so you can imagine how quickly hundreds of files accumulate...
Having an option to NOT cache the fits, i.e. to keep Stan from saving these files, would be great...
There is a change I would welcome here: evict/delete old cached fits if the cache grows beyond a certain limit. In short, intelligently manage the cache.
It's difficult to come up with a robust caching policy. For this reason, we haven't made adding this feature a priority.
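For illustration, a size-capped eviction pass could look something like this sketch (not httpstan code; the cache path and the 1 GiB cap are arbitrary assumptions):

from pathlib import Path

LIMIT_BYTES = 1 << 30  # assumed 1 GiB cap on cached fits
cache_root = Path.home() / ".cache" / "httpstan"

# Oldest fit files first, so eviction behaves like a simple LRU.
fits = sorted(cache_root.glob("*/models/*/fits/*"),
              key=lambda p: p.stat().st_mtime)
total = sum(p.stat().st_size for p in fits)
for p in fits:
    if total <= LIMIT_BYTES:
        break
    total -= p.stat().st_size
    p.unlink()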
@riddell-stan Thanks for your quick reply.
I don't know how difficult that would be, but an ideal solution would be some option on sample, e.g.

fit = posterior.sample(..., cache=False)

but any improvement along the lines you mention is obviously welcome.
Describe the problem with the documentation
I am working with Stan models in a simulation study, using PyStan, where I fit the same model multiple times with different values for random_seed. I noticed that after fitting, the fit is saved to my cache folder under httpstan/4.4.2/models/"model_name"/fits/"fit_name".
The problem I run into is that my disk gets cluttered with these files, which I don't need. I have tried clearing the folders containing these files manually, but since I am using parallelization, I cannot just delete entire folders while other runs are still in progress.
Is there a way to delete fit files after I retrieve the posterior samples that I want, or to keep Stan from saving these files in the first place? I tried using the delete_fit function from httpstan.cache, which requires you to specify an identifier for the model (e.g. model_name), which is easy to obtain, and an identifier for the fit (e.g. fit_name), which I am not sure how to obtain (there is a calculate_fit_name function in httpstan.fits, but I cannot get it to work). The documentation on how to use these functions (calculate_fit_name and delete_fit) is not clear to me.
Suggest a potential alternative/fix
Could you provide a use case on how to delete model fits from cache (in a setting where a new model is fitted within each iteration of a for-loop)?
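For concreteness, here is a rough sketch of what I am after. It bypasses the httpstan functions entirely and just deletes the fit files from the cache layout described above; the ~/.cache/httpstan path assumes Linux, and the "models/<id>" form of model_name is my guess based on the folder names.

import shutil
from pathlib import Path

import stan

# Toy model; stands in for the real simulation-study model.
program_code = "parameters { real y; } model { y ~ normal(0, 1); }"
cache_root = Path.home() / ".cache" / "httpstan"  # default location on Linux

for seed in range(10):
    posterior = stan.build(program_code, random_seed=seed)
    fit = posterior.sample(num_chains=4, num_samples=1000)
    draws = fit.to_frame()  # keep the samples I actually need

    # Delete this model's cached fit files before the next iteration.
    model_id = posterior.model_name.split("/")[-1]
    for fits_dir in cache_root.glob(f"*/models/{model_id}/fits"):
        shutil.rmtree(fits_dir, ignore_errors=True)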