stan-dev / pystan

PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io
ISC License

How to build a model without data and avoid building at run time? #373

Closed eboileau closed 1 year ago

eboileau commented 1 year ago

This is not a bug per se, but I feel that this question has not been addressed, or at least I can't find a definitive answer. I think it is important enough to deserve its own "bug report", rather than a post on the forums.

From experience, and from #137, or e.g. here, it does NOT seem possible to build a model without data in Pystan3. I don't think #172 is helping to solve the problem, and I can't find out how other developers are wrapping their head around this... I haven't found a suitable solution.

I am moving to Pystan3. Previously, I had several models that were each compiled once (and pickled), e.g. during program installation. During each program execution, these were loaded using pickle.load, and the input for the models was constructed at run time, e.g.

# construct the input for Stan and sample from each model
data = {...}
fit = [
    m.sampling(
        data=data,
        ...,
    )
    for m in models
]

Here, there could be many hundreds of input data sets (so the above snippet is typically called many times, possibly in parallel), and the data are NOT available prior to running the program, i.e. at installation. I don't see how building models hundreds of times, each time you run the program, would make sense (or whether this would even work in parallel).

So any suggestions, ideas, or recommendations would be very welcome. Pystan is great, but I feel this is missing from the documentation, or maybe it needs to be considered more carefully (allowing building with empty data, etc.), as I'm sure this issue is affecting many developers (as can be seen just by googling the question...). Thanks.

riddell-stan commented 1 year ago

Thanks for the comment.

It should be easy to modify the pystan code to do what you describe. There's not much complexity in the pystan source code itself. You could also likely use cmdstan.

In order to make sure that pystan continues to be easy to maintain, we're keeping the set of features limited. If a feature isn't likely to be used by a lot of people, we tend to avoid adding it.

eboileau commented 1 year ago

Thanks for your quick reply.

After trying a few things, it seems possible to build/pickle a model using random data, then load it later and sample from it. Each subsequent call to build with new/different data should, in principle, be relatively quick and without overhead, although I haven't tested it on real data. There are two caveats. (i) This assumes that pystan can find the model and/or that the cache has not been cleaned, which might in general be difficult to control, e.g. on cluster infrastructure. (ii) Another problem, if any, is that providing random data is a bit of a hack... see #143

I think the problem is the requirement to provide data when compiling/building and, to be honest, I don't understand the rationale behind it.

Thanks, I will try to investigate how suitable CmdStan (or CmdStanPy) is, and how difficult it would be to rewrite our programs or modify the pystan code.
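
For reference, the "compile once at installation, sample many times at run time" workflow described above might look roughly like this with CmdStanPy. This is only a sketch under assumptions: the file locations and the two helper functions are hypothetical, CmdStanPy and CmdStan must be installed, and the executable name assumes a Unix-like system (on Windows it carries an .exe suffix).

```python
from pathlib import Path

# Hypothetical locations, chosen for illustration only.
STAN_FILE = Path("models/model.stan")
# Assumption: CmdStan writes the executable next to the .stan file,
# with the same stem and no suffix (Unix-like systems).
EXE_FILE = STAN_FILE.with_suffix("")


def compile_at_install(stan_file=STAN_FILE):
    """Compile the model once, e.g. from an installation script.

    No data is needed at this stage.
    """
    # Imported lazily so this module can be loaded without CmdStanPy.
    from cmdstanpy import CmdStanModel

    model = CmdStanModel(stan_file=str(stan_file))  # triggers compilation
    return model.exe_file


def sample_at_runtime(data, exe_file=EXE_FILE, **sample_kwargs):
    """Load the precompiled executable and sample with run-time data."""
    from cmdstanpy import CmdStanModel

    model = CmdStanModel(exe_file=str(exe_file))  # no recompilation
    return model.sample(data=data, **sample_kwargs)
```

At run time, each worker would call something like sample_at_runtime({"N": 3, "y": [0.1, 0.2, 0.3]}, chains=4) with its own data; the dictionary keys must match the model's data block.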

ahartikainen commented 1 year ago

So the problem you have is calling build multiple times in parallel (maybe on multiple machines?)

In serial you only build once and then use the cache.

One option is to build the model once with filler data, save the resulting httpstan cache, and then restore that cache wherever you next need to run the model (given multiple machines)?
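
A minimal sketch of the filler-data idea, assuming PyStan 3 (imported as stan). The Stan program, the filler values, and the helper names are made up for illustration. It also assumes, as the comments above suggest, that the compiled model is cached by program code, so a later build with different data reuses the compiled module as long as the program text is identical and the cache is intact.

```python
# Illustrative Stan program; any model with a data block behaves the same way.
PROGRAM_CODE = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);
}
"""

# Filler data: only needs to satisfy the declarations in the data block.
FILLER_DATA = {"N": 2, "y": [0.0, 1.0]}


def warm_cache(program_code=PROGRAM_CODE, filler_data=FILLER_DATA):
    """Compile the model once (e.g. at installation) to fill the cache."""
    import stan  # imported lazily; requires PyStan 3

    stan.build(program_code, data=filler_data)


def sample(data, program_code=PROGRAM_CODE, **sample_kwargs):
    """Build with real data and sample.

    Assumption: the build is fast here because the compiled model is
    already in the cache from warm_cache().
    """
    import stan

    posterior = stan.build(program_code, data=data)
    return posterior.sample(**sample_kwargs)
```

Here warm_cache() would run once per machine (or once before copying the cache), and sample(real_data, num_chains=4, num_samples=1000) would run for each data set. Whether concurrent builds in parallel processes are safe is not established by this sketch.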