stan-dev / rstan

RStan, the R interface to Stan
https://mc-stan.org

Thin or get fewer iterations of only certain parameters (request) #398

Open bnicenboim opened 7 years ago

bnicenboim commented 7 years ago

Summary:

It would be nice to be able to keep fewer samples of some parameters (e.g. predictions or log-likelihoods from the generated quantities block) to avoid fitted models of more than 4 GB.

Description:

I usually run a model with 2000 iterations and 4 chains. I have predicted values and the log-likelihood of every observation in my generated quantities, but I don't really need 4000 samples of each one. If I don't exclude them when I fit a model with stan(...), I end up with a model of more than 4 GB. (As you know, R is not very friendly to huge files.)

(Edited: OK, I see now that it's possible to save a huge CSV file directly, without loading all the samples into memory. It would still be nice to be able to split the file between the main parameters and those that clog R, like likelihoods or predicted values, or to save only the main parameters.)
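To make the size problem concrete, here is a rough back-of-the-envelope sketch. The 4000 draws come from the description above; the number of observations and the number of main parameters are hypothetical, and the estimate only counts 8-byte doubles, ignoring whatever overhead the fitted object adds in R.

```python
# Rough size of the stored draws, ignoring any per-object overhead in R.
BYTES_PER_DOUBLE = 8  # R stores numeric values as 8-byte doubles


def draws_size_bytes(n_draws, n_columns):
    """Bytes needed for an n_draws x n_columns matrix of doubles."""
    return n_draws * n_columns * BYTES_PER_DOUBLE


n_draws = 4000       # total saved draws mentioned in the issue
n_obs = 50_000       # hypothetical number of observations
n_main_params = 20   # hypothetical number of main model parameters

# One predicted value and one log-likelihood per observation.
full = draws_size_bytes(n_draws, n_main_params + 2 * n_obs)
main_only = draws_size_bytes(n_draws, n_main_params)

print(f"all columns:     {full / 1e9:.2f} GB")
print(f"main parameters: {main_only / 1e6:.2f} MB")
```

Under these assumed numbers, almost all of the storage goes to the per-observation quantities rather than to the parameters of interest.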

Reproducible Steps:

Just using model <- stan(...) without excluding parameters

Current Output:

A model with all the samples of all the parameters > 1gb.

Expected Output:

A model with fewer samples of some of the parameters.

RStan Version:

2.14.1

R Version:

3.3.2 (2016-10-31)

Operating System:

Ubuntu 16.04

bob-carpenter commented 7 years ago

Thanks for submitting this as a feature request issue.

Why not just thin your entire sample rather than only some columns of it?

Very soon, we hope to roll out a feature that lets you run just a generated quantities block. Your feature request makes me realize I haven't thought through how thinning would work in that context.

I'm pretty sure all of the posterior analysis code assumes rectangular structures. Your request would involve a rewrite of all of that code.
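As an illustration of what "rectangular" means here, the toy sketch below (made-up numbers and column names, not RStan's actual storage format) treats the sample as a table with one row per draw and one column per quantity. Thinning whole rows keeps the table rectangular; thinning a single column leaves columns of different lengths.

```python
# Four draws (rows) of three quantities (columns): a rectangular table.
columns = ["mu", "sigma", "log_lik[1]"]
draws = [
    [0.1, 1.0, -1.2],
    [0.2, 1.1, -1.3],
    [0.0, 0.9, -1.1],
    [0.3, 1.2, -1.4],
]

# Thinning the whole sample keeps every 2nd row; the result is still rectangular.
thinned_rows = draws[::2]

# Thinning only one column gives columns of different lengths (a ragged structure).
by_column = {name: [row[i] for row in draws] for i, name in enumerate(columns)}
by_column["log_lik[1]"] = by_column["log_lik[1]"][::2]

row_lengths = {len(row) for row in thinned_rows}
column_lengths = {name: len(values) for name, values in by_column.items()}
print(row_lengths)
print(column_lengths)
```

Code that expects every column to have the same number of draws would need to handle the second case separately.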

bnicenboim commented 7 years ago

I thought that it wasn't a good idea to thin parameters if I want to calculate 95% credible intervals. Won't I lose effective sample size as well? (Or am I wrong?)

But I guess that, for my request, being able to run the generated quantities with a subset of my samples will be enough. (I don't really follow the "rectangular structures" thing.)

Any idea of when this new feature you mention will appear?

bob-carpenter commented 7 years ago

Yes, you will lose effective sample size if you thin.

That's true for any quantities you thin.

95% intervals are compute intensive because the lower and upper bounds are only sensitive to the bottom 2.5% and top 2.5% of the draws.
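A small sketch of the arithmetic behind this point: the function below (a hypothetical helper, not part of Stan) counts how many saved draws fall in each 2.5% tail before and after thinning. It counts raw draws only; with autocorrelated chains the effective number is smaller still.

```python
def draws_per_tail(n_draws, thin=1, tail_per_mille=25):
    """Number of saved draws in one tail (25 per mille = 2.5%) after thinning."""
    kept = len(range(0, n_draws, thin))  # draws that survive thinning
    return kept * tail_per_mille // 1000


print(draws_per_tail(4000))           # all 4000 draws
print(draws_per_tail(4000, thin=10))  # every 10th draw
```

With 4000 draws only 100 land in each tail, and thinning by 10 leaves 10, which is why interval endpoints need more draws than a posterior mean does.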

I've learned not to make promises on new feature arrival dates :-)

bnicenboim commented 7 years ago

Ok, so if I can run my model without the generated quantities, and run them separately (with a subset of the samples) then that's the feature I want. Feel free to add a link to the new feature and close this issue.

As always, thanks for the super fast responses. (I still don't understand how the few of you manage to answer everything so fast!) (And by the way, is it OK to use GitHub for feature requests? I have a few more coming.)

bob-carpenter commented 7 years ago

Yes, absolutely it's OK to use GitHub for issues.

Some of us work on Stan full time, so it shouldn't be too surprising. We may need to start redirecting people to things like Stats overflow or whatever it's called for general stats questions. We are getting a bit overwhelmed. I'm intentionally trying to introduce a one-day lag into my responses to give other people a chance to jump in and to cut down on traffic.

bnicenboim commented 2 years ago

I found my own issue. This is not relevant anymore, since generated quantities can now be run without sampling. Feel free to close.
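For readers landing here, the workflow discussed in this thread can be sketched conceptually as follows. This is plain Python with a made-up normal model, not the Stan or RStan API: the parameter names, the `generate_quantities` helper, and all numbers are assumptions for illustration only.

```python
import random

rng = random.Random(1)

# Stand-in for 4000 saved posterior draws of the main parameters.
posterior = [
    {"mu": rng.gauss(0.0, 0.1), "sigma": abs(rng.gauss(1.0, 0.1))}
    for _ in range(4000)
]


def generate_quantities(draw, n_obs, rng):
    """Simulate one predicted value per observation for a single draw."""
    return [rng.gauss(draw["mu"], draw["sigma"]) for _ in range(n_obs)]


subset = posterior[::10]  # keep every 10th draw for the expensive quantities
n_obs = 50                # hypothetical number of observations
y_rep = [generate_quantities(d, n_obs, rng) for d in subset]

print(len(posterior), len(y_rep), len(y_rep[0]))
```

The main parameters keep all 4000 draws, while the per-observation quantities are computed and stored for only 400 of them.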