exclude by parameter name prefix and exclude '$raw' by default

jpritikin commented 6 years ago

This is a feature request. It would be cool if parameters could be excluded by partial matching (or regular expression). Furthermore, I suggest that any parameter with a prefix of 'raw' be excluded by default. Ideally this would be built into cmdstan to avoid writing these parameters to disk.
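To make the request concrete, here is a minimal Python sketch of the kind of matching rule being proposed. It is not cmdstan code: the helper `is_excluded` and its arguments are hypothetical, and it assumes output columns are named as in Stan CSV files, with indices appended after a dot (for example `rawTheta.1.2`).

```python
import re

def is_excluded(column, prefixes=("raw",), pattern=None):
    """Return True if a Stan CSV column should be dropped.

    `column` is a header entry such as "rawTheta.1.2"; the base name is
    everything before the first dot. Hypothetical helper, not a cmdstan API.
    """
    base = column.split(".", 1)[0]
    if any(base.startswith(p) for p in prefixes):
        return True
    # Optional regular expression, matched against the whole base name.
    return pattern is not None and re.fullmatch(pattern, base) is not None

# Made-up header entries for illustration.
columns = ["lp__", "rawTheta.1", "rawTheta.2", "thetaCor.1.1", "thetaCor.1.2"]
kept = [c for c in columns if not is_excluded(c)]
print(kept)
```

Matching on the base name rather than the full column string means a single rule covers every element of a vector or matrix parameter.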

syclik commented 6 years ago

Thanks for the request. Can you be more specific? Things that I don't understand:

Why should anything that starts with raw be excluded by default? (what's special about that prefix? is there some text that suggests prefixing variables that shouldn't be saved with raw?)
What do you mean by excluded? One option is to completely avoid writing them to disk, as suggested. I don't think that's a good idea. Do you just not want them reported in stansummary?

Perhaps you can provide a full example? (and using the issue template is useful for that reason)

jpritikin commented 6 years ago

what's special about that prefix?

Nothing special about the prefix 'raw'. Just pick some prefix.

One option is to completely avoid writing them to disk, as suggested. I don't think that's a good idea.

Why isn't it a good idea? Here's an example. Suppose I declare a parameter,

  cholesky_factor_corr[NFACETS] rawThetaCorChol;

But this is not in units that I prefer to interpret. I'd rather see it as a correlation matrix. So in generated quantities block, I have,

  thetaCor = multiply_lower_tri_self_transpose(rawThetaCorChol);

If variables with prefix 'raw' are excluded by default then I only see thetaCor, which is what I want.
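For readers unfamiliar with the transformation in this example: Stan's multiply_lower_tri_self_transpose(L) returns the product of L and its transpose, which turns a Cholesky factor back into the correlation matrix. The plain-Python sketch below reimplements that product for illustration only; the 2x2 values are made up and are not taken from the model in this thread.

```python
def multiply_lower_tri_self_transpose(L):
    """Compute L times its transpose for a square lower-triangular matrix.

    Plain-Python stand-in for the Stan function of the same name; L is a
    list of rows. Entries above the diagonal are ignored.
    """
    n = len(L)
    return [
        [sum(L[i][k] * L[j][k] for k in range(min(i, j) + 1)) for j in range(n)]
        for i in range(n)
    ]

# Cholesky factor of a 2x2 correlation matrix with correlation 0.6
# (illustrative values only).
raw_theta_cor_chol = [[1.0, 0.0],
                      [0.6, 0.8]]
theta_cor = multiply_lower_tri_self_transpose(raw_theta_cor_chol)
for row in theta_cor:
    print([round(x, 6) for x in row])
```

The result has ones on the diagonal and 0.6 off the diagonal, up to floating-point rounding, which is the interpretable quantity the raw parameter encodes.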

syclik commented 6 years ago

Nothing special about the prefix 'raw'. Just pick some prefix.

Got it. Thanks for the clarification.

Why isn't it a good idea?

Thanks for the example. That helps frame the request a bit. Can you dig into your use case a little more? Why do you want to do this? These reasons come to mind, but I'm not sure which is your motivation:

lack of disk space
disk I/O is expensive in your case
network access is expensive
takes too long to compute things based on the output files

My reason for thinking it's a bad idea may not matter in practice. I was thinking that the state of the output might be inconsistent: the adaptation info may not correspond to the draws. But since we're not really doing much with this sort of information, we should do what's good for users.

jpritikin commented 6 years ago

Here is a better example, model2.stan.txt. When I sample this model on one of my datasets (1000 warmup and 1000 samples), the resulting file is 162M. If I discard all the rawTheta parameters then the resulting file reduces to 152M. That's a savings of about 60M for 6 chains. I store these results on rotating storage so I probably save a few hundred milliseconds by discarding rawTheta. Another benefit is that reports from rstan like Rhat can be generated without thinking about which parameters to exclude. I confess these are probably small benefits, but I think it's worth it. It smooths the user experience.
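Until something like this exists in cmdstan, the same effect can be approximated by post-processing the output file. The sketch below is one possible approach, not an existing tool: `drop_columns` is a hypothetical helper, the sample text is made up, and it assumes the usual Stan CSV layout of `#` comment lines plus a single header row. Comment lines pass through unchanged, so any adaptation info in them still refers to the full parameter set rather than the filtered columns, which is the inconsistency raised earlier in the thread.

```python
import csv
import io

def drop_columns(src, dst, prefix="raw"):
    """Copy Stan CSV text from src to dst, dropping columns named prefix*.

    Comment lines (starting with '#') and blank lines are copied as-is.
    The first non-comment row is treated as the header.
    """
    keep = None  # column indices to keep, decided once from the header row
    writer = csv.writer(dst, lineterminator="\n")
    for line in src:
        if line.startswith("#") or not line.strip():
            dst.write(line)
            continue
        row = next(csv.reader([line]))
        if keep is None:
            keep = [i for i, name in enumerate(row) if not name.startswith(prefix)]
        writer.writerow([row[i] for i in keep])

# Made-up miniature output file for illustration.
sample = io.StringIO(
    "# model = model2_model\n"
    "lp__,rawTheta.1,rawTheta.2,theta.1\n"
    "# Adaptation terminated\n"
    "-7.1,0.3,-0.2,1.5\n"
    "-6.9,0.1,0.4,1.7\n"
)
out = io.StringIO()
drop_columns(sample, out)
print(out.getvalue())
```

Because the function reads one line at a time, it can be pointed at open file handles for a large output file without loading it into memory.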

syclik commented 6 years ago

Thank you. That helps quite a bit. Just to get more specific, what does that 60M impact? Are you trying to conserve disk space? Network bandwidth? Or just simplicity in importing into rstan?

To what extent do you use rstan for this example?

On Mon, May 21, 2018 at 9:31 AM, Joshua Pritikin notifications@github.com wrote:

Here is a better example, model2.stan.txt https://github.com/stan-dev/cmdstan/files/2022681/model2.stan.txt. When I sample this model on one of my datasets (1000 warmup and 1000 samples), the resulting file is 162M. If I discard all the rawTheta parameters then the resulting file reduces to 152M. That's a savings of about 60M for 6 chains. I store these results on rotating storage so I probably save a few hundred milliseconds by discarding rawTheta. Another benefit is that reports from rstan like Rhat can be generated without thinking about which parameters to exclude. I confess these are probably small benefits, but I think it's worth it. It smooths the user experience.


bob-carpenter commented 6 years ago

This should probably be something implemented at the services level in stan-dev/stan, then wrapped by stan-dev/cmdstan.

Bob


wds15 commented 6 years ago

Have you considered compressing the CSV files? I think rstan can read the compressed CSV directly, and compressing the CSVs saves you a lot of space. But it was a while ago that I did this, so I hope my memory serves me well.
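As a rough illustration of the compression suggestion, the sketch below gzips a file using only the Python standard library. The helper `gzip_file` and the file name `output.csv` are hypothetical, and the demo data is made up and far more repetitive than real draws, so the ratio it prints says nothing about real output. Whether rstan reads the compressed file directly is as recalled above and is not checked here.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(path):
    """Write a gzip-compressed copy of path to path + '.gz'; return the new path."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return gz_path

# Demo on a throwaway file with made-up, highly repetitive draws.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "output.csv")
with open(csv_path, "w") as f:
    f.write("lp__,theta.1\n" + "-7.1,1.5\n" * 10000)

gz_path = gzip_file(csv_path)
print(os.path.getsize(csv_path), os.path.getsize(gz_path))
```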

jpritikin commented 6 years ago

I use read_stan_csv and then save the resulting object to an rda.

stan-dev / cmdstan

exclude by parameter name prefix and exclude '$raw' by default #619