stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Stan Only Writes 1,000 Samples #148

Closed ctross closed 11 years ago

ctross commented 11 years ago

Dear Team,

I am currently without stable internet access, so I haven't had the time to dig through the archives to see if this issue has been addressed before. I apologize if this possible bug has been reported before.

I am fitting a fairly large multi-level model linking time series of ecological proxies to time series of crop yields. I followed the Stan manual for running parallel chains (I ran 2 instead of 4), setting warmup to 2,000 and iter to 7,000.

After waiting about 60 hours for my Stan model to finish sampling, only 1,000 of the 5,000 post-warmup samples found their way into the .csv files for each chain. Notably, however, the command prompt reported the sampling progress in intervals of 50, as I specified, and reported sampling occurring for all 7,000 iterations. Likewise, the .csv files increased in size over the entire 60-hour period. The 1,000 samples, however, were perfectly consistent with the expected model output (the samples were consistent with the corresponding parameter estimates from when I ran the sub-models of this larger model independently). This leads me to believe that the MCMC algorithm is working correctly, but that some samples are being "lost" somehow after sampling.

This suggests to me that there might be some glitch that is automatically thinning the samples before writing to csv, and limiting the number of recorded samples to the default value of 1,000. This, however, is just a shot in the dark. I'm unsure of how to investigate if this is actually the case.

I'd like to get this project completed, so any advice on how to get the full batch of samples saved would be much appreciated. Stan is the only MCMC sampling engine that has been able to provide stable estimates of the model parameters, with indications of multiple-chain convergence and good mixing, for this model. JAGS, WinBUGS, and OpenBUGS failed to produce signs of convergence, even with runs of 500,000 samples or more.

Notes: Stan 1.3.0; GCC 4.6.3 2011208 prerelease; Windows 7 Premium; Intel i3-2310M

Let me know if any more information is needed,

Best wishes,

maverickg commented 11 years ago

The chains are thinned so that there are 1000 iterations saved by default.

syclik commented 11 years ago

As Jiqiang mentioned, the chains are thinned. If you want the behavior you're looking for, set the thin option to 1.

Closing this issue.
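
For reference, the arithmetic behind the 1,000 saved draws can be sketched as follows. This is a minimal Python sketch, not Stan's actual code: the rule `thin = max(1, (iter - warmup) / 1000)` is an assumption inferred from the numbers reported in this issue, and the helper names `default_thin` and `saved_draws` are hypothetical.

```python
def default_thin(num_iter, num_warmup, target_saved=1000):
    """Assumed default: thin so that roughly target_saved draws are kept.

    This rule is inferred from the behavior described in this issue,
    not copied from Stan's source.
    """
    return max(1, (num_iter - num_warmup) // target_saved)


def saved_draws(num_iter, num_warmup, thin):
    """Number of post-warmup draws written when every thin-th draw is kept."""
    post_warmup = num_iter - num_warmup
    # Ceiling division: assumes draws 0, thin, 2*thin, ... are the ones kept.
    return (post_warmup + thin - 1) // thin


if __name__ == "__main__":
    thin = default_thin(7000, 2000)
    print("default thin:", thin)
    print("saved with default thin:", saved_draws(7000, 2000, thin))
    print("saved with thin=1:", saved_draws(7000, 2000, 1))
```

With iter=7000 and warmup=2000 this gives a thinning interval of 5 and 1,000 saved draws per chain. With thin set to 1, all 5,000 post-warmup draws would be kept.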

bob-carpenter commented 11 years ago

Is there an inference problem where you need more than 1000 samples thinned out of 5000?

What's important for posterior expectations is the effective sample size (n_eff in Stan's output), because that's what determines the MCMC standard error on the estimate.

Thinning 5000 samples to 1000 doesn't usually cut n_eff by a factor of 5. And you usually only need a few hundred effective samples for most inference problems.

So you might be disappointed if you run again and save everything.
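
To illustrate the point about thinning and n_eff, here is a rough Python sketch under a strong simplifying assumption: the chain behaves like a stationary AR(1) process with lag-1 autocorrelation `rho`, for which the asymptotic effective sample size is n(1 - rho)/(1 + rho). Keeping every k-th draw of such a chain gives another AR(1) chain with autocorrelation rho^k. Real Stan output will not follow this exactly, Stan's own n_eff estimate is computed differently, and the function names below are hypothetical.

```python
import math


def ar1_n_eff(n_draws, rho):
    """Asymptotic effective sample size for n_draws from an AR(1) chain.

    Assumes a stationary AR(1) process with lag-1 autocorrelation rho
    (an illustrative model, not how Stan estimates n_eff).
    """
    return n_draws * (1.0 - rho) / (1.0 + rho)


def thinned_n_eff(n_draws, rho, thin):
    """Effective sample size after keeping every thin-th draw.

    Thinning an AR(1) chain by `thin` leaves n_draws // thin draws
    whose lag-1 autocorrelation is rho ** thin.
    """
    kept = n_draws // thin
    return ar1_n_eff(kept, rho ** thin)


def mcse(posterior_sd, n_eff):
    """MCMC standard error of a posterior mean estimate."""
    return posterior_sd / math.sqrt(n_eff)


if __name__ == "__main__":
    for rho in (0.5, 0.9, 0.98):
        full = ar1_n_eff(5000, rho)
        thinned = thinned_n_eff(5000, rho, 5)
        print(f"rho={rho}: n_eff full={full:.1f}, thinned={thinned:.1f}, "
              f"MCSE ratio={mcse(1.0, thinned) / mcse(1.0, full):.3f}")
```

Under this model, the more autocorrelated the chain, the less thinning costs: for strongly correlated draws the thinned and unthinned chains have nearly the same effective sample size, so the MCMC standard error barely changes.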

ctross commented 11 years ago

Bob,

This model has been a pain, to say the least. Given the performance of JAGS and BUGS, I've become nervous about trusting MCMC results for this model's posterior. Even with ~10,000 samples, the effective sample size of the sub-models was low. I'd rather let the model run for a week and find that 5,000 thinned at int=5 was as good as 5,000 to begin with, than the opposite.

One thing I noticed in the output of one chain from this MCMC run was that, halfway through, the tree depth dropped from ~9-10 down to 2 and stayed there. The resulting samples sat almost exactly at the mean of the previous 500 samples, but with almost no variance (a straight horizontal line in the traceplot). Is this indicative of a problem in the model specification? Or does it indicate that the system moved into a high-probability region that was hard to get out of (I'm thinking of Neal's funnel from the Stan manual)?

Best, Cody Ross

PhD Candidate Department of Anthropology University of California One Shields Avenue Davis, CA 95616