stan-dev / stan


Using big (50GB) data sets with Stan on a logistic regression model #2221

Closed: ghost closed this issue 7 years ago

ghost commented 7 years ago

Summary:

Using big (50GB) data sets with Stan on a logistic regression model

Description:

I am currently working with PySpark and PyMC3 on a binary classification problem. I would like to use Stan for Bayesian logistic regression on large data sets, on the order of ~50GB with around 100 covariates/features.
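For concreteness, the model I have in mind is standard Bayesian logistic regression; the priors below are placeholders, not a committed choice:

```latex
y_n \sim \mathrm{Bernoulli}\big(\mathrm{logit}^{-1}(\alpha + x_n^\top \beta)\big), \quad n = 1, \ldots, N,
\qquad \alpha \sim \mathcal{N}(0, 5^2), \quad \beta_k \sim \mathcal{N}(0, 1), \quad k = 1, \ldots, K \approx 100.
```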

Questions:

  1. Has anyone had success with such large data sets?
  2. How did you overcome the memory limitations of passing a data set from R and/or Python to Stan, and the fact that Stan is single-threaded?
  3. Assuming I have a machine with 256GB of memory, can Stan read data directly from Hadoop/S3 using C++ without going through R or Python first?

Many thanks, Shlomo.

bob-carpenter commented 7 years ago

Thanks for asking. Stan builds an expression graph for the log density, which requires about 40 bytes per subexpression. So I don't think the graph for a 50GB data set will fit in 256GB.
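A back-of-envelope version of that concern, assuming the 50GB is mostly an N-by-100 design matrix of 8-byte doubles and counting only the one node per multiplication in each dot product:

```latex
N \approx \frac{50 \times 10^{9}\,\mathrm{B}}{100 \times 8\,\mathrm{B}} \approx 6 \times 10^{7} \text{ rows},
\qquad
6 \times 10^{7} \times 100 \times 40\,\mathrm{B} \approx 250\,\mathrm{GB},
```

and that's before the additions, the log-sum-exp, and the rest of each likelihood term.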

The only way to make this fly would be to build a custom logistic regression C++ function with analytic gradients (not that hard).
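A minimal sketch of the idea, as freestanding C++ rather than the actual Stan Math plumbing (the function name and the flat row-major data layout are illustrative assumptions). Because the gradient is computed analytically in the same streaming pass, nothing needs to be stored per data point:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Log likelihood of a Bernoulli-logit model plus its analytic gradient
// with respect to beta, accumulated in a single pass over the data.
double logistic_log_lik(const std::vector<double>& y,     // N labels in {0, 1}
                        const std::vector<double>& x,     // N x K design matrix, row-major
                        const std::vector<double>& beta,  // K coefficients
                        std::vector<double>& grad) {      // out: d log_lik / d beta, size K
  const std::size_t K = beta.size();
  const std::size_t N = y.size();
  grad.assign(K, 0.0);
  double log_lik = 0.0;
  for (std::size_t n = 0; n < N; ++n) {
    double eta = 0.0;  // linear predictor x_n . beta
    for (std::size_t k = 0; k < K; ++k)
      eta += x[n * K + k] * beta[k];
    // Bernoulli-logit log density: y * eta - log(1 + exp(eta)),
    // using log1p for numerical stability at large |eta|.
    const double log1p_exp_eta = eta > 0.0
        ? eta + std::log1p(std::exp(-eta))
        : std::log1p(std::exp(eta));
    log_lik += y[n] * eta - log1p_exp_eta;
    // Analytic gradient term: (y_n - inv_logit(eta)) * x_n.
    const double p = 1.0 / (1.0 + std::exp(-eta));
    for (std::size_t k = 0; k < K; ++k)
      grad[k] += (y[n] - p) * x[n * K + k];
  }
  return log_lik;
}
```

The point is that the sampler then sees one scalar and one length-K gradient for the whole likelihood, instead of billions of autodiff subexpressions.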

But at the point you have 50GB of data and you're fitting 100 covariates, you probably don't need Bayesian methods. Just use a stochastic gradient method that doesn't keep all the data in memory; Vowpal Wabbit will probably work.
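For reference, the kind of update meant here is ordinary SGD on the penalized logistic loss, streaming one example (or one minibatch) at a time, with an L2 penalty lambda standing in for the prior:

```latex
\beta_{t+1} = \beta_t + \eta_t \Big[ \big( y_n - \mathrm{logit}^{-1}(x_n^\top \beta_t) \big)\, x_n - \lambda \beta_t \Big].
```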

I'm going to close this issue, because we limit issues to technical feature specifications that have a clear path to implementation. We discuss bigger issues on the mailing list and preliminary designs on the Wiki. This is related to the stochastic and distributed-data methods we are thinking about.

ghost commented 7 years ago

Dear Bob, Thanks for the prompt reply. I do understand that the mailing list is a better place for this thread; should I open the discussion there?

I have no problem writing a custom LR function, as I am fluent in C++. However, I would like to understand your comment that "you probably don't need Bayesian methods". Can you please elaborate? The whole point from my perspective was to try this directly in Stan; I have a fully working solution using Spark, but it does not involve priors or any Bayesian methods.

Best,

bob-carpenter commented 7 years ago

Sure. But none of this is Stan-related.

The posterior will converge to a delta function around the penalized maximum likelihood estimate (where the prior defines the penalty). So full Bayes (which uses posterior estimation uncertainty in posterior inference) doesn't buy you much over just plugging in a point estimate.
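To spell that out, by Bayes's theorem the log posterior is the log likelihood plus the log prior up to a constant:

```latex
\log p(\beta \mid y) = \log p(y \mid \beta) + \log p(\beta) + \mathrm{const},
```

so the posterior mode is exactly the penalized MLE, with penalty -log p(beta); for example, a normal(0, tau^2) prior acts as a ridge penalty ||beta||^2 / (2 tau^2). And with N in the tens of millions, the posterior standard deviations shrink like N^{-1/2}, so the full posterior is essentially that point.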


betanalpha commented 7 years ago

More importantly, by the time you’ve fit on hundreds of thousands of data points the posterior variances will shrink below the hidden bias from assuming that everyone in your giant sample behaves exactly the same. Then you’ll need to build something more elaborate, like a hierarchical logistic regression, which will cause your parameter count to explode from hundreds to millions, even with just hundreds of thousands of data points. Spark isn’t going to help with that. NUTS will still be your best bet, but it’ll be on the edge of feasibility.
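To see where "hundreds to millions" comes from (the group count J below is purely illustrative): a hierarchical logistic regression with slopes varying across J groups of K covariates each has on the order of

```latex
\underbrace{J \cdot K}_{\text{group-level coefficients}} + \underbrace{2K}_{\text{location and scale hyperparameters}}
\ \text{parameters}, \qquad \text{e.g. } J = 10^{4},\ K = 100 \;\Rightarrow\; \sim 10^{6}.
```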


jgabry commented 7 years ago

More importantly, by the time you’ve fit on hundreds of thousands of data points the posterior variances will shrink below the hidden bias from assuming that everyone in your giant sample behaves exactly the same.

+1. This is a super important point that often goes unmentioned in discussions like this one.

statwonk commented 6 years ago

@betanalpha do you have or know of any resources where I could read more about this?

the hidden bias from assuming that everyone in your giant sample behaves exactly the same

Do you mean exchangeability / the iid assumption?

betanalpha commented 6 years ago

The IID assumption gives you the typical logistic regression. Exchangeability is a weaker assumption that is consistent with heterogeneity in the population, but that gives you hierarchical logistic regression, not regular logistic regression.
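In model terms, sketching the population distribution as normal (one common choice among many):

```latex
\text{IID:} \quad y_n \sim \mathrm{Bernoulli}\big(\mathrm{logit}^{-1}(x_n^\top \beta)\big), \quad \text{one shared } \beta;

\text{Exchangeable groups:} \quad \beta_j \sim \mathcal{N}(\mu, \tau^2), \quad
y_n \sim \mathrm{Bernoulli}\big(\mathrm{logit}^{-1}(x_n^\top \beta_{j[n]})\big),
```

where j[n] is the group of observation n: the groups may differ, but their coefficients are partially pooled through (mu, tau).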