stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Two-Parameter Pareto Distribution #580

Closed by bob-carpenter 9 years ago

bob-carpenter commented 10 years ago

Originally suggested by Avraham Adler on RStan issue https://github.com/stan-dev/rstan/issues/53 and moved here:

> The Pareto distribution currently in Stan is the one-parameter version (also known as the single-parameter Pareto, A.4.1.4 (pdf), or Pareto Type I). In actuarial science, the Pareto is one of the most commonly used long-tailed distributions, but primarily in its two-parameter version, A.2.4.1, also known as Pareto Type II or Lomax. I've looked at pareto.hpp, and the code is way beyond my meager skills. How difficult would it be to add this flavor of Pareto to Stan?

andrewgelman commented 10 years ago

Speaking of priorities . . . I think that new distributions like this do not need to be high priority at all! But maybe they are so easy to implement that it’s no big deal, I don’t know. For this sort of specialized application, it would seem ideal to get some outsider to program it, rather than it taking up the finite time of the core team.

A

bob-carpenter commented 10 years ago

Agreed.

They're more trivial than new output formats, but less trivial than adding functions, adding integrators to solve diff eqs, solving implicit equations, creating a whole new auto-diff or even finishing the old auto-diff, RHMC, MML, EP, labeling statements, VB, new convergence diagnostics, etc.

aadler commented 10 years ago

Getting the Pareto Type II into a Stan model can already be done with a custom call to increment_log_prob, of course, which makes this technically unnecessary. However, a custom function is likely to be slower and lacks the argument checks that built-in functions have. By all means, give this an extremely low priority, but if someone wants to tackle what I hope would be a simple project for someone who knows C, that would be great.
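As a rough illustration of what such a custom log density has to compute, here is a minimal sketch in Python rather than Stan. It assumes the Lomax parameterization with scale `lam` and shape `alpha` and density f(y) = (alpha / lam) * (1 + y / lam)^-(alpha + 1) for y >= 0. The function name `lomax_lpdf` and the specific argument checks are hypothetical and are not taken from Stan's source.

```python
import math


def lomax_lpdf(y, lam, alpha):
    """Log density of the Pareto Type II (Lomax) distribution, location 0.

    Hypothetical helper for illustration only; not part of Stan.
    f(y | lam, alpha) = (alpha / lam) * (1 + y / lam) ** -(alpha + 1), y >= 0
    """
    # Argument checks of the kind a built-in density would perform.
    if not (lam > 0 and math.isfinite(lam)):
        raise ValueError("scale lam must be positive and finite")
    if not (alpha > 0 and math.isfinite(alpha)):
        raise ValueError("shape alpha must be positive and finite")
    if y < 0:
        return -math.inf  # outside the support
    # log1p keeps precision when y / lam is small.
    return math.log(alpha) - math.log(lam) - (alpha + 1.0) * math.log1p(y / lam)
```

In a Stan model the same expression would be the quantity added to the log probability; the sketch only shows the arithmetic and the kind of checks a hand-written version tends to omit.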

andrewgelman commented 10 years ago

Yes, I agree that it would be great for projects like this to be done by people who have a stake in the results!

andrewgelman commented 10 years ago

Bob, I’m not thrilled with that list because it mixes different sorts of tasks. Here’s a rough categorization.

  • Things we can definitely do, they’re just implementation, not research: new output formats, new distributions, finishing the old auto-diff
  • Things that are not statistics or computing research but require some implementation decisions: adding functions, adding integrators to solve diff eqs, the new auto-diff, cleaning up rstan
  • Things that require a very small amount of statistics/computing research: mle with standard errors
  • Things that require some research but we’re pretty sure are doable: marginal mle, maybe rhmc falls in this class too
  • Things that we don’t really know how to do, truly vaporware at this point: black-box EP, black-box VB

I realize the above divisions are only approximate. I just don’t want the items on the bottom of the list to intimidate us from doing items in the middle of the list. Some of my frustration comes up when people say we can’t do things that we really can do. For example, I had the impression that we couldn’t save precompiled models from one R session to the next, but it turns out that we can.

My impression is that we do have a plan for now, with three strands: (a) Beginning to add functions to Stan, with an eye toward implementing differential equations as Stan functions. (b) Implementing 2nd derivatives using the existing approach, then implementing mle with standard errors as an R package, then doing marginal mle. (c) I can’t remember now, but I think there’s a 3rd big thing that you’re working on now?

A

betanalpha commented 10 years ago

Andrew,

It’s not that simple. Many of the items “in the middle of the list” might be straightforward algorithmically but require lots of thought and code in their implementation. Moreover, we’re not really hiding anything. The precompiled R models are a tricky issue: not only are there only a few people who know R well enough to get deep into the internals, but even if it’s possible, that doesn’t mean it’s practical or easy to deploy to users with minimal difficulty. R sucks, Windows sucks, and until we stop supporting them these things are going to be slow and deliberate.

andrewgelman commented 10 years ago

Mike: I’m not saying these things are trivial, I just want to distinguish how much research would be required. Black box VB would be great but we don’t have a real algorithm for it. Regarding the precompiled R models: No no no. Sebastian just sent us 10 lines of code (or however many lines it was) that allow a user to precompile Stan models in R. It could be set up as an R function and any user could use it. A

betanalpha commented 10 years ago

> I’m not saying these things are trivial, I just want to distinguish how much research would be required. Black box VB would be great but we don’t have a real algorithm for it.

It’s research versus implementation/design/architecture. One is not faster than the other.

> Regarding the precompiled R models: No no no. Sebastian just sent us 10 lines of code (or however many lines it was) that allow a user to precompile Stan models in R. It could be set up as an R function and any user could use it.

A user who has already installed RStan and is happy with R. The point is that (a) this will not lead to something as easy as glmer et al. because we still have all of the issues of trying to install Stan, (b) it would be limited to R and we’d have to start offering algorithms in RStan and nowhere else, and (c) we do not have the kind of R expertise on the dev team to whip these things out quickly even if they are out there and ready to be exploited.

In other words, a solution has to exist, be found, and then be scalable.

andrewgelman commented 10 years ago

On Feb 25, 2014, at 10:42 PM, Michael Betancourt notifications@github.com wrote:

>> I’m not saying these things are trivial, I just want to distinguish how much research would be required. Black box VB would be great but we don’t have a real algorithm for it.

> It’s research versus implementation/design/architecture. One is not faster than the other.

No. It’s (1) research + (2) implementation/design/architecture vs. (2) implementation/design/architecture. (1) + (2) takes longer than (2) alone!

>> Regarding the precompiled R models: No no no. Sebastian just sent us 10 lines of code (or however many lines it was) that allow a user to precompile Stan models in R. It could be set up as an R function and any user could use it.

> A user who has already installed RStan and is happy with R. The point is that (a) this will not lead to something as easy as glmer et al. because we still have all of the issues of trying to install Stan, (b) it would be limited to R and we’d have to start offering algorithms in RStan and nowhere else, and (c) we do not have the kind of R expertise on the dev team to whip these things out quickly even if they are out there and ready to be exploited.

I’m not saying that Sebastian’s code solves the difficulties of Stan install. I’m saying that Sebastian’s code will be useful to us in our lm_stan, glm_stan, etc., implementations.

bob-carpenter commented 10 years ago

On Feb 25, 2014, at 9:42 PM, Michael Betancourt notifications@github.com wrote:

>> Regarding the precompiled R models: No no no. Sebastian just sent us 10 lines of code (or however many lines it was) that allow a user to precompile Stan models in R. It could be set up as an R function and any user could use it.

> A user who has already installed RStan and is happy with R. The point is that (a) this will not lead to something as easy as glmer et al. because we still have all of the issues of trying to install Stan, (b) it would be limited to R and we’d have to start offering algorithms in RStan and nowhere else, and (c) we do not have the kind of R expertise on the dev team to whip these things out quickly even if they are out there and ready to be exploited.

I think (a) is a separate issue. But yes, installing with R is still a pain, at least until we can get Ben's script under control and working across platforms and compilers in R. I would dearly love it if someone took over that task and made it available in our doc so that our users could get at it. Every time I try to do something in R, it fails for reasons I don't understand and can't debug.

As for (b), the solution is to an R-specific problem! Of course we can reuse compiled models in CmdStan; they're just executables! No idea about PyStan, which leads me to...

As to (c), the only people who know how to do anything at all complicated in R are Ben and Jiqiang. And as far as I know, Allen's still the only one with any Python expertise.

bob-carpenter commented 10 years ago

On Feb 25, 2014, at 9:54 PM, Andrew Gelman notifications@github.com wrote:

> I’m not saying that Sebastian’s code solves the difficulties of Stan install. I’m saying that Sebastian’s code will be useful to us in our lm_stan, glm_stan, etc., implementations.

My understanding is that lm_stan and glm_stan will be R-specific packages that will depend on rstan and not something we'll need to also implement in CmdStan or PyStan.

andrewgelman commented 10 years ago

I don’t know about CmdStan because I’m not quite sure who the audience is for these, but I could well imagine this stuff would ultimately be in PyStan. One of my goals is for people to phase out model-specific fitting algorithms (for example, least squares for linear regression, or iteratively weighted least squares for logistic regression) and instead to do more and more in the general Stan engine. I think this ultimately should make statistical inference “lighter” and less messy for users. Instead of having to switch from one program to another as the model or inferential goals change, the user just keeps running stan. This is one reason I think mle_stan will be such a good thing. From this perspective, lm_stan and glm_stan are important because their existence will mean that users no longer need to rely on any specialized lm or glm software. Of course they could do it all using mle_stan and writing their own models, but I don’t think it’s unreasonable to have some aliases for common examples.

In any case, we can do it all in R to start with and then put it into Stan more generally later.

A
