stanfordmlgroup / ngboost

Natural Gradient Boosting for Probabilistic Prediction
Apache License 2.0

documentation #60

Closed: alejandroschuler closed this issue 4 years ago

alejandroschuler commented 4 years ago
anpatton commented 4 years ago

A simple vignette on best ways to access/plot the resulting distributions would be great. Even just a couple lines at the end of the regression example.

kmedved commented 4 years ago

I'd also be very interested in this. I've been excitedly sharing this package with colleagues and friends, and have run into a number of functionality questions from them that I can't quite address.

The big questions are how to access the resulting distributions and which distributions are available. Thanks for the great work.

alejandroschuler commented 4 years ago

Thanks @anpatton and @kmedved for the great feedback!

Right now the distributions that are implemented are what you see here. Most of those are for some experiments we're running to extend NGBoost to right-censored survival data, so the ones to pay attention to are a) Normal (you can also specify a fixed variance of 1 with NormalFixedVar) and b) k_categorical(k) (Bernoulli is just categorical with 2 classes).

Normal is used for regression problems and has two parameters: loc (the mean) and scale (the standard deviation). If you want to get the fitted parameters for any given observation (row) i in your data, you can do it like this (following from the regression example):

i = 0
Y_dists = ngb.pred_dist(X_test)
Y_dists.loc[i], Y_dists.scale[i]
>>> (18.792273837625157, 1.3536045314080163)

With this you should be able to make a plot of the "bell curve" with mean ~18.79 and s.d. ~1.35 using whatever plotting facilities you like. Note you will have one such bell curve per observation/prediction.
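For example, a quick density plot for one observation might look something like this (a sketch using scipy and matplotlib, assuming Y_dists was produced by ngb.pred_dist(X_test) as in the regression example):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

i = 0
loc, scale = Y_dists.loc[i], Y_dists.scale[i]
xs = np.linspace(loc - 4 * scale, loc + 4 * scale, 200)  # cover +/- 4 s.d. around the mean
plt.plot(xs, norm.pdf(xs, loc=loc, scale=scale))
plt.xlabel("y")
plt.ylabel("predicted density for observation i")
plt.show()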

The categorical distribution (via the class factory k_categorical(k)) is used for k-class classification, and has k parameters: the probability of the observation falling in each class. To get those (following from the classification example):

i = 0
Y_dists = ngb.pred_dist(X_test)
Y_dists.probs.T[i]
>>> array([2.29693668e-03, 9.97554771e-01, 1.48292015e-04])

Similarly, with these you can make a probability mass plot however you like.
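For example (a sketch with matplotlib, using Y_dists and the class probabilities from the classification example above):

import matplotlib.pyplot as plt

i = 0
probs = Y_dists.probs.T[i]  # predicted probability of each class for observation i
plt.bar(range(len(probs)), probs)
plt.xlabel("class")
plt.ylabel("predicted probability")
plt.show()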

If either of you wants to play around with this and either contribute a vignette or expand the existing examples, I would be happy to accept a pull request!

Implementing other distributions requires a bit of math and some numpy bean-counting. You need to define each scoring rule you want to use with that distribution (e.g. log-likelihood, CRPS), its gradient with respect to each parameter, and the Riemannian metric of each scoring rule (e.g. the Fisher information for the log-likelihood). If there is a particular distribution that you want implemented, please let me know and I will prioritize it.
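To make those ingredients concrete, here is a standalone sketch for the Normal distribution parametrized as (loc, log scale); it only illustrates the math and is not ngboost's internal class structure:

import numpy as np

def nll(y, loc, log_scale):
    # scoring rule: negative log-likelihood of y under Normal(loc, exp(log_scale)**2)
    scale = np.exp(log_scale)
    return 0.5 * ((y - loc) / scale) ** 2 + log_scale + 0.5 * np.log(2 * np.pi)

def grad_nll(y, loc, log_scale):
    # gradient of the NLL with respect to (loc, log_scale)
    scale = np.exp(log_scale)
    return np.array([-(y - loc) / scale ** 2,
                     1.0 - ((y - loc) / scale) ** 2])

def fisher_info(loc, log_scale):
    # Riemannian metric for the log score: the Fisher information in (loc, log_scale) coordinates
    scale = np.exp(log_scale)
    return np.array([[1.0 / scale ** 2, 0.0],
                     [0.0, 2.0]])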

anpatton commented 4 years ago

Thanks @alejandroschuler -- this is exactly what I was looking for.

kmedved commented 4 years ago

Thanks @alejandroschuler; that's very helpful. I will try to put together a vignette based on my work with those examples and contribute it as a pull request.

With respect to other distributions, two come to mind, one hopefully easy, the other probably harder (or impossible):

1) Skew-T; the closest Scipy implementation appears to be the skew-normal: https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.skewnorm.html

2) The discrete Burr Type XII distribution. Unfortunately, this one does not have a Scipy implementation (only the continuous version is available), and generally I am unsure whether ngboost is compatible with discrete distributions in the first place.

Thanks again for the help, and the package generally!

alejandroschuler commented 4 years ago

@kmedved that would be much appreciated!

I did a little work just now to add some convenience methods to the normal and categorical distributions. For instance, to get the parameters for the estimated normal distributions of the first 5 test examples, you can now do:

Y_dists[0:5].params
>>> {'loc': array([20.16107254, 15.34757552,  9.66237628, 35.83424749, 36.05975884]),
 'scale': array([2.09291348, 2.0061713 , 2.60420611, 2.18444217, 2.20180082])}

or with the categorical distribution:

Y_dists[0:5].params
>>> {'p0': array([0.00224505, 0.98876268, 0.99668823, 0.98542018, 0.99777659]),
 'p1': array([0.99775495, 0.01123732, 0.00331177, 0.01457982, 0.00222341])}

I think using those methods makes more sense and is more consistent than the approach I laid out earlier, so if you want to use these in your vignette that would be great.

@anpatton this is probably of interest to you too.

alejandroschuler commented 4 years ago

@kmedved as for the distributions, could you point me to a source that clearly lays out the density (or mass) function and enumerates the parameters of each distribution? It would also be great to have sources with derivations of the gradient of the log likelihood (sometimes called the "score function") and the Fisher information matrix for each distribution.

Discrete distributions are not a problem for ngboost, as evidenced by our support of the categorical and Bernoulli distributions. As long as the parameters of the distribution can be varied continuously, or there is some parametrization in which they can be, then ngboost will work.

cooberp commented 4 years ago

Hi @alejandroschuler, my name is Dan Rosenheck, and I am the editor of the data-journalism team at The Economist magazine. One of the biggest difficulties we've had in recent years is that we do a lot of probabilistic forecasting using Monte Carlo simulations, and most off-the-shelf machine-learning tools for continuous variables tend only to provide point predictions, not distributions. In the past, we've dealt with this in a very clunky fashion: we use a machine-learning model to produce a point estimate, and then use the resulting prediction as a feature in a separate (non-regularized, non-cross-validated, worryingly overfit) regression to produce the parameters of a distribution. So I was extremely excited when @kmedved told me about the ngboost package, which sounded like the exact one-stop shop we've been yearning for. I'm the one who's been nudging him to ask you for help with adding new distributions.

As @kmedved alluded to above, there are two distributions we use frequently at The Economist. One is the skew-T, as defined by Azzalini in the "sn" package (http://azzalini.stat.unipd.it/SN). There is apparently a set of Python routines for it at http://azzalini.stat.unipd.it/SN/skew_normal-py.zip (we work in R, not Python), though I don't know whether they include the 4-parameter skew-T as well as the 3-parameter skew-normal. If you have functionality for the standard Student's T distribution (with a mean of 0), then calculating the probability-density function of the skew-T is a cinch. It has four parameters: xi, omega, alpha, and nu (omega and nu must always be positive). Below you can see an R implementation of the probability-density function of the skew-T distribution. dt(x, df = y) is the R function that gives the probability density of the Student's T distribution with y degrees of freedom at value x, and pt(x, df = y) is the R function that gives the cumulative distribution of the Student's T distribution with y degrees of freedom at value x.

dst = function(x, xi, omega, alpha, nu) {
  z = (x - xi)/omega
  pdf = dt(z, df = nu)
  cdf = pt(alpha * z * sqrt((nu + 1)/(z^2 + nu)), df = nu + 1)
  return(2 * pdf * cdf/omega)
}
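In Python (which I gather is what you'd need), a direct translation of that density would be roughly the following sketch, relying only on scipy.stats.t for the Student's T pdf and cdf:

import numpy as np
from scipy.stats import t as student_t

def skew_t_pdf(x, xi, omega, alpha, nu):
    # Azzalini skew-t density: location xi, scale omega > 0, slant alpha, degrees of freedom nu > 0
    z = (x - xi) / omega
    pdf = student_t.pdf(z, df=nu)
    cdf = student_t.cdf(alpha * z * np.sqrt((nu + 1) / (z ** 2 + nu)), df=nu + 1)
    return 2 * pdf * cdf / omega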

The other distribution we use often is a very obscure one called the Discrete Burr Type XII. I believe it was only developed in the past few years by a guy named Para Bilal in South Asia (see http://medcraveonline.com/BBIJ/BBIJ-04-00092.pdf). It is an extremely flexible three-parameter distribution for non-negative discrete variables, and the only one I know of that can smoothly handle both under-dispersed and over-dispersed variables. It performs VASTLY better with distributions concentrated around small integers (e.g., 5% 0, 18% 1, 47% 2, 26% 3, 3% 4, 0.9% 5, 0.09% 6, etc.) than anything else I've been able to find. We currently use a slightly different parameterization than the one in Bilal's paper, one which the authors of the gamlss package in R kindly implemented at my request. It uses three parameters: mu, sigma, and nu. All of them must be positive. Its probability-mass function, as implemented in R, is simply ((1 + (x/mu)^sigma)^(-nu)) - ((1 + ((x+1)/mu)^sigma)^(-nu)).
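In Python, that probability-mass function would be roughly (a sketch of the gamlss parameterization described above, with mu, sigma, and nu all assumed positive):

def dburr12_pmf(y, mu, sigma, nu):
    # P(Y = y) = S(y) - S(y + 1), where S(t) = (1 + (t/mu)**sigma)**(-nu) is the survival function
    surv = lambda t: (1.0 + (t / mu) ** sigma) ** (-nu)
    return surv(y) - surv(y + 1)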

As far as log-likelihood gradients go, I once tried to fill out a Hessian matrix for the Discrete Burr Type XII's probability-mass function to feed into an optimization algorithm, and the calculus became completely intractable. The best guidance I can give you is the R code that one of the authors of gamlss, who knows FAR more math than I do, wrote in order to fit Discrete Burr Type XII regressions. I'm afraid I don't really know how to read it, but I am pasting it below in the hope that you can decipher it! If not, I can definitely ask the author whether he can help with the gradient and Fisher information matrix.

I really appreciate your willingness to consider adding new distributions to ngboost. It would be a gigantic step forward for our team if we could apply cutting-edge tree-based learning methods to our preferred parametric distributions. If you're able to get these distributions working, we would of course make sure to promote ngboost far and wide. Please let me know if you have any further questions. My email address is danrosenheck at economist dot com. Thanks very much!!

Dan

DBURR12 = function (mu.link = "log", sigma.link = "log", nu.link = "log") 
{
    mstats <- checklink("mu.link", "Discrete Burr XII", 
        substitute(mu.link), c("1/mu^2", "log", "identity"))
    dstats <- checklink("sigma.link", "Discrete Burr XII", 
        substitute(sigma.link), c("inverse", "log", 
            "identity"))
    vstats <- checklink("nu.link", "Discrete Burr XII", 
        substitute(nu.link), c("1/nu^2", "log", "identity"))
    structure(list(family = c("DBURR12", "Discrete Burr XII"), 
        parameters = list(mu = TRUE, sigma = TRUE, nu = TRUE), 
        nopar = 3, type = "Discrete", mu.link = as.character(substitute(mu.link)), 
        sigma.link = as.character(substitute(sigma.link)), nu.link = as.character(substitute(nu.link)), 
        mu.linkfun = mstats$linkfun, sigma.linkfun = dstats$linkfun, 
        nu.linkfun = vstats$linkfun, mu.linkinv = mstats$linkinv, 
        sigma.linkinv = dstats$linkinv, nu.linkinv = vstats$linkinv, 
        mu.dr = mstats$mu.eta, sigma.dr = dstats$mu.eta, nu.dr = vstats$mu.eta, 
        dldm = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdmu <- function(y) ifelse(y >= 1, sigma * nu * 
                ((y^sigma)/(mu^(sigma + 1))) * (1 + (y/mu)^sigma)^(-nu - 
                1), 0)
            dldm <- (dSdmu(y) - dSdmu(y + 1))/(Sur(y) - Sur(y + 
                1))
            dldm
        }, d2ldm2 = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdmu <- function(y) ifelse(y >= 1, sigma * nu * 
                ((y^sigma)/(mu^(sigma + 1))) * (1 + (y/mu)^sigma)^(-nu - 
                1), 0)
            dldm <- (dSdmu(y) - dSdmu(y + 1))/(Sur(y) - Sur(y + 
                1))
            -(dldm^2)
        }, dldd = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdsigma <- function(y) ifelse(y >= 1, -nu * log(y/mu) * 
                (y/mu)^sigma * (1 + (y/mu)^sigma)^(-nu - 1), 
                0)
            dldd <- (dSdsigma(y) - dSdsigma(y + 1))/(Sur(y) - 
                Sur(y + 1))
            dldd
        }, d2ldd2 = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdsigma <- function(y) ifelse(y >= 1, -nu * log(y/mu) * 
                (y/mu)^sigma * (1 + (y/mu)^sigma)^(-nu - 1), 
                0)
            dldd <- (dSdsigma(y) - dSdsigma(y + 1))/(Sur(y) - 
                Sur(y + 1))
            -(dldd^2)
        }, d2ldmdd = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdmu <- function(y) ifelse(y >= 1, sigma * nu * 
                ((y^sigma)/(mu^(sigma + 1))) * (1 + (y/mu)^sigma)^(-nu - 
                1), 0)
            dldm <- (dSdmu(y) - dSdmu(y + 1))/(Sur(y) - Sur(y + 
                1))
            dSdsigma <- function(y) ifelse(y >= 1, -nu * log(y/mu) * 
                (y/mu)^sigma * (1 + (y/mu)^sigma)^(-nu - 1), 
                0)
            dldd <- (dSdsigma(y) - dSdsigma(y + 1))/(Sur(y) - 
                Sur(y + 1))
            -(dldm * dldd)
        }, dldv = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdnu <- function(y) ifelse(y >= 1, -log(1 + (y/mu)^sigma) * 
                (1 + (y/mu)^sigma)^{
                  -nu
                }, 0)
            dldv <- (dSdnu(y) - dSdnu(y + 1))/(Sur(y) - Sur(y + 
                1))
            dldv
        }, d2ldv2 = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdnu <- function(y) ifelse(y >= 1, -log(1 + (y/mu)^sigma) * 
                (1 + (y/mu)^sigma)^{
                  -nu
                }, 0)
            dldv <- (dSdnu(y) - dSdnu(y + 1))/(Sur(y) - Sur(y + 
                1))
            -(dldv * dldv)
        }, d2ldmdv = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdmu <- function(y) ifelse(y >= 1, sigma * nu * 
                ((y^sigma)/(mu^(sigma + 1))) * (1 + (y/mu)^sigma)^(-nu - 
                1), 0)
            dldm <- (dSdmu(y) - dSdmu(y + 1))/(Sur(y) - Sur(y + 
                1))
            dSdnu <- function(y) ifelse(y >= 1, -log(1 + (y/mu)^sigma) * 
                (1 + (y/mu)^sigma)^{
                  -nu
                }, 0)
            dldv <- (dSdnu(y) - dSdnu(y + 1))/(Sur(y) - Sur(y + 
                1))
            -(dldm * dldv)
        }, d2ldddv = function(y, mu, sigma, nu) {
            Sur <- function(y) (1 + (y/mu)^sigma)^(-nu)
            dSdsigma <- function(y) ifelse(y >= 1, -nu * log(y/mu) * 
                (y/mu)^sigma * (1 + (y/mu)^sigma)^(-nu - 1), 
                0)
            dldd <- (dSdsigma(y) - dSdsigma(y + 1))/(Sur(y) - 
                Sur(y + 1))
            dSdnu <- function(y) ifelse(y >= 1, -log(1 + (y/mu)^sigma) * 
                (1 + (y/mu)^sigma)^{
                  -nu
                }, 0)
            dldv <- (dSdnu(y) - dSdnu(y + 1))/(Sur(y) - Sur(y + 
                1))
            -(dldd * dldv)
        }, G.dev.incr = function(y, mu, sigma, nu, ...) -2 * 
            dDBURR12(y, mu, sigma, nu, log = TRUE), rqres = expression(rqres(pfun = "pDBURR12", 
            type = "Discrete", ymin = 0, y = y, mu = mu, 
            sigma = sigma, nu = nu)), mu.initial = expression(mu <- (y + 
            (mean(y))/2)), sigma.initial = expression(sigma <- rep(2, 
            length(y))), nu.initial = expression(nu <- rep(2, 
            length(y))), mu.valid = function(mu) all(mu > 0), 
        sigma.valid = function(sigma) all(sigma > 0), nu.valid = function(nu) all(nu > 
            0), y.valid = function(y) all(y >= 0)), class = c("gamlss.family", 
        "family"))
}

anpatton commented 4 years ago

@cooberp I'm actually an R user as well, but can confirm NGBoost works flawlessly with reticulate. I'm also using it for MC-related purposes.

alejandroschuler commented 4 years ago

@cooberp ok, let me give this a shot... Gradients will be easy, but I think deriving the Fisher info is going to be a mess for either of those. This actually brings up a good point, though, which is that there should be a "default" method for approximating the Fisher info using MC sampling on the squared gradient. We can also bring back autograd as a default for the gradient computation. All of this would make it really easy for users to define their own distributions: just provide a density function (parametrized in terms of real-valued parameters) and a way to sample from the distribution, and our code can do the rest. The cost is that the algorithm will run much more slowly than if the gradient and Fisher info were hand-coded, but if someone needs the functionality and doesn't want to do a lot of math or wait for someone else to do it, that seems like a good deal.
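As a rough illustration of what I mean (a sketch only, not how it would be wired into ngboost; the normal_logpdf and sampler below are stand-ins for the pieces a user would supply):

import autograd.numpy as anp
import numpy as np
from autograd import grad

def normal_logpdf(y, theta):
    # theta = (loc, log_scale); the log-scale keeps the parameter unconstrained
    loc, log_scale = theta[0], theta[1]
    scale = anp.exp(log_scale)
    return -0.5 * ((y - loc) / scale) ** 2 - log_scale - 0.5 * anp.log(2 * anp.pi)

def mc_fisher_info(theta, n_samples=2000, seed=0):
    # approximate E[ grad log p(Y; theta) grad log p(Y; theta)^T ] by sampling Y ~ p(.; theta)
    rng = np.random.RandomState(seed)
    ys = rng.normal(theta[0], np.exp(theta[1]), size=n_samples)
    dlogp = grad(normal_logpdf, 1)  # autograd gradient with respect to the second argument, theta
    grads = np.array([dlogp(y, theta) for y in ys])
    return grads.T @ grads / n_samples  # average outer product of the sampled gradients

For the standard normal (theta = anp.array([0.0, 0.0])) this comes out close to diag(1, 2), which matches the closed-form Fisher information in the (loc, log scale) parametrization, so the approximation is easy to sanity-check.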

cooberp commented 4 years ago

That would definitely be a good backup plan. But that does sound like it would be glacially slow if it had to be run gazillions of times to fit a model. Let me check with some math-literate friends and colleagues to see if I can get analytic/closed-form answers for the Fisher matrix.

alejandroschuler commented 4 years ago

@cooberp sure thing. Note that I'd need the Fisher Info with respect to the log-parameters for any parameters that must be positive. Using log-parameters (which can be any real number) is the trick we use in ngboost to deal with parameters that can only take positive values.
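For reference, the change of coordinates is just the usual Jacobian rescaling: if $\eta_j = \log \theta_j$ for each positive parameter, then

$$\mathcal{I}_\eta = J^\top \mathcal{I}_\theta \, J, \qquad J = \frac{\partial \theta}{\partial \eta} = \operatorname{diag}(\theta),$$

i.e. the Fisher information in the log-parameters is the original Fisher information with row $j$ and column $j$ each scaled by $\theta_j$.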

Having the slow version would also let us verify all the explicit calculations since they should give the same answer.

alejandroschuler commented 4 years ago

FYI, I implemented this approach last night with the normal distribution to test it, and it takes about half a second per boosting iteration on a dataset with a couple of hundred observations, so a couple of minutes per model. Obviously it would be slower for larger datasets, but you can also control a) the number of MC samples used to approximate the Fisher info and b) the subsampling per boosting iteration (as usual). All in all, it should be slow, but probably not glacial.

cooberp commented 4 years ago

Yes, of course it would always be with the appropriate link function (log for positive, logit for unit interval).

The primary dataset I'm working on is around 500,000 observations, and I'd be looking at maybe 50 features. On a pretty powerful machine (16 CPU cores at 3.1 GHz, 128 GB RAM; is ngboost parallelized? Sorry, I don't know where to find that in the documentation), would we be looking at a fit time measured in hours, days, weeks, or months using this approach?

alejandroschuler commented 4 years ago

Probably hours, but I'll implement it and we can find out!

kmedved commented 4 years ago

@cooberp - I don't believe ngboost is parallelized, and gradient boosting doesn't usually lend itself well to parallelization.

alejandroschuler commented 4 years ago

Hi all in this thread, I wrote up an extended vignette that goes through most of the features of NGBoost. It's on a development branch for now since it calls on some features that aren't yet implemented in master, but you can see it as the readme here: https://github.com/stanfordmlgroup/ngboost/tree/survivial-regression-integration

Please have a look if you get a chance and tell me what you think is good and what needs more clarification!

kmedved commented 4 years ago

This documentation is wonderful. Thanks @alejandroschuler.

anpatton commented 4 years ago

Agree, this is extraordinarily helpful. Much appreciated.

alejandroschuler commented 4 years ago

Great, thanks for the feedback; glad it's helpful. I also added a bunch of docstrings to the user-facing methods, so things should be for the most part documented now. All of this should get merged into master shortly, so I'm going to close this issue. But please feel free to open specific issues for any new documentation you want or anything you find lacking!