glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License

[MRG] Add Negative Binomial distribution (see issue #163). #274

Closed geektoni closed 4 years ago

geektoni commented 5 years ago

closes #163

The cdfast implementation is still missing.

jasmainak commented 5 years ago

@geektoni can you also update the cheat sheet using what you had shared in your pdf?

Looks like a great start! Let us know when it's ready :)

geektoni commented 5 years ago

Hello there. Sorry for the long delay. I have added the equation used for the Negative Binomial in the cheat sheet. I have also added the cdfast implementation.

However, I am encountering some errors involving inf or nan values (mostly caused by this guy here). I will check in the next few days what is happening.

jasmainak commented 5 years ago

okay sounds good. Keep us posted !

geektoni commented 5 years ago

Okay. The PR now contains both cdfast and batch-gradient methods for the Negative Binomial. From the tests, batch-gradient seems to be working, while the cdfast still has some issues related to nan values which I do not know exactly how to solve :(

The warning messages are the following:

Test glmnet. ... /home/uriel/Github/pyglmnet/pyglmnet/pyglmnet.py:336: RuntimeWarning: overflow encountered in divide
  partial_beta_0_2 = grad_mu**2 * (y/mu**2 - (r+y)/(r+mu)**2)
/home/uriel/Github/pyglmnet/pyglmnet/pyglmnet.py:687: RuntimeWarning: invalid value encountered in multiply
  update = 1. / hk * gk
/home/uriel/Github/pyglmnet/pyglmnet/pyglmnet.py:604: RuntimeWarning: invalid value encountered in greater
  (np.abs(beta) > thresh)

This happens because some values of mu can be very close to zero. By squaring mu, we get even smaller values, which are treated as 0. That operation corresponds to the y/mu**2 - (r+y)/(r+mu)**2 part of the formula in the cheat sheet: [screenshot of the cheat sheet formula]

Any suggestions about this?
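
For reference, a minimal standalone repro of that warning (the values are made up for illustration):

import numpy as np

y, r = 3.0, 15.0
mu = np.array([1e-160])   # a conditional mean very close to zero
# mu**2 underflows to a subnormal ~1e-320, so y / mu**2 overflows to inf:
# RuntimeWarning: overflow encountered in divide
term = y / mu**2 - (r + y) / (r + mu)**2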

jasmainak commented 5 years ago

@geektoni it is quite possible you might have to abandon the clean chain-rule approach sometimes. It's what we do for the binomial distribution. If you open the brackets in your equation, I think there might be some cancellation that could help avoid this blow-up. I can't say for sure, but it sounds similar to what we did for the binomial case -- opening the brackets and doing a little bit of algebra. Let us know if you're still stuck. I'll try to take out a pen and paper and see if I can help work it out.
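
To illustrate the kind of cancellation being suggested (a sketch, assuming the softplus link mu = log(1 + exp(z)) implied by the grad_mu/hess_mu expressions in this PR): as z -> -inf, both expit(z) and mu behave like exp(z), so their ratio stays O(1) even where mu**2 underflows. Folding grad_mu into the ratio before squaring avoids the blow-up:

import numpy as np
from scipy.special import expit

y, r = 3.0, 15.0
z = np.array([-400.0, -50.0, 0.0, 5.0])
mu = np.logaddexp(0, z)      # softplus(z), computed stably via log1p
grad_mu = expit(z)

# naive chain rule: mu**2 underflows at z = -400, giving inf/nan
naive = grad_mu**2 * (y / mu**2 - (r + y) / (r + mu)**2)

# rearranged: grad_mu / mu -> 1 as z -> -inf, so this stays finite
ratio = grad_mu / mu
stable = y * ratio**2 - grad_mu**2 * (r + y) / (r + mu)**2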

geektoni commented 5 years ago

Thanks @jasmainak. I'll try to see if I can come up with something. :)

I have another question, though. At the moment, the implementation uses a fixed shape parameter r which is set to a predefined value. I would like to let the user choose that parameter, but I am unsure about the cleanest way to do it. The most naive approach would be to add a new argument to the several methods involved, but I do not quite like that. What do you think?

jasmainak commented 5 years ago

I think a clean solution would be nice, but it's probably out of the scope of this pull request. If you are up for it, I think what needs to be done is to take a statsmodels-type approach: each distribution would be a class implementing the log likelihood and its gradients, and an object of this class could then be passed into the GLM init. This would be maximally flexible and clean, but would require some work. It would also make it easier for new contributors to add new distributions ...
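
A minimal sketch of that kind of design (all names here are hypothetical, not existing pyglmnet API):

class BaseDistribution:
    """One class per distribution, exposing the pieces the solvers need."""
    def log_likelihood(self, y, mu):
        raise NotImplementedError
    def grad_log_likelihood(self, y, mu):
        raise NotImplementedError

class NegBinomial(BaseDistribution):
    def __init__(self, theta=1.0):
        self.theta = theta   # the shape parameter lives with the distribution
    # ... log_likelihood / gradients implemented here ...

# GLM would then accept an instance instead of a string:
# glm = GLM(distr=NegBinomial(theta=5.0))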

geektoni commented 5 years ago

Mmh yeah, I guess it is a bit too much for now. Maybe after this PR goes in I’ll try to refactor the various classes.


geektoni commented 5 years ago

At the moment, no I’m not. I usually correct the code style towards the end of the implementation phase :)

On Sat, 23 Feb 2019 at 07:06, Mainak Jas notifications@github.com wrote:

@jasmainak commented on this pull request.

In pyglmnet/pyglmnet.py https://github.com/glm-tools/pyglmnet/pull/274#discussion_r259567781:

@@ -315,6 +325,19 @@ def _gradhess_logloss_1d(distr, xk, y, z, eta):
     hk = np.sum((y * _probit_g5(z, pdfz, cdfz) + (1 - y) * _probit_g6(z, pdfz, cdfz)) * (xk * xk))

+    elif distr == 'neg-binomial':
+        r = 15.
+        mu = _mu(distr, z, eta)
+        grad_mu = _grad_mu(distr, z, eta)
+        hess_mu = np.exp(-z)*expit(z)**2
+        partial_beta_0_1 = hess_mu*((r+y)/(r+mu)-(y/mu))

are you using a pep8 linter in your editor? It should complain about this line because you have missing whitespace around operators ...


jasmainak commented 5 years ago

@geektoni did you check the CIs? There is a problem with importing the function from scipy

geektoni commented 5 years ago

@geektoni did you check the CIs? There is a problem with importing the function from scipy

Ah yes, that is caused by the fact that I'm directly using the new loggamma function from scipy, which apparently only appeared in later versions of the library. I'll change it by calling log and gamma separately.
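
For what it's worth, a version-proof alternative (a sketch: scipy.special.gammaln predates loggamma, returns the same values for positive real input, and avoids the overflow of computing gamma first and then taking the log):

try:
    from scipy.special import loggamma               # only in newer scipy releases
except ImportError:
    from scipy.special import gammaln as loggamma    # long-standing equivalent on the positive reals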

jasmainak commented 4 years ago

@geektoni I am about to make a release of pyglmnet. I presume you haven't had time to get back to this? Anything I can do to help?

geektoni commented 4 years ago

@jasmainak indeed, many other things got in the way. As far as I remember, the code for the Negative Binomial is there, but it still produces nan because of how the gradients are computed. I will try to get back to this in the next few days to check what is left.

jasmainak commented 4 years ago

alright, thanks!

pavanramkumar commented 4 years ago

@geektoni thanks for your really nice contribution! i think the cdfast nan issue is likely to be solved once #348 is merged.

in the mean time, besides inline comments and pep8 checks as @jasmainak already pointed out, a few more things to address before merging:

jasmainak commented 4 years ago

@jasmainak can we automatically populate this in various doc pages from ALLOWED_DISTRS?

I don't think it's worth the pain to figure this out. I'd suggest doing this manually until there is a real need.

jasmainak commented 4 years ago

Also we're missing tests, I think that is a must before merging this PR

pavanramkumar commented 4 years ago

Also we're missing tests, I think that is a must before merging this PR

@jasmainak since we parameterized tests by distribution, all distributions in ALLOWED_DISTRS will be simulated and tested by default in test_glmnet. we should of course add any neg-binomial specific tests if appropriate
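
For readers following along, the parameterization pattern looks roughly like this (a sketch of the idea, not the actual test file; the import location of ALLOWED_DISTRS is assumed):

import numpy as np
import pytest
from pyglmnet import GLM, simulate_glm
from pyglmnet.pyglmnet import ALLOWED_DISTRS  # assumed location of the constant

@pytest.mark.parametrize("distr", ALLOWED_DISTRS)
def test_glmnet_sketch(distr):
    # every distribution in ALLOWED_DISTRS gets simulated and fit once
    n_samples, n_features = 500, 10
    beta = np.random.random(n_features)
    X = np.random.normal(0.0, 1.0, [n_samples, n_features])
    y = simulate_glm(distr, 1.0, beta, X)
    GLM(distr=distr).fit(X, y)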

geektoni commented 4 years ago

Okay, so it seems that with #348 I don't have nans anymore in this PR :tada:

pavanramkumar commented 4 years ago

Okay, so it seems that with #348 I don't have nans anymore in this PR 🎉

great! were you able to verify this locally by fetching the PR #348 ?

geektoni commented 4 years ago

Okay, so it seems that with #348 I don't have nans anymore in this PR :tada:

great! were you able to verify this locally by fetching the PR #348 ?

Yes I was. I just need to fix the gradients to be consistent with numpy and it should be okay.

jasmainak commented 4 years ago

Let us know when you're ready, @geektoni. If you rebase this branch onto the latest master, you should benefit from #348.

pavanramkumar commented 4 years ago

@geektoni just checking in to see if you had a chance to make progress here. it's a significant contribution and close enough to being merged once numpy consistency and a few other minor issues are addressed. let us know if you need help!

geektoni commented 4 years ago

I've updated the loss/gradients to be consistent with numpy. Now, theta represents the number of "success" cases. Since theta will be fixed, I am still wondering what a reasonable "default" value for it could be. Tests are still failing because the results do not reach the required precision.
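
For context on choosing a default: theta controls overdispersion. In the standard NB2 parameterization the variance relates to the mean as

\mathrm{Var}(y) = \mu + \frac{\mu^2}{\theta}

so theta -> infinity recovers the Poisson case, while small theta means heavy overdispersion. There is no single natural default, which is why R's glm.nb estimates theta from the data.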

codecov-io commented 4 years ago

Codecov Report

Merging #274 into master will decrease coverage by 0.09%. The diff coverage is 66.66%.


@@            Coverage Diff             @@
##           master     #274      +/-   ##
==========================================
- Coverage   75.66%   75.56%   -0.10%     
==========================================
  Files           4        4              
  Lines         678      704      +26     
  Branches      149      153       +4     
==========================================
+ Hits          513      532      +19     
- Misses        128      135       +7     
  Partials       37       37              
Impacted Files Coverage Δ
pyglmnet/pyglmnet.py 80.87% <66.66%> (-0.38%) :arrow_down:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update dc89e29...a984325.

jasmainak commented 4 years ago

@geektoni it seems to me that your derivatives are all correct. But there is some issue in the simulation code perhaps? You set theta = trials / 2 when simulating but it's y.mean() when computing log likelihood. Is this what you intended?

geektoni commented 4 years ago

@geektoni it seems to me that your derivatives are all correct. But there is some issue in the simulation code perhaps? You set theta = trials / 2 when simulating but it's y.mean() when computing log likelihood. Is this what you intended?

I'm not sure, but there may be an issue on this line: https://github.com/glm-tools/pyglmnet/blob/af2d1ed47dfda6749b18e624b7be1f997e307d50/tests/test_pyglmnet.py#L189

I thought that the GLM was being tested using simulated data drawn from the distributions. However, simulate_glm is called with sample=False. Therefore, we are not sampling from the distributions but returning the conditional intensity function instead. See below: https://github.com/glm-tools/pyglmnet/blob/af2d1ed47dfda6749b18e624b7be1f997e307d50/pyglmnet/pyglmnet.py#L397

Is this the intended behaviour or am I missing some details? :/
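
To make the distinction concrete (based on the simulate_glm behaviour described above):

import numpy as np
from pyglmnet import simulate_glm

n_samples, n_features = 100, 5
beta0, beta = 1.0, np.random.random(n_features)
X = np.random.normal(0.0, 1.0, [n_samples, n_features])

# sample=False returns the conditional intensity (the noiseless mean function);
# sample=True actually draws observations from the distribution
mu = simulate_glm('neg-binomial', beta0, beta, X, sample=False)
y = simulate_glm('neg-binomial', beta0, beta, X, sample=True)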

titipata commented 4 years ago

@geektoni just chatted with @jasmainak; I will try to work on this PR over the weekend. Incoming.

geektoni commented 4 years ago

@titipata sure! No problem. I'm a little busy these weeks but I'll try to help if you have questions!

jasmainak commented 4 years ago

@titipata if you can find a nice dataset / example to demo on, that would be great. There are some I found from the R community last time I searched.

titipata commented 4 years ago

@jasmainak @geektoni just have a quick question here. How do I edit this PR? By cloning geektoni's fork? Do I need access to geektoni's fork repo?

jasmainak commented 4 years ago

you have admin access, so you should be able to push directly to geektoni's branches, which are PRs on pyglmnet:

$ git remote add geektoni git@github.com:geektoni/negative-binomial
$ git fetch geektoni negative-binomial:negative-binomial
$ git checkout negative-binomial

once you are done working on the branch:

$ git push geektoni negative-binomial

titipata commented 4 years ago

@jasmainak @geektoni, I tried an example from here: https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/

##########################################################
#
# GLM with negative binomial distribution
#
# This is an example of a GLM with the negative binomial distribution.
# In this example, we will be using an example from the R community below:
# https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
#
# Here, we would like to predict the number of days absent for high school
# juniors at two schools from the type of program they are enrolled in and
# their math score.

##########################################################
# Import relevant libraries

import pandas as pd
from pyglmnet import GLM

import matplotlib.pyplot as plt

##########################################################
# Read and preprocess data

df = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
print(df.head())

# histogram of days absent by type of program enrolled
df.hist(column='daysabs', by=['prog'])
plt.show()

# print mean and standard deviation of days absent for each program
print(df.groupby('prog').agg({'daysabs': ['mean', 'std']}))

##########################################################
# Features

X = df.drop('daysabs', axis=1)
y = df['daysabs'].values

# design matrix
program_df = pd.get_dummies(df.prog)
Xdsgn = pd.concat((df['math'], program_df.drop(3.0, axis=1)), axis=1).values

##########################################################
# Predict using GLM

glm_neg_bino = GLM(distr='neg-binomial',
                   alpha=0.0,
                   score_metric='pseudo_R2')
glm_neg_bino.fit(Xdsgn, y)
print(glm_neg_bino.beta0_, glm_neg_bino.beta_)
print(glm_neg_bino.score(Xdsgn, y))
Output

8.301077895169326 [2.23539901 0.         0.        ]
nan

Somehow the score I got is not quite right.

titipata commented 4 years ago

But with the following example, it works kinda fine:

import numpy as np
from pyglmnet import simulate_glm, GLM
import matplotlib.pyplot as plt

n_samples = 1000
n_features = 100
beta = np.random.random(n_features)
Xtrain = np.random.normal(0.0, 1.0, [n_samples, n_features])
ytrain = simulate_glm('neg-binomial', 10., beta, Xtrain)

glm = GLM(distr='neg-binomial', alpha=0)
glm.fit(Xtrain, ytrain)

plt.plot(np.arange(n_features), beta)
plt.plot(np.arange(n_features), glm.beta_)
print(glm.score(Xtrain, ytrain))
Output

-3939.3125900123405

We might have to check the score function. Also, we should check whether the fit is aligned with the R library.

titipata commented 4 years ago

We can also use the data from https://github.com/atahk/pscl/tree/master/vignettes, which is from https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf. Let me know if anyone has any preferences.

jasmainak commented 4 years ago

can you try to exactly reproduce what they have done -- use the same features etc.? Since we have compared the rest of our results against scikit-learn, this should also work if the math is correct.

titipata commented 4 years ago

Sure, will try that tomorrow.

Also more resources:

titipata commented 4 years ago

@jasmainak so I edited and tried the example above (please check). I don't get the same result as in R. I put the result from R below:

## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.61527    0.19746   13.24  < 2e-16 ***
## math           -0.00599    0.00251   -2.39    0.017 *  
## progAcademic   -0.44076    0.18261   -2.41    0.016 *  
## progVocational -1.27865    0.20072   -6.37  1.9e-10 ***

math should have a very low coefficient and prog should have more effect.

jasmainak commented 4 years ago

any idea why?

could you share exactly what you did so @geektoni can pitch in?

geektoni commented 4 years ago

any idea why?

could you share exactly what you did so @geektoni can pitch in?

@titipata if you could attach here the R code and the pyglmnet code you used to generate those results, it would be perfect! I guess the dataset is this one here: https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/

titipata commented 4 years ago

@geektoni @jasmainak the code for the GLM is at https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-596160343. Not sure why it fails, since I didn't check the code and math yet.

For the R code, I just followed the code at https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/.

require(foreign)
require(ggplot2)
require(MASS)

dat <- read.dta("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
    prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
    id <- factor(id)
})

ggplot(dat, aes(daysabs, fill = prog)) + geom_histogram(binwidth = 1) + facet_grid(prog ~ 
    ., margins = TRUE, scales = "free")

with(dat, tapply(daysabs, prog, function(x) {
    sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))

summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))

Output

glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1547  -1.0192  -0.3694   0.2285   2.5273  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     2.615265   0.197460  13.245  < 2e-16
math           -0.005993   0.002505  -2.392   0.0167
progAcademic   -0.440760   0.182610  -2.414   0.0158
progVocational -1.278651   0.200720  -6.370 1.89e-10

If correct, GLM('neg-binomial') is supposed to give 2.615265 and [-0.005993, -0.440760, -1.278651] as output (for glm_neg_bino.beta0_ and glm_neg_bino.beta_, respectively).
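
One detail worth checking when comparing (an editorial aside, not verified in this thread): R's glm.nb uses the first factor level, General = 1.0, as the reference category, while the Python snippet above drops level 3.0. A design matrix matching R's coding would drop level 1.0 instead:

# drop the same reference level as R (prog == 1.0, "General"), so the two
# remaining dummy columns line up with progAcademic and progVocational
program_df = pd.get_dummies(df.prog)
Xdsgn = pd.concat((df['math'], program_df.drop(1.0, axis=1)), axis=1).values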

geektoni commented 4 years ago

I will get back to this in the next few days.

jasmainak commented 4 years ago

Thanks a lot @geektoni !

geektoni commented 4 years ago

I've updated the derivations of the Negative Binomial because there were some small mistakes. I also tried to run the code of @titipata again (see comment https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-596160343) and I got the following results, which still seem to be quite off.

5.630471875761383 [10.59792779  0.04525441  0.12164745]
-1.1161086627087808

This could be caused by the fact that we assume the parameter theta (number of successes) is given a priori and is not estimated from the data.

I've also tried some code along the lines of https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-596160552, but the results do not seem quite right.

My guess is that the derivations still have some issues (either in how they are translated into code or in my manual derivation).

geektoni commented 4 years ago

@jasmainak @titipata Okay, I've done some more testing and it actually seems that the negative binomial works quite well now! :tada:

I used the code below and trained the model on the simulate_glm data.

import numpy as np
from pyglmnet import simulate_glm, GLM
import matplotlib.pyplot as plt

np.random.seed(42)

n_samples = 1000
n_features = 5
beta = np.random.random(n_features)
Xtrain = np.random.normal(0.0, 1.0, [n_samples, n_features])
ytrain = simulate_glm('neg-binomial', 1.0, beta, Xtrain, fit_intercept=True)

# Train the model
glm = GLM(
    distr='neg-binomial',
    alpha=0.01,
    score_metric='pseudo_R2',
    theta=5,
    max_iter=1000,
    fit_intercept=True)
glm.fit(Xtrain, ytrain)

# Print the betas
print(beta, 1.0)
print(glm.beta_, glm.beta0_)

# Generate test data (use Xtest and the same intercept, 1.0, as in training)
Xtest = np.random.normal(0.0, 1.0, [n_samples, n_features])
ytest = simulate_glm('neg-binomial', 1.0, beta, Xtest, fit_intercept=True)

print(glm.score(Xtest, ytest))

# Print the plot
fig, axs = plt.subplots(1)
axs.hist([glm.predict(Xtrain).flatten(), ytrain], label=["predict", "true"])
axs.legend()

plt.show()

Here is the terminal output of the script (plus the data distribution).

[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864] 1.0
[0.23202493 0.63894841 0.49310422 0.38085685 0.0881567 ] 0.9942570869860101
-0.018872965358065263

[Figure: histogram comparing glm.predict(Xtrain) with ytrain]

titipata commented 4 years ago

@geektoni awesome! I tested the given example and it works for me! However, I still somehow get a different outcome on the example in my comment. I'm not sure why, but the derivation looks good to me. Maybe this is a problem with the optimizer?

geektoni commented 4 years ago

@geektoni awesome! I tested the given example and it works for me! However, I still somehow get a different outcome on the example in my comment. I'm not sure why, but the derivation looks good to me. Maybe this is a problem with the optimizer?

I think it could depend on which value of theta you chose to train the model. Here theta is assumed to be known and is not learned (theta represents the "number of successes").
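
One hedged suggestion for the comparison: R's glm.nb estimated theta itself (its output above shows init.theta = 1.032713156), so fixing theta near that value should be the closest apples-to-apples setting:

# match R's estimated dispersion rather than an arbitrary fixed value
glm_neg_bino = GLM(distr='neg-binomial', alpha=0.0, theta=1.032713156)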

titipata commented 4 years ago

@geektoni I tried multiple values of theta and compared y_predict with y on the given example. I still had no luck getting it right. Can you try to see if it works somehow?

geektoni commented 4 years ago

@geektoni I tried multiple values of theta and compared y_predict with y on the given example. I still had no luck getting it right. Can you try to see if it works somehow?

Over the past days, I tried to obtain the same results as the example you posted a while ago. However, I am still not able to reproduce them :( I am puzzled.

Maybe @jasmainak @pavanramkumar could have better suggestions about it?

jasmainak commented 4 years ago

I am sorry for my long silence. I was overwhelmed by grant writing business.

My approach would be to first check whether the R code gives comparable results for other distributions. Was this ever checked, @pavanramkumar? If it does, then we can consider the comparison for this distribution valid.

Please feel free to add me on google hangouts @geektoni . Sometimes it's easier to chat privately and try to converge.