Closed geektoni closed 4 years ago
@geektoni can you also update the cheat sheet using what you had shared in your pdf?
Looks like a great start ! Let us know when it's ready :)
Hello there. Sorry for the long delay. I have added the equation used for the Negative Binomial in the cheat sheet. I have also added the cdfast
implementation.
However, I am encountering some errors regarding inf
or nan
values (mostly caused by this guy here). I will check in the next days what is happening.
okay sounds good. Keep us posted !
Okay. The PR now contains both cdfast
and batch-gradient
methods for the Negative Binomial. From the tests, batch-gradient
seems to be working, while the cdfast
still has some issues related to nan
values which I do not know exactly how to solve :(
The issue message is the following:
Test glmnet. ... /home/uriel/Github/pyglmnet/pyglmnet/pyglmnet.py:336: RuntimeWarning: overflow encountered in divide
partial_beta_0_2 = grad_mu**2 * (y/mu**2 - (r+y)/(r+mu)**2)
/home/uriel/Github/pyglmnet/pyglmnet/pyglmnet.py:687: RuntimeWarning: invalid value encountered in multiply
update = 1. / hk * gk
/home/uriel/Github/pyglmnet/pyglmnet/pyglmnet.py:604: RuntimeWarning: invalid value encountered in greater
(np.abs(beta) > thresh)
This happens because when computing mu**2
there can be values which are closer to zero. Therefore, by squaring mu
, we get even smaller values which are treated as 0. That operation corresponds to the following part of the formula in the cheatsheet:
Any suggestions about this?
@geektoni it is quite possible you might have to abandon the clean chain rule approach sometimes. It's what we do for the binomial distribution. If you open the brackets in your equation, I think there might be some cancellation which could help avoid this issue of explosion. I can't say for sure, but it sounds similar to what we did for the binomial case -- opening the brackets and doing a little bit of mathematical algebra. Let us know if you're still stuck. I'll try to take out a pen and paper and see if I can help work it out.
Thanks @jasmainak. I'll try to see if I can come up with something. :)
I have another question, though. At the moment, the implementation uses a fixed shape parameter r
which is set to a predefined value. I would like to let the user choose that parameter but I am unsure about what could be the cleanest way to do it. The most naive approach would be adding a new argument to the several methods in order to specify it, but I do not quite like it. What do you think about it?
I think a clean solution would be nice, but probably out of the scope of this pull request. If you are up for it, I think what needs to be done is to have a statsmodel type approach. Basically, each distribution will be a class implementing log likelihood and the gradients. An object of this class can then be passed into the GLM init. This will be maximally flexible and clean, but would require some work. Also, it will make it easier for new contributors to add new distributions ...
Mmh yeah, I guess it is a bit too much for now. Maybe after this PR goes in I’ll try to refactor the various classes.
On Thu, 21 Feb 2019 at 22:08, Mainak Jas notifications@github.com wrote:
I think a clean solution would be nice, but probably out of the scope of this pull request. If you are up for it, I think what needs to be done is to have a statsmodel type approach. Basically, each distribution will be a class implementing log likelihood and the gradients. An object of this class can then be passed into the GLM init. This will be maximally flexible and clean, but would require some work. Also, it will make it easier for new contributors to add new distributions ...
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-466166920, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJNKKbS5KP4olADuDAiXLUQZyb45NGAks5vPwq9gaJpZM4Z1Tnf .
At the moment, no I’m not. I usually correct the code style towards the end of the implementation phase :)
On Sat, 23 Feb 2019 at 07:06, Mainak Jas notifications@github.com wrote:
@jasmainak commented on this pull request.
In pyglmnet/pyglmnet.py https://github.com/glm-tools/pyglmnet/pull/274#discussion_r259567781:
@@ -315,6 +325,19 @@ def _gradhess_logloss_1d(distr, xk, y, z, eta): hk = np.sum((y _probit_g5(z, pdfz, cdfz) + (1 - y) _probit_g6(z, pdfz, cdfz)) (xk xk))
- elif distr == 'neg-binomial':
- r = 15.
- mu = _mu(distr, z, eta)
- grad_mu = _grad_mu(distr, z, eta)
- hess_mu = np.exp(-z)*expit(z)**2
- partial_beta_0_1 = hess_mu*((r+y)/(r+mu)-(y/mu))
are you using a pep8 linter on your editor? This line should complain because you have missing whitespace around operators ...
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/glm-tools/pyglmnet/pull/274#pullrequestreview-207097891, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJNKNSU0myafG65Rb9bc7hcFOCXabKGks5vQNpWgaJpZM4Z1Tnf .
@geektoni did you check the CIs? There is a problem with importing the function from scipy
@geektoni did you check the CIs? There is a problem with importing the function from scipy
Ah yes, that is caused by the fact that I'm using directly the new loggamma
function from scipy which apparently appeared in later versions of the library. I'll change it by calling log
and gamma
separately.
@geektoni I am about to make a release of pyglmnet, I presume you haven't had time to get back to this? Anything I can do to help?
@jasmainak indeed many other things got in the way. As far as I remember, the code for the Negative Binomial is there, but it still gets nan
because of how the gradients are computed. I will try to get back to this in the next days to check what is left.
alright, thanks!
@geektoni thanks for your really nice contribution! i think the cdfast nan issue is likely to be solved once #348 is merged.
in the mean time, besides inline comments and pep8 checks as @jasmainak already pointed out, a few more things to address before merging:
doc/index.rst
, doc/tutorial.rst
and README.rst
in appropriate sections to add negative binomial as a distribution offered by the package (@jasmainak can we automatically populate this in various doc pages from ALLOWED_DISTRS
?)whats_new.rst
@jasmainak can we automatically populate this in various doc pages from ALLOWED_DISTRS?
I don't think it's worth the pain to figure this out. I'd suggest doing this manually until there is a real need.
Also we're missing tests, I think that is a must before merging this PR
Also we're missing tests, I think that is a must before merging this PR
@jasmainak since we parameterized tests by distribution, all distributions in ALLOWED_DISTRS
will be simulated and tested by default in test_glmnet
. we should of course add any neg-binomial
specific tests if appropriate
Okay, so it seems like that with #348 I don't have nan
s anymore in this PR :tada:
Okay, so it seems like that with #348 I don't have
nan
s anymore in this PR 🎉
great! were you able to verify this locally by fetching the PR #348 ?
Okay, so it seems like that with #348 I don't have
nan
s anymore in this PR tadagreat! were you able to verify this locally by fetching the PR #348 ?
Yes I was. I just need to fix the gradients to be consistent with numpy and it should be okay.
Let us know when you're ready @geektoni . If you rebase this branch with latest master, you should benefit from #348
@geektoni just checking in to see if you had a chance to make progress here. it's a significant contribution and close enough to being merged once numpy consistency, and a few other minor issues are addressed. let us know if you need help!
I've updated the loss/gradients to be consistent with numpy. Now, theta
represents the number of "success" cases. Since theta
will be fixed, I am still wondering what could be a reasonable "default" value for it. Tests are still failing because the results do not reach the required precision.
Merging #274 into master will decrease coverage by
0.09%
. The diff coverage is66.66%
.
@@ Coverage Diff @@
## master #274 +/- ##
==========================================
- Coverage 75.66% 75.56% -0.10%
==========================================
Files 4 4
Lines 678 704 +26
Branches 149 153 +4
==========================================
+ Hits 513 532 +19
- Misses 128 135 +7
Partials 37 37
Impacted Files | Coverage Δ | |
---|---|---|
pyglmnet/pyglmnet.py | 80.87% <66.66%> (-0.38%) |
:arrow_down: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update dc89e29...a984325. Read the comment docs.
@geektoni it seems to me that your derivatives are all correct. But there is some issue in the simulation code perhaps? You set theta = trials / 2
when simulating but it's y.mean()
when computing log likelihood. Is this what you intended?
@geektoni it seems to me that your derivatives are all correct. But there is some issue in the simulation code perhaps? You set
theta = trials / 2
when simulating but it'sy.mean()
when computing log likelihood. Is this what you intended?
I'm not sure but there may be an issue on this line here? https://github.com/glm-tools/pyglmnet/blob/af2d1ed47dfda6749b18e624b7be1f997e307d50/tests/test_pyglmnet.py#L189
I thought that the GLM was being tested by using some simulated data coming from the distributions. However, the simulate_glm
command is called by passing sample=False
. Therefore, we are not sampling from the distributions anymore but we are returning the conditional intensity function instead. See below:
https://github.com/glm-tools/pyglmnet/blob/af2d1ed47dfda6749b18e624b7be1f997e307d50/pyglmnet/pyglmnet.py#L397
Is this the intended behaviour or am I missing some details? :/
@geektoni just chat with @jasmainak, I will try to work on this PR over the weekend. Incoming.
@titipata sure! No problem. I'm a little busy these weeks but I'll try to help if you have questions!
@titipata if you can find a nice dataset / example to demo on, that would be great. There are some I found from the R community last time I searched.
@jasmainak @geektoni just have a quick question here. How I edit this PR? Cloning from geektoni
fork? Do I need access to geektoni
fork repo?
you have admin access so you should be able to push directly to geektoni
's branches which are PRs on pyglmnet
$ git remote add geektoni git@github.com:geektoni/negative-binomial
$ git fetch geektoni negative-binomial:negative-binomial
$ git checkout negative-binomial
once you are done working on the branch:
$ git push geektoni negative-binomial
@jasmainak @geektoni, I tried an example from an example here: https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
##########################################################
#
# GLM with negative binomial distribution
#
# This is an example of GLM with negative binomial distribution.
# In this example, we will be using an example from R community below
# https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
#
# Here, we would like to predict the days absense of high school juniors
# at two schools from there type of program they are enrolled, and their math
# score.
##########################################################
# Import relevance libraries
import pandas as pd
from pyglmnet import GLM
import matplotlib.pyplot as plt
##########################################################
# Read and preprocess data
df = pd.read_stata("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
print(df.head())
# histogram of type of program they are enrolled
df.hist(column='daysabs', by=['prog'])
plt.show()
# print mean and standard deviation for each program enrolled
df.groupby('prog').agg({'daysabs': ['mean', 'std']})
##########################################################
# Feature
X = df.drop('daysabs', axis=1)
y = df['daysabs'].values
# design matrix
program_df = pd.get_dummies(df.prog)
Xdsgn = pd.concat((df['math'], program_df.drop(3.0, axis=1)), axis=1).values
##########################################################
# Predict using GLM
glm_neg_bino = GLM(distr='neg-binomial',
alpha=0.0,
score_metric='pseudo_R2')
glm_neg_bino.fit(Xdsgn, y)
print(glm_neg_bino.beta0_, glm_neg_bino.beta_)
print(glm_neg_bino.score(Xdsgn, y))
8.301077895169326 [2.23539901 0. 0. ]
nan
Somehow score
I got is not quite right.
But with the following example, it works kinda fine:
import numpy as np
from pyglmnet import simulate_glm, GLM
import matplotlib.pyplot as plt
n_samples = 1000
n_features = 100
beta = np.random.random(n_features)
Xtrain = np.random.normal(0.0, 1.0, [n_samples, n_features])
ytrain = simulate_glm('neg-binomial', 10., beta, Xtrain)
glm = GLM(distr='neg-binomial', alpha=0)
glm.fit(Xtrain, ytrain)
plt.plot(np.arange(n_features), beta)
plt.plot(np.arange(n_features), glm.beta_)
print(glm.score(Xtrain, ytrain))
-3939.3125900123405
We might have to check the score function? Also, check if the fit is aligned with the R library?
We can also use the data from https://github.com/atahk/pscl/tree/master/vignettes, which is from https://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf. Let me know if anyone has any preferences.
can you try to exactly reproduce what they have done -- use the same features etc? Since we have compared the rest our results against scikit-learn
, this should also work if the Math is correctly done.
Sure, will try that tomorrow.
Also more resources:
@jasmainak so I edited and tried an example above (please check). I don't get the same result as in R. I put the result from R below
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.61527 0.19746 13.24 < 2e-16 ***
## math -0.00599 0.00251 -2.39 0.017 *
## progAcademic -0.44076 0.18261 -2.41 0.016 *
## progVocational -1.27865 0.20072 -6.37 1.9e-10 ***
math
should have a very low coefficient and prog
should have more effect.
any idea why?
could you share exactly what you did so @geektoni can pitch in?
any idea why?
could you share exactly what you did so @geektoni can pitch in?
@titipata if you could attach here the R code and the pyglmnet code you used to generate those result it would be perfect! I guess the dataset is this one here https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/
@geektoni @jasmainak the code for GLM is at https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-596160343. Not sure it fails since I didn't check the code and math yet.
For the R code, I just follow the code via https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/.
require(foreign)
require(ggplot2)
require(MASS)
dat <- read.dta("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
prog <- factor(prog, levels = 1:3, labels = c("General", "Academic", "Vocational"))
id <- factor(id)
})
ggplot(dat, aes(daysabs, fill = prog)) + geom_histogram(binwidth = 1) + facet_grid(prog ~
., margins = TRUE, scales = "free")
with(dat, tapply(daysabs, prog, function(x) {
sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
summary(m1 <- glm.nb(daysabs ~ math + prog, data = dat))
Output
glm.nb(formula = daysabs ~ math + prog, data = dat, init.theta = 1.032713156,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1547 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16
math -0.005993 0.002505 -2.392 0.0167
progAcademic -0.440760 0.182610 -2.414 0.0158
progVocational -1.278651 0.200720 -6.370 1.89e-10
Correctly, GLM('neg-binomial')
supposes to give 2.615265
, [-0.005993, -0.440760, -1.278651]
as an output (for glm_neg_bino.beta0_
, glm_neg_bino.beta_
respectively).
I will get back to this in the next few days.
Thanks a lot @geektoni !
I've updated the derivations of the Negative Binomial because there were some little mistakes. I also tried to run again the code of @titipata (see comment https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-596160343) and I've got the following results, which still seems to be quite off.
5.630471875761383 [10.59792779 0.04525441 0.12164745]
-1.1161086627087808
This could be caused by the fact that we assume that the parameter theta
(number of successes) is given a priori and it is not estimated from the data.
I've also tried some code on the line of https://github.com/glm-tools/pyglmnet/pull/274#issuecomment-596160552, but the results seem not quite right.
My guess is that the derivations still have some issues (maybe when they are translated into code or when I derived them manually).
@jasmainak @titipata Okay, I've done some more testing and actually it seems that the negative binomial works quite fine now! :tada:
I've used the code below and I trained the model on the simulate_glm
data.
import numpy as np
from pyglmnet import simulate_glm, GLM
import matplotlib.pyplot as plt
np.random.seed(42)
n_samples = 1000
n_features = 5
beta = np.random.random(n_features)
Xtrain = np.random.normal(0.0, 1.0, [n_samples, n_features])
ytrain = simulate_glm('neg-binomial', 1.0, beta, Xtrain, fit_intercept=True)
# Train the model
glm = GLM(
distr='neg-binomial',
alpha=0.01,
score_metric='pseudo_R2',
theta=5,
max_iter=1000,
fit_intercept=True)
glm.fit(Xtrain, ytrain)
# Print the betas
print(beta, 1.0)
print(glm.beta_, glm.beta0_)
# Generate test data
Xtest = np.random.normal(0.0, 1.0, [n_samples, n_features])
ytest = simulate_glm('neg-binomial', 0.0, beta, Xtrain, fit_intercept=True)
print(glm.score(Xtest, ytest))
# Print the plot
fig, axs = plt.subplots(1)
axs.hist([glm.predict(Xtrain).flatten(), ytrain], label=["predict", "true"])
axs.legend()
plt.show()
Here are the terminal outputs of the script (plus the data distribution).
[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864] 1.0
[0.23202493 0.63894841 0.49310422 0.38085685 0.0881567 ] 0.9942570869860101
-0.018872965358065263
@geektoni awesome! I tested a given example and it works for me! However, I still got a different outcome on an example in my comment somehow. I'm not sure why but the derivation looks good to me. Maybe, this is a problem from the optimizer?
@geektoni awesome! I tested a given example and it works for me! However, I still got a different outcome on an example in my comment somehow. I'm not sure why but the derivation looks good to me. Maybe, this is a problem from the optimizer?
I think it could depend on which value of theta
you chose to train the model. Here theta
is assumed to be known and it is not learned (theta
represents the "number of successes").
@geektoni I tried multiple theta
and try to compare the y_predict
with y
on the given example. I still have no chance of getting it right. Can you try to see if it works somehow?
@geektoni I tried multiple
theta
and try to compare they_predict
withy
on the given example. I still have no chance of getting it right. Can you try to see if it works somehow?
In the past days, I tried to obtain the same results of the example you posted a while ago. However, I am still not able to reach the same results :( I am puzzled.
Maybe @jasmainak @pavanramkumar could have better suggestions about it?
I am sorry for my long silence. I was overwhelmed by grant writing business.
My approach would be to first check if the R code gives comparable results for other distributions. Was this ever checked @pavanramkumar ? If it does, then we can consider the comparison for this distribution valid.
Please feel free to add me on google hangouts @geektoni . Sometimes it's easier to chat privately and try to converge.
closes #163
Thecdfast
is still missing.