Closed: Tianbo-Diao closed this issue 5 months ago.
I gave a quick look at the implementation of the Poisson datafit; at first glance, it seems that a $0$ value for $y$ should not be a problem.
Are you aware of specific problems/different formulas when the target $y$ takes the value $0$? (I ran into this, which links to this.)
If not, we can switch the `np.any(y <= 0)` to `np.any(y < 0)`
(https://github.com/scikit-learn-contrib/skglm/blob/8aea6117db5f890b1b28af876ff3dfaefd3723b0/skglm/datafits/single_task.py#L435).
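For concreteness, a minimal sketch of the proposed one-line relaxation; the function name and error message below are illustrative, only the `np.any(y <= 0)` → `np.any(y < 0)` change comes from the discussion:

```python
import numpy as np

def check_poisson_target(y):
    # proposed relaxation: reject only strictly negative targets, so that
    # y == 0 (a perfectly valid Poisson outcome) passes the check
    if np.any(y < 0):  # was: np.any(y <= 0)
        raise ValueError("Poisson datafit requires non-negative targets.")
```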
Thanks for your reply. As you said, Poisson regression runs into this when the loss function (the negative log-likelihood, up to constants) used is $\frac{1}{n} L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i \leq n} \left[ e^{x_i^{\top} \beta} - (x_i^{\top} \beta)\, y_i \right]$ (I wrote a similar program before and, when testing it, got this warning):

```
RuntimeWarning: overflow encountered in exp
  return ( np.sum(np.exp(self.X @ beta)) - self.Y.T @ self.X @ beta ) / self.n
```
Thus, taking the logarithm may be a better choice, but then you have to deal with the case where the response variable is zero.
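A standalone sketch reproducing the issue; this is not skglm code, and the data, `naive_loss`, `clipped_loss`, and the clipping bound are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = 100 * rng.normal(size=(n, p))  # large features make X @ beta large
beta = rng.normal(size=p)
y = rng.poisson(lam=1.0, size=n)

def naive_loss(beta):
    # (1/n) * sum_i [exp(x_i^T beta) - (x_i^T beta) * y_i]
    z = X @ beta
    return (np.sum(np.exp(z)) - y @ z) / n

def clipped_loss(beta, bound=30.0):
    # one common mitigation: clip the linear predictor before exponentiating;
    # the bound of 30 is an arbitrary illustrative choice
    z = np.clip(X @ beta, -bound, bound)
    return (np.sum(np.exp(z)) - y @ z) / n

print(naive_loss(beta))    # typically triggers "RuntimeWarning: overflow encountered in exp"
print(clipped_loss(beta))  # finite
```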
I tried to follow your suggestion and switched the `np.any(y <= 0)` to `np.any(y < 0)` (https://github.com/scikit-learn-contrib/skglm/blob/8aea6117db5f890b1b28af876ff3dfaefd3723b0/skglm/datafits/single_task.py#L435), and a new problem arose:

```
File "D:\Python\Lib\site-packages\skglm\solvers\anderson_cd.py", line 83, in solve
    lipschitz = datafit.get_lipschitz(X, y)
                ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Poisson' object has no attribute 'get_lipschitz'
```
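For what it's worth, the traceback points at the default coordinate-descent solver (`anderson_cd.py`), which asks the datafit for coordinate-wise Lipschitz constants; the exponential Poisson loss has no global ones. A sketch of the workaround that appears later in this thread, passing a `ProxNewton` solver explicitly (the `alpha` value here is a placeholder):

```python
from skglm import GeneralizedLinearEstimator, datafits, penalties
from skglm.solvers import ProxNewton

# use a Newton-type solver instead of the default coordinate-descent
# solver, which requires datafit.get_lipschitz
estimator = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1.0),  # placeholder; see the alpha_max discussion below
    solver=ProxNewton(fit_intercept=False),
)
```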
I forgot to include the code that generates the data; it is in the `simulator` file, as follows:
```python
import numpy as np

rng = np.random.default_rng()

def simu_X(n, p, example):
    if example == 'iid':
        return rng.normal(size=(n, p))
    elif example == 'hete':
        mean_vec = np.zeros(p)
        cov_mat = 0.5 ** np.abs(np.arange(1, p + 1).reshape(-1, 1) - np.arange(1, p + 1))
        return rng.multivariate_normal(mean_vec, cov_mat, n, method='cholesky')

def simu_Y(X, n, beta, type):
    match type:
        case 'linear':
            return X @ beta + rng.normal(size=n)
        case 'logistic':
            probe = 1 / (1 + np.exp(-X @ beta))
            return rng.binomial(n=1, p=probe, size=n)
        case 'poisson':
            probe = np.exp(X @ beta)
            return rng.poisson(lam=probe, size=n)
```
When using skglm for generalized linear modeling, the accuracy of the linear regression results seems OK, but the accuracy of the logistic regression and Poisson regression results is poorer; this phenomenon also occurs with the glmnet and ncvreg packages in R. Is there a better explanation for this?
Thanks again for your reply.
@Tianbo-Diao could you edit your messages to format your code with triple backticks ``` so that it renders properly and indentation is not lost? Like this:

```python
def f():
    return 1
```

You can also use the code button above the message box.
I'm sorry for any trouble caused by my not being proficient at submitting issues.
The code is run from the file `main`:
```python
import numpy as np
from time import time
from skglm import GeneralizedLinearEstimator
from skglm import datafits
from skglm import penalties
# loading other Python files
import simulator

n = 200
p = 5000
beta_value = np.array([5, 3, 0, 0, -2, 0] + [0] * (p - 6))
# For other scenarios, uncomment and use appropriate beta values:
# iid: np.array([0, 1.8, -2, 0, 1.4, 0] + [0] * (p - 6))
# hete: np.array([5, 3, 0, 0, -2, 0] + [0] * (p - 6))

# simulate data
start_time = time()
popu_X = simulator.simu_X(n, p, example='hete')
popu_Y = simulator.simu_Y(popu_X, n, beta=beta_value, type='poisson')
print('simulate data time:', time() - start_time)

# start_time = time()
result = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1),
).fit(X=popu_X, y=popu_Y).coef_
print('processing time:', time() - start_time)
print(np.where(result != 0)[0])
print(result[np.where(result != 0)[0]])
```
The data is generated in the file `simulator`:
```python
import numpy as np

rng = np.random.default_rng()

def simu_X(n, p, example):
    if example == 'iid':
        return rng.normal(size=(n, p))
    elif example == 'hete':
        mean_vec = np.zeros(p)
        cov_mat = 0.5 ** np.abs(np.arange(1, p + 1).reshape(-1, 1) - np.arange(1, p + 1))
        return rng.multivariate_normal(mean_vec, cov_mat, n, method='cholesky')

def simu_Y(X, n, beta, type):
    match type:
        case 'linear':
            return X @ beta + rng.normal(size=n)
        case 'logistic':
            probe = 1 / (1 + np.exp(-X @ beta))
            return rng.binomial(n=1, p=probe, size=n)
        case 'poisson':
            probe = np.exp(X @ beta)
            return rng.poisson(lam=probe, size=n)
```
@Tianbo-Diao what do you mean by "Thus, taking the logarithm may be a better choice"? You would like to minimize the log of the current loss?
@Tianbo-Diao I think you are asking to solve a problem that is too difficult: for your problem, the $\alpha_{\max}$ value is around $10^6$, and you take $\alpha = 1$. By choosing more reasonable values of $\alpha$, e.g., $\alpha_{\max} / 100$ (see the example in this PR https://github.com/scikit-learn-contrib/skglm/pull/252), the problem is solved correctly.
> @Tianbo-Diao what do you mean by "Thus, taking the logarithm may be a better choice"? You would like to minimize the log of the current loss?

Sorry, that is not what I meant; one cannot take the logarithm of the loss function.

> @Tianbo-Diao I think you are asking to solve a problem that is too difficult: for your problem, the $\alpha_{\max}$ value is around $10^6$, and you take $\alpha = 1$. By choosing more reasonable values of $\alpha$, e.g., $\alpha_{\max} / 100$ (see the example in this PR https://github.com/scikit-learn-contrib/skglm/pull/252), the problem is solved correctly.
Thanks for your help. I tried it, and the complete steps to solve it are as follows:

1. Switch the `np.any(y <= 0)` to `np.any(y < 0)` (link).
2. The fitting call in the `main` file is changed to:

```python
from skglm import solvers

result = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1e3),
    solver=solvers.ProxNewton(max_iter=1000, fit_intercept=False)
).fit(X=popu_X, y=popu_Y).coef_
```
There are two things to note:

- `max_iter=1000` is necessary (some other larger number would also be fine), because the default in `ProxNewton` is `max_iter=20`; running without changing it gives the warning:

```
ConvergenceWarning: ProxNewton did not converge for tol=1.000e-04 and max_iter=20. Consider increasing max_iter and/or tol.
```

- `fit_intercept=False`: I chose to disregard the intercept term, and the final result is a better fit to the true value of beta.

While I am asking: how was this $\alpha_{\max}$ obtained? Adjusting different $\alpha$ values (for both linear and logistic regression) gives somewhat different results. Is it possible to determine the proper $\alpha$ automatically?
@Tianbo-Diao there is a closed-form formula for $\alpha_{\max}$. When solving $$\min_x f(x) + \lambda \lVert x \rVert_1$$ with $f$ convex, if $\lambda \geq \lVert \nabla f(0) \rVert_\infty$, then the solution is $0$.
This comes from Fermat's rule, a.k.a. the first-order optimality condition.
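Concretely, for the Poisson loss written earlier, $f(\beta) = \frac{1}{n} \sum_{i \leq n} \left[ e^{x_i^{\top} \beta} - (x_i^{\top} \beta)\, y_i \right]$, the gradient at zero is $\nabla f(0) = \frac{1}{n} X^{\top} (\mathbf{1} - y)$, so $\alpha_{\max}$ can be computed in closed form; a sketch (the helper name is mine):

```python
import numpy as np

def poisson_alpha_max(X, y):
    # grad f(0) = X.T @ (exp(X @ 0) - y) / n = X.T @ (1 - y) / n;
    # the smallest alpha for which the L1-penalized solution is all zeros
    # is the sup-norm of this gradient
    n = X.shape[0]
    return np.max(np.abs(X.T @ (1 - y))) / n

# usage: choose alpha as a fraction of alpha_max rather than an absolute value,
# e.g. alpha = poisson_alpha_max(popu_X, popu_Y) / 100
```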
I will close this issue now, as the problem has been resolved.
Feel free to open another issue if you encounter other problems, and please advertise skglm's capabilities around you! :rocket:
Shall we close the issue?
Okay, it's a pleasure to have this conversation with you. Thank you for your sincere reply.
System: Windows 10; Python version: 3.11.8
code:
In Poisson regression, using the above code gives the following prompt:
However, in my opinion, the response variable $y$ in Poisson regression should be able to take the value $0$. I wonder whether this prompt is related to the specific algorithm or some other setting? Is it possible to modify it so that $y$ can take the value $0$?