scikit-learn-contrib / skglm

Fast and modular sklearn replacement for generalized linear models
http://contrib.scikit-learn.org/skglm
BSD 3-Clause "New" or "Revised" License

Poisson regression problems in skglm #251

Closed: Tianbo-Diao closed this issue 5 months ago

Tianbo-Diao commented 5 months ago

System: Windows 10, Python version: 3.11.8

code:

import numpy as np
from time import time
from skglm import GeneralizedLinearEstimator
from skglm import datafits
from skglm import penalties

# loading other Python files
import simulator

n = 200
p = 5000
beta_value = np.array([5, 3, 0, 0, -2, 0] + [0] * (p - 6))

# simulate data
start_time = time()
popu_X = simulator.simu_X(n, p, example='hete')
popu_Y = simulator.simu_Y(popu_X, n, beta=beta_value, type='poisson')
print('simulate data time:', time() - start_time)

# fit a Poisson model with an L1 penalty
start_time = time()
result = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1),
).fit(X=popu_X, y=popu_Y).coef_
print('processing time:', time() - start_time)
print(np.where(result != 0)[0])
print(result[np.where(result != 0)[0]])

Running the above code for Poisson regression raises:

ValueError: Target vector `y` should only take positive values when fitting a Poisson model.

However, in Poisson regression the response variable y should be able to take the value 0. Is this check related to the specific algorithm or some other setting? Could it be modified to allow y to take the value 0?

QB3 commented 5 months ago

I gave a quick look at the implementation of the Poisson datafit; at first glance, a value of $0$ for $y$ should not be a problem.

Are you aware of specific problems or different formulas when the target $y$ takes the value $0$? (I ran into this, which links to this.) If not, we can switch the np.any(y <= 0) to np.any(y < 0) (https://github.com/scikit-learn-contrib/skglm/blob/8aea6117db5f890b1b28af876ff3dfaefd3723b0/skglm/datafits/single_task.py#L435).
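
For reference, a minimal sketch of that change (the surrounding check and the updated message are assumptions based on the error above, not the exact file contents):

if np.any(y < 0):  # previously: np.any(y <= 0)
    raise ValueError(
        "Target vector `y` should only take non-negative values "
        "when fitting a Poisson model.")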


Tianbo-Diao commented 5 months ago

Thanks for your reply. As you said, Poisson regression encounters this if the loss function (the negative log-likelihood) used is $\frac{1}{n} L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i \leq n} \left[ e^{ {x}_i^{\top} {\beta} } - ({x}_i^{\top} {\beta}) y_i \right]$. I wrote a similar program before, and testing it produced this warning:

RuntimeWarning: overflow encountered in exp
  return ( np.sum(np.exp(self.X @ beta)) - self.Y.T @ self.X @ beta ) /self.n
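
(A minimal illustration of the overflow, independent of skglm: in float64, np.exp overflows to inf once its argument exceeds roughly 709.)

import numpy as np
np.exp(np.array([800.0]))  # RuntimeWarning: overflow encountered in exp -> array([inf])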

Thus, taking the logarithm may be a better choice, but then one has to deal with the case where the response variable is zero.

I tried to follow your suggestion, switching the np.any(y <= 0) to np.any(y < 0) (https://github.com/scikit-learn-contrib/skglm/blob/8aea6117db5f890b1b28af876ff3dfaefd3723b0/skglm/datafits/single_task.py#L435), and a new problem arose:

  File "D:\Python\Lib\site-packages\skglm\solvers\anderson_cd.py", line 83, in solve
    lipschitz = datafit.get_lipschitz(X, y)
                ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Poisson' object has no attribute 'get_lipschitz'
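
(For context: the error comes from the default coordinate-descent solver asking the datafit for Lipschitz constants. The fix later in this thread uses the ProxNewton solver, which avoids this code path; a minimal sketch, using only modules that appear in this thread:

from skglm import GeneralizedLinearEstimator, datafits, penalties
from skglm.solvers import ProxNewton

# ProxNewton does not need datafit.get_lipschitz, unlike the default solver
estimator = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1e3),
    solver=ProxNewton(fit_intercept=False),
)
)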

I forgot to include the code that generates the data; it's in the simulator file, as follows:

import numpy as np

rng = np.random.default_rng()

def simu_X(n, p, example):
    if example == 'iid':
        return rng.normal(size=(n, p))
    elif example == 'hete':
        # autoregressive covariance: cov[i, j] = 0.5 ** |i - j|
        mean_vec = np.zeros(p)
        cov_mat = 0.5 ** np.abs(np.arange(1, p + 1).reshape(-1, 1) - np.arange(1, p + 1))
        return rng.multivariate_normal(mean_vec, cov_mat, n, method='cholesky')

def simu_Y(X, n, beta, type):
    match type:
        case 'linear':
            return X @ beta + rng.normal(size=n)
        case 'logistic':
            probe = 1 / (1 + np.exp(-X @ beta))
            return rng.binomial(n=1, p=probe, size=n)
        case 'poisson':
            probe = np.exp(X @ beta)
            return rng.poisson(lam=probe, size=n)
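
For what it's worth, a quick check with this simulator (a sketch; counts vary with the random seed) shows why the original ValueError triggers: whenever the linear predictor is negative, the Poisson rate falls below 1 and the draws frequently include zeros.

import numpy as np
import simulator

n, p = 200, 5000
beta_value = np.array([5, 3, 0, 0, -2, 0] + [0] * (p - 6))
X = simulator.simu_X(n, p, example='hete')
y = simulator.simu_Y(X, n, beta=beta_value, type='poisson')
print(np.sum(y == 0))  # typically > 0, which trips the `np.any(y <= 0)` check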

When using skglm for generalized linear models, the accuracy of the linear regression results seems fine, but the accuracy of the logistic and Poisson regression results is poorer. The same phenomenon also appears with the glmnet and ncvreg packages in R. Is there a good explanation for this?

Thanks again for your reply.

mathurinm commented 5 months ago

@Tianbo-Diao could you edit your messages to format your code with triple backticks ``` so that it renders properly and indentation is not lost? Like this:

def f():
     return 1 

You can also use the code button above the message box.

Tianbo-Diao commented 5 months ago

I'm sorry for any trouble caused by my not being proficient at submitting issues.

The main file that runs the code:

import numpy as np
from time import time
from skglm import GeneralizedLinearEstimator
from skglm import datafits
from skglm import penalties

# loading other Python files
import simulator

n = 200
p = 5000
beta_value = np.array([5, 3, 0, 0, -2, 0] + [0] * (p - 6))
# For other scenarios, use the appropriate beta value:
# iid: np.array([0, 1.8, -2, 0, 1.4, 0] + [0] * (p - 6))
# hete: np.array([5, 3, 0, 0, -2, 0] + [0] * (p - 6))

# simulate data
start_time = time()
popu_X = simulator.simu_X(n, p, example='hete')
popu_Y = simulator.simu_Y(popu_X, n, beta=beta_value, type='poisson')
print('simulate data time:', time() - start_time)

# fit a Poisson model with an L1 penalty
start_time = time()
result = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1),
).fit(X=popu_X, y=popu_Y).coef_
print('processing time:', time() - start_time)
print(np.where(result != 0)[0])
print(result[np.where(result != 0)[0]])

The simulator file that generates the data:

import numpy as np

rng = np.random.default_rng()

def simu_X(n, p, example):
    if example == 'iid':
        return rng.normal(size=(n, p))
    elif example == 'hete':
        # autoregressive covariance: cov[i, j] = 0.5 ** |i - j|
        mean_vec = np.zeros(p)
        cov_mat = 0.5 ** np.abs(np.arange(1, p + 1).reshape(-1, 1) - np.arange(1, p + 1))
        return rng.multivariate_normal(mean_vec, cov_mat, n, method='cholesky')

def simu_Y(X, n, beta, type):
    match type:
        case 'linear':
            return X @ beta + rng.normal(size=n)
        case 'logistic':
            probe = 1 / (1 + np.exp(-X @ beta))
            return rng.binomial(n=1, p=probe, size=n)
        case 'poisson':
            probe = np.exp(X @ beta)
            return rng.poisson(lam=probe, size=n)

mathurinm commented 5 months ago

@Tianbo-Diao what do you mean by "Thus, taking the logarithm may be a better choice"? Would you like to minimize the log of the current loss?

QB3 commented 5 months ago

@Tianbo-Diao I think you are asking to solve a problem that is too difficult: for your problem, the $\alpha_{\max}$ value is around $10^6$, and you take $\alpha = 1$. By choosing a more reasonable value of $\alpha$, e.g., $\alpha_{\max} / 100$ (see the example in this PR https://github.com/scikit-learn-contrib/skglm/pull/252), the problem is solved correctly.

Tianbo-Diao commented 5 months ago

@Tianbo-Diao what do you mean by "Thus, taking the logarithm may be a better choice"? Would you like to minimize the log of the current loss?

Sorry, that was not my intention; one cannot take the logarithm of the loss function.

Tianbo-Diao commented 5 months ago

@Tianbo-Diao I think you are asking to solve a problem that is too difficult: for your problem, the $\alpha_{\max}$ value is around $10^6$, and you take $\alpha = 1$. By choosing a more reasonable value of $\alpha$, e.g., $\alpha_{\max} / 100$ (see the example in this PR https://github.com/scikit-learn-contrib/skglm/pull/252), the problem is solved correctly.

Thanks for your help. I tried it and the complete steps to solve it are as follows:

  1. Switch the np.any(y <= 0) to np.any(y < 0) link.
  2. Change the code in the main file to:

from skglm import solvers  # this import is additionally needed

result = GeneralizedLinearEstimator(
    datafit=datafits.Poisson(),
    penalty=penalties.L1(alpha=1e3),
    solver=solvers.ProxNewton(max_iter=1000, fit_intercept=False),
).fit(X=popu_X, y=popu_Y).coef_

There are two things to note: first, how was this $\alpha_{\max}$ obtained when you diagnosed my problem? Adjusting different $\alpha$ values (for both linear and logistic regression) gives somewhat different results. Second, is it possible to determine a proper $\alpha$ automatically?

mathurinm commented 5 months ago

@Tianbo-Diao there is a closed-form formula for alpha max. When solving $$\min_x f(x) + \lambda \|x\|_1$$ with $f$ convex, if $\lambda \geq \|\nabla f(0)\|_\infty$, then the solution is 0.

This comes from Fermat's rule, a.k.a. the first-order optimality condition.
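
A minimal sketch of this computation for the Poisson + L1 problem above, assuming the datafit $f(\beta) = \frac{1}{n} \sum_{i \leq n} \left[ e^{x_i^{\top} \beta} - (x_i^{\top} \beta) y_i \right]$ written earlier in the thread, so that $\nabla f(0) = X^{\top} (\mathbf{1} - y) / n$:

import numpy as np

def alpha_max_poisson(X, y):
    # gradient of the Poisson datafit at beta = 0 is X.T @ (1 - y) / n;
    # by Fermat's rule, 0 solves the penalized problem for any alpha >= its infinity norm
    n = X.shape[0]
    return np.max(np.abs(X.T @ (np.ones(n) - y))) / n

alpha = alpha_max_poisson(popu_X, popu_Y) / 100  # a more reasonable alpha, as suggested above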

I will close this issue now, as the original problem has been solved.

Feel free to open another issue if you encounter other problems, and please advertise skglm's capabilities around you! :rocket:

QB3 commented 5 months ago

Shall we close the issue?

Tianbo-Diao commented 5 months ago

Okay, it has been a pleasure to have this conversation with you. Thank you for your sincere replies.