scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

LogisticRegression with lbfgs solver terminates before convergence #18074

Closed geraschenko closed 1 year ago

geraschenko commented 4 years ago

LogisticRegression with the lbfgs solver terminates early, even when tol is decreased and max_iter has not been reached.

Code to Reproduce

We fit random data twice, changing only the order of the examples. Ideally, example order should not matter; the fitted coefficients should be the same either way. I produced the results below with this code in Colab.

from sklearn.linear_model import LogisticRegression
import numpy as np

n_features = 1000
n_examples = 1500

np.random.seed(0)
x = np.random.random((n_examples, n_features))
y = np.random.randint(2, size=n_examples)
max_iter = 1000
solver = 'lbfgs'

for tol in [1e-2, 1e-3, 1e-4, 1e-5]:
  np.random.seed(0)
  lr1 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x, y)

  np.random.seed(0)
  lr2 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x[::-1], y[::-1])

  print(f'tol={tol}')
  print(f'  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.')
  print(f'  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}')

Expected Results

As tol is reduced, the difference between the coefficients should keep decreasing, provided max_iter is not hit. When solver is changed to 'newton-cg', we get the expected behavior:

tol=0.01
  Optimizer iterations, forward order: 12, reverse order: 11.
  Mean absolute diff in coefficients: 0.0004846833304941047
tol=0.001
  Optimizer iterations, forward order: 15, reverse order: 14.
  Mean absolute diff in coefficients: 5.4776672871601846e-05
tol=0.0001
  Optimizer iterations, forward order: 19, reverse order: 16.
  Mean absolute diff in coefficients: 1.6047945654930538e-06
tol=1e-05
  Optimizer iterations, forward order: 19, reverse order: 17.
  Mean absolute diff in coefficients: 2.76826465093659e-07

Actual Results

Once tol drops below 1e-3, the optimizer does not take any more steps, even though it has not converged:

tol=0.01
  Optimizer iterations, forward order: 362, reverse order: 376.
  Mean absolute diff in coefficients: 0.0007590864459748883
tol=0.001
  Optimizer iterations, forward order: 373, reverse order: 401.
  Mean absolute diff in coefficients: 0.0006877678611572595
tol=0.0001
  Optimizer iterations, forward order: 373, reverse order: 401.
  Mean absolute diff in coefficients: 0.0006877678611572595
tol=1e-05
  Optimizer iterations, forward order: 373, reverse order: 401.
  Mean absolute diff in coefficients: 0.0006877678611572595

Versions

Output of sklearn.show_versions():

System:
    python: 3.6.9 (default, Jul 17 2020, 12:50:27)  [GCC 8.4.0]
executable: /usr/bin/python3
   machine: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 49.2.0
      sklearn: 0.23.1
        numpy: 1.18.5
        scipy: 1.4.1
       Cython: 0.29.21
       pandas: 1.0.5
   matplotlib: 3.2.2
       joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True

Diagnosis

I'm pretty sure the issue is in the call to scipy.optimize.minimize at this line in linear_model/_logistic.py. The value of tol is passed to minimize as gtol, but ftol and eps are left at their default values. In the example above, I think the optimizer is hitting the ftol termination condition. Possible solutions are discussed in the comments below.
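
For illustration, here is a minimal standalone sketch of that call pattern (not scikit-learn's actual code; the simplified loss function and names below are my own): tol is forwarded only as gtol, so L-BFGS-B is free to stop on its default ftol (relative reduction of f) test before the gradient criterion is met.

```python
# Minimal sketch (not scikit-learn's code) of an L2-penalized logistic fit via
# scipy.optimize.minimize, mirroring how tol is forwarded as gtol only.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.RandomState(0)
X = rng.random_sample((1500, 1000))
y = rng.randint(2, size=1500)

def loss_and_grad(w, X, y, alpha=1.0):
    # Numerically stable penalized logistic loss and its gradient (no intercept here).
    z = X @ w
    loss = np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * alpha * w @ w
    grad = X.T @ (expit(z) - y) + alpha * w
    return loss, grad

tol = 1e-5
res = minimize(
    loss_and_grad, np.zeros(X.shape[1]), args=(X, y), jac=True, method="L-BFGS-B",
    options={"gtol": tol, "maxiter": 1000},  # ftol (and eps) keep scipy's defaults
)
print(res.nit, res.message)  # may report the REL_REDUCTION_OF_F condition rather than PGTOL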

glemaitre commented 4 years ago

We can set verbose=1 and check whether the f or the g criterion triggers convergence:

```bash
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         1001     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27660D+02    |proj g|=  3.14005D-01
At iterate  150    f=  4.27494D+02    |proj g|=  7.75738D-02
At iterate  200    f=  4.27446D+02    |proj g|=  2.30958D-01
At iterate  250    f=  4.27201D+02    |proj g|=  1.24021D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    271    327      1     0     0   8.862D-03   4.272D+02
  F = 427.20051812111285

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27817D+02    |proj g|=  3.67354D-01
At iterate  150    f=  4.27510D+02    |proj g|=  9.17857D-02
At iterate  200    f=  4.27490D+02    |proj g|=  1.67114D-01
At iterate  250    f=  4.27417D+02    |proj g|=  2.21615D-01
At iterate  300    f=  4.27224D+02    |proj g|=  4.14477D-01
At iterate  350    f=  4.27202D+02    |proj g|=  2.86234D-02

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    371    444      1     0     0   8.306D-03   4.272D+02
  F = 427.20052096654911

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=0.01
  Optimizer iterations, forward order: 271, reverse order: 371.
  Mean absolute diff in coefficients: 0.000274158784978502
max_iter=[271], [371]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27660D+02    |proj g|=  3.14005D-01
At iterate  150    f=  4.27494D+02    |proj g|=  7.75738D-02
At iterate  200    f=  4.27446D+02    |proj g|=  2.30958D-01
At iterate  250    f=  4.27201D+02    |proj g|=  1.24021D-02

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    297    359      1     0     0   1.382D-02   4.272D+02
  F = 427.20023205690671

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27817D+02    |proj g|=  3.67354D-01
At iterate  150    f=  4.27510D+02    |proj g|=  9.17857D-02
At iterate  200    f=  4.27490D+02    |proj g|=  1.67114D-01
At iterate  250    f=  4.27417D+02    |proj g|=  2.21615D-01
At iterate  300    f=  4.27224D+02    |proj g|=  4.14477D-01
At iterate  350    f=  4.27202D+02    |proj g|=  2.86234D-02

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    383    457      1     0     0   8.612D-03   4.272D+02
  F = 427.20036807519477

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=0.001
  Optimizer iterations, forward order: 297, reverse order: 383.
  Mean absolute diff in coefficients: 0.00021345692768271518
max_iter=[297], [383]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    297    359      1     0     0   1.382D-02   4.272D+02
  F = 427.20023205690671

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    383    457      1     0     0   8.612D-03   4.272D+02
  F = 427.20036807519477

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=0.0001
  Optimizer iterations, forward order: 297, reverse order: 383.
  Mean absolute diff in coefficients: 0.00021345692768271518
max_iter=[297], [383]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    297    359      1     0     0   1.382D-02   4.272D+02
  F = 427.20023205690671

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    383    457      1     0     0   8.612D-03   4.272D+02
  F = 427.20036807519477

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=1e-05
  Optimizer iterations, forward order: 297, reverse order: 383.
  Mean absolute diff in coefficients: 0.00021345692768271518
max_iter=[297], [383]
```
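
A trace like the one above can presumably be obtained by re-running the reproduction script with verbose=1 (this snippet is my reconstruction of how the log was likely produced, not copied from the thread; verbose=1 makes lbfgs print its iterates and termination reason):

```python
# Re-run the original reproduction with verbose=1 to see the L-BFGS-B trace.
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(0)
x = np.random.random((1500, 1000))
y = np.random.randint(2, size=1500)

for tol in [1e-2, 1e-3, 1e-4, 1e-5]:
    lr1 = LogisticRegression(solver="lbfgs", max_iter=1000, tol=tol, verbose=1).fit(x, y)
    lr2 = LogisticRegression(solver="lbfgs", max_iter=1000, tol=tol, verbose=1).fit(x[::-1], y[::-1])
    print(f"tol={tol}")
    print(f"  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.")
    print(f"  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}")
```
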
glemaitre commented 4 years ago

With the larger tol values we reach the gtol convergence (CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL), while with the smaller tol values we reach the ftol convergence (CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH).

glemaitre commented 4 years ago

Indeed, I am not sure how we should expose these tolerances. ping @jnothman @agramfort @GaelVaroquaux @ogrisel @amueller

lorentzenchr commented 4 years ago

It doesn't solve the convergence issues of lbfgs, but note that the given example is an optimization problem with 1000+1 parameters (n_features + intercept), of which 1000 do not influence the objective at all (or I have terribly missed something), since y does not depend on x.

Edit: The L2 penalty term of the objective function does depend on the 1000 parameters.
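
For reference, the objective lbfgs is minimizing here is, in scikit-learn's parametrization of binary L2-penalized logistic regression (with labels recoded to $\tilde y_i \in \{-1, 1\}$),

$$\min_{w,\,c}\ \frac{1}{2}\, w^\top w + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-\tilde y_i \left(x_i^\top w + c\right)\right)\right),$$

so even though y is independent of x, all 1000 components of w still enter the problem through the ridge penalty term.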

geraschenko commented 4 years ago

@lorentzenchr That's correct. This is random data with random labels, meant simply to demonstrate the convergence issue. If you instead use labels y = (x.mean(axis=1) > .5), you get the same issue that lbfgs terminates after a fixed number of iterations regardless of how small you make tol (though the absolute difference in fit coefficients is less dramatic).
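
A sketch of that variant (the same reproduction as above with only the labels changed; the exact plateau point may differ by environment):

```python
# Variant of the reproduction with labels that actually depend on x, to check
# whether n_iter_ still plateaus as tol shrinks.
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(0)
x = np.random.random((1500, 1000))
y = (x.mean(axis=1) > 0.5).astype(int)

for tol in [1e-3, 1e-5, 1e-7]:
    lr = LogisticRegression(solver="lbfgs", max_iter=1000, tol=tol).fit(x, y)
    print(f"tol={tol}: n_iter_={lr.n_iter_[0]}")
```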

agramfort commented 4 years ago

eps is not relevant here since the gradient function is provided. Passing ftol: tol in options would fix the problem. Whether it should instead be a fraction of tol, or something else, is not obvious.

I would be fine with adding ftol: tol.
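
A minimal sketch of that option change (illustrative only, using a toy objective rather than the actual logistic loss):

```python
# Toy illustration of forwarding the user-facing tol to both L-BFGS-B criteria.
import numpy as np
from scipy.optimize import minimize

def fun(w):
    # simple smooth convex stand-in for the penalized logistic objective
    return np.sum((w - 1.0) ** 2), 2.0 * (w - 1.0)

tol = 1e-6
res = minimize(fun, np.zeros(5), jac=True, method="L-BFGS-B",
               options={"gtol": tol, "ftol": tol, "maxiter": 1000})
print(res.nit, res.message)
```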

lorentzenchr commented 2 years ago

This might also be related to #24752.

ogrisel commented 1 year ago

I opened #27191, where I set gtol to tol (as before) and ftol proportional to gtol, so that its default value stays the same as before for the default tol=1e-4. It seems to work fine in the sense that none of the existing tests break.
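
A rough sketch of that scaling idea (my own reconstruction, not the code in the PR): scipy's default ftol is factr * eps with factr=1e7, i.e. about 2.22e-9, and one can recover exactly that at the default tol=1e-4 by making ftol proportional to tol.

```python
# Illustrative reconstruction of "ftol proportional to gtol"; the constant is
# chosen so that tol=1e-4 reproduces scipy's default ftol (~2.22e-9).
import numpy as np

def lbfgs_tolerances(tol):
    eps = np.finfo(float).eps        # ~2.22e-16
    default_ftol = 1e7 * eps         # scipy's default factr=1e7  ->  ftol ~ 2.22e-9
    return {"gtol": tol, "ftol": default_ftol * (tol / 1e-4)}

print(lbfgs_tolerances(1e-4))  # ftol equals scipy's default
print(lbfgs_tolerances(1e-6))  # ftol shrinks with tol, so lbfgs keeps iterating
```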

lorentzenchr commented 1 year ago

With recent master, i.e. with #26721 merged, and extending the experiment down to tol=1e-7, I get:

tol=0.01
  Optimizer iterations, forward order: 12, reverse order: 12.
  Mean absolute diff in coefficients: 7.475020494078499e-16
tol=0.001
  Optimizer iterations, forward order: 61, reverse order: 61.
  Mean absolute diff in coefficients: 9.457888598980843e-10
tol=0.0001
  Optimizer iterations, forward order: 116, reverse order: 110.
  Mean absolute diff in coefficients: 0.0029428192009488627
tol=1e-05
  Optimizer iterations, forward order: 332, reverse order: 323.
  Mean absolute diff in coefficients: 0.000762248896287633
tol=1e-06
  Optimizer iterations, forward order: 423, reverse order: 401.
  Mean absolute diff in coefficients: 8.578505782753616e-05
tol=1e-07
  Optimizer iterations, forward order: 633, reverse order: 473.
  Mean absolute diff in coefficients: 6.8633468157778965e-06

This looks good, so I consider the issue solved.