scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

LogisticRegression with lbfgs solver terminates before convergence #18074

Closed geraschenko closed 1 year ago

geraschenko commented 4 years ago

LogisticRegression with the lbfgs solver terminates early, even when tol is decreased and max_iter has not been reached.

Code to Reproduce

We fit random data twice, changing only the order of the examples. Ideally, example order should not matter; the fitted coefficients should be the same either way. I produced the results below with this code in Colab.

from sklearn.linear_model import LogisticRegression
import numpy as np

n_features = 1000
n_examples = 1500

np.random.seed(0)
x = np.random.random((n_examples, n_features))
y = np.random.randint(2, size=n_examples)
max_iter = 1000
solver = 'lbfgs'

for tol in [1e-2, 1e-3, 1e-4, 1e-5]:
  np.random.seed(0)
  lr1 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x, y)

  np.random.seed(0)
  lr2 = LogisticRegression(solver=solver, max_iter=max_iter, tol=tol).fit(x[::-1], y[::-1])

  print(f'tol={tol}')
  print(f'  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.')
  print(f'  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}')

Expected Results

As tol is reduced, the difference between the coefficients should keep decreasing, provided max_iter is not hit. When solver is changed to 'newton-cg', we get the expected behavior:

tol=0.01
  Optimizer iterations, forward order: 12, reverse order: 11.
  Mean absolute diff in coefficients: 0.0004846833304941047
tol=0.001
  Optimizer iterations, forward order: 15, reverse order: 14.
  Mean absolute diff in coefficients: 5.4776672871601846e-05
tol=0.0001
  Optimizer iterations, forward order: 19, reverse order: 16.
  Mean absolute diff in coefficients: 1.6047945654930538e-06
tol=1e-05
  Optimizer iterations, forward order: 19, reverse order: 17.
  Mean absolute diff in coefficients: 2.76826465093659e-07

Actual Results

Once tol drops below 1e-3, the optimizer does not take any more steps, even though it has not converged:

tol=0.01
  Optimizer iterations, forward order: 362, reverse order: 376.
  Mean absolute diff in coefficients: 0.0007590864459748883
tol=0.001
  Optimizer iterations, forward order: 373, reverse order: 401.
  Mean absolute diff in coefficients: 0.0006877678611572595
tol=0.0001
  Optimizer iterations, forward order: 373, reverse order: 401.
  Mean absolute diff in coefficients: 0.0006877678611572595
tol=1e-05
  Optimizer iterations, forward order: 373, reverse order: 401.
  Mean absolute diff in coefficients: 0.0006877678611572595

Versions

Output of sklearn.show_versions():

System:
    python: 3.6.9 (default, Jul 17 2020, 12:50:27)  [GCC 8.4.0]
executable: /usr/bin/python3
   machine: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic

Python dependencies:
          pip: 19.3.1
   setuptools: 49.2.0
      sklearn: 0.23.1
        numpy: 1.18.5
        scipy: 1.4.1
       Cython: 0.29.21
       pandas: 1.0.5
   matplotlib: 3.2.2
       joblib: 0.16.0
threadpoolctl: 2.1.0

Built with OpenMP: True

Diagnosis

I'm pretty sure the issue is in the call to scipy.optimize.minimize at this line in linear_model/_logistic.py. The value of tol is passed to minimize as gtol, but ftol and eps are left at their default values. In the example above, I think the optimizer is hitting the ftol termination condition. Possible solutions are discussed in the comments below.
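
For illustration, here is a minimal standalone sketch of that call pattern (not scikit-learn's actual code; the simplified loss function and names below are my own): tol is forwarded only as gtol, so L-BFGS-B is free to stop on its default ftol (relative reduction of f) test before the gradient criterion is met.

```python
# Minimal sketch (not scikit-learn's code) of an L2-penalized logistic fit via
# scipy.optimize.minimize, mirroring how tol is forwarded as gtol only.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.RandomState(0)
X = rng.random_sample((1500, 1000))
y = rng.randint(2, size=1500)

def loss_and_grad(w, X, y, alpha=1.0):
    # Numerically stable penalized logistic loss and its gradient (no intercept here).
    z = X @ w
    loss = np.sum(np.logaddexp(0.0, z) - y * z) + 0.5 * alpha * w @ w
    grad = X.T @ (expit(z) - y) + alpha * w
    return loss, grad

tol = 1e-5
res = minimize(
    loss_and_grad, np.zeros(X.shape[1]), args=(X, y), jac=True, method="L-BFGS-B",
    options={"gtol": tol, "maxiter": 1000},  # ftol (and eps) keep scipy's defaults
)
print(res.nit, res.message)  # may report the REL_REDUCTION_OF_F condition rather than PGTOL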

glemaitre commented 4 years ago

We can set verbose=1 and check whether the f or the g criterion triggers convergence:

```bash
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =         1001     M =           10
 This problem is unconstrained.

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27660D+02    |proj g|=  3.14005D-01
At iterate  150    f=  4.27494D+02    |proj g|=  7.75738D-02
At iterate  200    f=  4.27446D+02    |proj g|=  2.30958D-01
At iterate  250    f=  4.27201D+02    |proj g|=  1.24021D-02

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    271    327      1     0     0   8.862D-03   4.272D+02
  F = 427.20051812111285

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27817D+02    |proj g|=  3.67354D-01
At iterate  150    f=  4.27510D+02    |proj g|=  9.17857D-02
At iterate  200    f=  4.27490D+02    |proj g|=  1.67114D-01
At iterate  250    f=  4.27417D+02    |proj g|=  2.21615D-01
At iterate  300    f=  4.27224D+02    |proj g|=  4.14477D-01
At iterate  350    f=  4.27202D+02    |proj g|=  2.86234D-02

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    371    444      1     0     0   8.306D-03   4.272D+02
  F = 427.20052096654911

CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=0.01
  Optimizer iterations, forward order: 271, reverse order: 371.
  Mean absolute diff in coefficients: 0.000274158784978502
max_iter=[271], [371]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27660D+02    |proj g|=  3.14005D-01
At iterate  150    f=  4.27494D+02    |proj g|=  7.75738D-02
At iterate  200    f=  4.27446D+02    |proj g|=  2.30958D-01
At iterate  250    f=  4.27201D+02    |proj g|=  1.24021D-02

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    297    359      1     0     0   1.382D-02   4.272D+02
  F = 427.20023205690671

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

At iterate    0    f=  1.03972D+03    |proj g|=  2.38301D+01
At iterate   50    f=  4.47023D+02    |proj g|=  5.77432D+00
At iterate  100    f=  4.27817D+02    |proj g|=  3.67354D-01
At iterate  150    f=  4.27510D+02    |proj g|=  9.17857D-02
At iterate  200    f=  4.27490D+02    |proj g|=  1.67114D-01
At iterate  250    f=  4.27417D+02    |proj g|=  2.21615D-01
At iterate  300    f=  4.27224D+02    |proj g|=  4.14477D-01
At iterate  350    f=  4.27202D+02    |proj g|=  2.86234D-02

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    383    457      1     0     0   8.612D-03   4.272D+02
  F = 427.20036807519477

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=0.001
  Optimizer iterations, forward order: 297, reverse order: 383.
  Mean absolute diff in coefficients: 0.00021345692768271518
max_iter=[297], [383]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    297    359      1     0     0   1.382D-02   4.272D+02
  F = 427.20023205690671

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    383    457      1     0     0   8.612D-03   4.272D+02
  F = 427.20036807519477

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=0.0001
  Optimizer iterations, forward order: 297, reverse order: 383.
  Mean absolute diff in coefficients: 0.00021345692768271518
max_iter=[297], [383]

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    297    359      1     0     0   1.382D-02   4.272D+02
  F = 427.20023205690671

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s finished

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
RUNNING THE L-BFGS-B CODE

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
 1001    383    457      1     0     0   8.612D-03   4.272D+02
  F = 427.20036807519477

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s finished

tol=1e-05
  Optimizer iterations, forward order: 297, reverse order: 383.
  Mean absolute diff in coefficients: 0.00021345692768271518
max_iter=[297], [383]
```
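
A trace like the one above can presumably be obtained by re-running the reproduction script with verbose=1 (this snippet is my reconstruction of how the log was likely produced, not copied from the thread; verbose=1 makes lbfgs print its iterates and termination reason):

```python
# Re-run the original reproduction with verbose=1 to see the L-BFGS-B trace.
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(0)
x = np.random.random((1500, 1000))
y = np.random.randint(2, size=1500)

for tol in [1e-2, 1e-3, 1e-4, 1e-5]:
    lr1 = LogisticRegression(solver="lbfgs", max_iter=1000, tol=tol, verbose=1).fit(x, y)
    lr2 = LogisticRegression(solver="lbfgs", max_iter=1000, tol=tol, verbose=1).fit(x[::-1], y[::-1])
    print(f"tol={tol}")
    print(f"  Optimizer iterations, forward order: {lr1.n_iter_[0]}, reverse order: {lr2.n_iter_[0]}.")
    print(f"  Mean absolute diff in coefficients: {np.abs(lr1.coef_ - lr2.coef_).mean()}")
```
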
glemaitre commented 4 years ago

With the larger tol values we reach the gtol convergence (CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL), while with the smaller tol values we reach the ftol convergence (CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH).

glemaitre commented 4 years ago

Indeed, I am not sure how we should expose these tolerances. ping @jnothman @agramfort @GaelVaroquaux @ogrisel @amueller

lorentzenchr commented 4 years ago

It doesn't solve the convergence issues of lbfgs, but note that the given example is an optimization problem with 1000+1 parameters (n_features + intercept), of which 1000 do not influence the objective at all (or I have terribly missed something), since y does not depend on x.

Edit: The L2 penalty term of the objective function does depend on the 1000 parameters.
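
For reference, the objective lbfgs is minimizing here is, in scikit-learn's parametrization of binary L2-penalized logistic regression (with labels recoded to $\tilde y_i \in \{-1, 1\}$),

$$\min_{w,\,c}\ \frac{1}{2}\, w^\top w + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-\tilde y_i \left(x_i^\top w + c\right)\right)\right),$$

so even though y is independent of x, all 1000 components of w still enter the problem through the ridge penalty term.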

geraschenko commented 4 years ago

@lorentzenchr That's correct. This is random data with random labels, meant simply to demonstrate the convergence issue. If you instead use labels y = (x.mean(axis=1) > .5), you get the same issue that lbfgs terminates after a fixed number of iterations regardless of how small you make tol (though the absolute difference in fit coefficients is less dramatic).
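
A sketch of that variant (the same reproduction as above with only the labels changed; the exact plateau point may differ by environment):

```python
# Variant of the reproduction with labels that actually depend on x, to check
# whether n_iter_ still plateaus as tol shrinks.
from sklearn.linear_model import LogisticRegression
import numpy as np

np.random.seed(0)
x = np.random.random((1500, 1000))
y = (x.mean(axis=1) > 0.5).astype(int)

for tol in [1e-3, 1e-5, 1e-7]:
    lr = LogisticRegression(solver="lbfgs", max_iter=1000, tol=tol).fit(x, y)
    print(f"tol={tol}: n_iter_={lr.n_iter_[0]}")
```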

agramfort commented 4 years ago

eps is not relevant here since the gradient function is provided. Passing ftol: tol in options would fix the problem. Whether it should instead be a fraction of tol, or something else, is not obvious.

I would be fine with adding ftol: tol.
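
A minimal sketch of that option change (illustrative only, using a toy objective rather than the actual logistic loss):

```python
# Toy illustration of forwarding the user-facing tol to both L-BFGS-B criteria.
import numpy as np
from scipy.optimize import minimize

def fun(w):
    # simple smooth convex stand-in for the penalized logistic objective
    return np.sum((w - 1.0) ** 2), 2.0 * (w - 1.0)

tol = 1e-6
res = minimize(fun, np.zeros(5), jac=True, method="L-BFGS-B",
               options={"gtol": tol, "ftol": tol, "maxiter": 1000})
print(res.nit, res.message)
```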

lorentzenchr commented 2 years ago

This might also be related to #24752.

ogrisel commented 1 year ago

I opened #27191, where I set gtol to tol (as before) and ftol proportional to gtol, so that its default value stays the same as before for the default tol=1e-4. It seems to work fine in the sense that none of the existing tests break.
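
A rough sketch of that scaling idea (my own reconstruction, not the code in the PR): scipy's default ftol is factr * eps with factr=1e7, i.e. about 2.22e-9, and one can recover exactly that at the default tol=1e-4 by making ftol proportional to tol.

```python
# Illustrative reconstruction of "ftol proportional to gtol"; the constant is
# chosen so that tol=1e-4 reproduces scipy's default ftol (~2.22e-9).
import numpy as np

def lbfgs_tolerances(tol):
    eps = np.finfo(float).eps        # ~2.22e-16
    default_ftol = 1e7 * eps         # scipy's default factr=1e7  ->  ftol ~ 2.22e-9
    return {"gtol": tol, "ftol": default_ftol * (tol / 1e-4)}

print(lbfgs_tolerances(1e-4))  # ftol equals scipy's default
print(lbfgs_tolerances(1e-6))  # ftol shrinks with tol, so lbfgs keeps iterating
```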

lorentzenchr commented 1 year ago

With recent master, i.e. with #26721 merged, and extending the experiment down to tol=1e-7, I get:

tol=0.01
  Optimizer iterations, forward order: 12, reverse order: 12.
  Mean absolute diff in coefficients: 7.475020494078499e-16
tol=0.001
  Optimizer iterations, forward order: 61, reverse order: 61.
  Mean absolute diff in coefficients: 9.457888598980843e-10
tol=0.0001
  Optimizer iterations, forward order: 116, reverse order: 110.
  Mean absolute diff in coefficients: 0.0029428192009488627
tol=1e-05
  Optimizer iterations, forward order: 332, reverse order: 323.
  Mean absolute diff in coefficients: 0.000762248896287633
tol=1e-06
  Optimizer iterations, forward order: 423, reverse order: 401.
  Mean absolute diff in coefficients: 8.578505782753616e-05
tol=1e-07
  Optimizer iterations, forward order: 633, reverse order: 473.
  Mean absolute diff in coefficients: 6.8633468157778965e-06

This looks good, so I consider the issue solved.