glm-tools / pyglmnet

Python implementation of elastic-net regularized generalized linear models
http://glm-tools.github.io/pyglmnet/
MIT License

convergence confusion #311

Open inferentialist opened 5 years ago

inferentialist commented 5 years ago

Hey there. I'm really excited about this package, but I've been seeing some unexpected convergence behavior and was hoping this might be a good place to ask for clarity. I'm happy to write this up in more detail, or dig into the code myself if necessary; let me know what would be most helpful.

Here's the short version of my scenario. I'm running a logistic group lasso with the divergence loss. I'm just ramping up the penalty parameter (lambda) across several orders of magnitude and, in the high-lambda regimes where the non-constant coefficients are forced to zero, I'm not seeing any convergence of the constant-term coefficient.

Tinkering with the tol parameter gets me the numerical behavior I would have expected, but it is orders of magnitude slower and (with the verbose parameter set to True) fails to report convergence.

I would have thought that with the non-constant terms forced to zero, estimation of the constant-term coefficient would be trivial. So, even with a loose tolerance, I'm surprised that my options are (a) the constant-term coefficient doesn't converge or (b) everything slows down dramatically.

As a point of comparison, the R glmnet package, while less fully featured, doesn't seem to suffer from the issue.

Let me know if you have any thoughts about how I might help or where I should start digging around in the code. Thanks!

jasmainak commented 5 years ago

Can I ask if you are using the development version? And which solver are you using?

It's a known issue for the cdfast solver. See: https://github.com/glm-tools/pyglmnet/issues/275 and https://github.com/glm-tools/pyglmnet/issues/289
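For reference, here's a minimal sketch of how the two solvers are selected on the GLM constructor (the reg_lambda value is arbitrary, just for illustration):

from pyglmnet import GLM

clf_bg = GLM(distr='binomial', reg_lambda=0.1, solver='batch-gradient')  # the default
clf_cd = GLM(distr='binomial', reg_lambda=0.1, solver='cdfast')          # affected by the issues above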

inferentialist commented 5 years ago

This reproduces the issue. I'm using mostly defaults; in particular, the "batch-gradient" solver. I assume I'm using the development version as well: git log dates this as a Jun 3 2019 commit.


# $ git log

# commit c962c840c4669ae69e623a2595eba40ae50e460d 
#                         (HEAD -> master, origin/master, origin/HEAD)
# Author: Mainak Jas <mainakjas@gmail.com>
# Date:   Mon Jun 3 23:56:36 2019 -0400

#     FIX test and tweak to whats_new

import pandas as pd
import numpy as np
from numpy.linalg import norm
import itertools
from pyglmnet import GLM

# read the data

df = pd.read_csv(r"https://dlennon.org/assets/repro/bddae97a.gz")

y = df['Churn']
X = df.drop('Churn', axis=1)

# group ids
pref    = [c.split('_')[0] for c in X.columns]
grp_id  = [j for j,r in enumerate(itertools.groupby(pref), start=1) for k in r[1]]

# fit models as a function of lambda
n_mod   = 13
models  = []
lam     = np.logspace(-6,6,num=n_mod)

model_params = dict(
  distr   = 'binomial', 
  group   = grp_id,
  tol     = 0.,
  verbose = 3
)

for i,l in enumerate(lam):
  clf = GLM(reg_lambda = l, **model_params)

  # warm start, if possible
  try:
    clf.beta0_  = models[i-1].beta0_
    clf.beta_   = models[i-1].beta_
  except IndexError:
    pass

  clf.fit(X.values, y.values)
  models.append(clf)

# ------------------------------------------------------------------------------
# These are the tol=0.5 results.  beta is zero, but beta0 is not constant across
# lambda and converges only very slowly toward -1.016114.

# In [68]: %paste                                                                                                                     
# pd.DataFrame({
#   'lambda'    : lam,
#   'beta0'     : [m.beta0_ for m in models], 
#   'norm_beta' : [norm(m.beta_) for m in models]
#   })
# ## -- End pasted text --
# Out[68]: 
#             lambda     beta0  norm_beta
# 0         0.000001 -0.042286   0.265220
# 1         0.000010 -0.075504   0.370958
# 2         0.000100 -0.092875   0.452565
# 3         0.001000 -0.104283   0.518814
# 4         0.010000 -0.114575   0.541784
# 5         0.100000 -0.139812   0.238829
# 6         1.000000 -0.372410   0.000000
# 7        10.000000 -0.427910   0.000000
# 8       100.000000 -0.478218   0.000000
# 9      1000.000000 -0.523872   0.000000
# 10    10000.000000 -0.565348   0.000000
# 11   100000.000000 -0.603068   0.000000
# 12  1000000.000000 -0.637410   0.000000

# ------------------------------------------------------------------------------

# Taking tol = 0.1 gives the expected behavior, but it is much slower and no
# convergence (per verbose=True) is reported for lambda > 0.1.

# In [70]: pd.DataFrame({ 
#     ...:   'lambda'    : lam, 
#     ...:   'beta0'     : [m.beta0_ for m in models],  
#     ...:   'norm_beta' : [norm(m.beta_) for m in models] 
#     ...:   }) 
#     ...:                                                                                                                            
# Out[70]: 
#             lambda     beta0  norm_beta
# 0         0.000001 -0.139408   0.798327
# 1         0.000010 -0.143974   0.834305
# 2         0.000100 -0.148182   0.866839
# 3         0.001000 -0.152151   0.892953
# 4         0.010000 -0.156752   0.879674
# 5         0.100000 -1.016114   0.000000
# 6         1.000000 -1.016114   0.000000
# 7        10.000000 -1.016114   0.000000
# 8       100.000000 -1.016114   0.000000
# 9      1000.000000 -1.016114   0.000000
# 10    10000.000000 -1.016114   0.000000
# 11   100000.000000 -1.016114   0.000000
# 12  1000000.000000 -1.016114   0.000000

jasmainak commented 5 years ago

hi @inferentialist sorry for the delay in getting back to you. I finally had a moment to look at this but I can't reproduce your problem. See my code here:

import pandas as pd
import numpy as np
from numpy.linalg import norm
import itertools
from pyglmnet import GLM

# read the data

df = pd.read_csv(r"https://dlennon.org/assets/repro/bddae97a.gz")

y = df['Churn']
X = df.drop('Churn', axis=1)

# group ids
pref    = [c.split('_')[0] for c in X.columns]
grp_id  = [j for j,r in enumerate(itertools.groupby(pref), start=1) for k in r[1]]

# fit models as a function of lambda
n_mod   = 13
models  = []
lam     = np.logspace(-6,6,num=n_mod)

model_params = dict(
  distr   = 'binomial', 
  group   = grp_id,
  tol     = 0.1,
  verbose = 3,
  max_iter = 1000
)

for i,l in enumerate(lam):
  clf = GLM(reg_lambda = l, **model_params)

  # warm start, if possible
  try:
    clf.beta0_  = models[i-1].beta0_
    clf.beta_   = models[i-1].beta_
  except IndexError:
    pass

  clf.fit(X.values, y.values)
  models.append(clf)
  print(clf.beta0_)

It seems that convergence is just fine even for lambda > 1. I tried both the latest master and the commit you are on. For tol=0.5 I would expect it not to converge, so nothing unexpected there. It seems a bit slow for tol=0.1, but not too bad for the dataset size you have. I have never used the R package, so I don't have a point of reference. But if you want to help write a faster solver, I would be very happy to review and merge that code into pyglmnet.

inferentialist commented 5 years ago

@jasmainak, thanks for taking a look. I was hoping you'd have a quick fix; sorry to have handed you a difficult-to-reproduce result. For what it's worth, I love the idea of what you're doing here with pyglmnet, and I'd like to get involved. In terms of references, I'd probably be inclined to start with the Lukas Meier JRSSB 2008 paper, if only because I seem to recall that the R version of glmnet is based on it. If you have pointers, beyond what's on the documentation webpage, on how to get from Meier's paper to the current codebase, definitely feel free to share them.

jasmainak commented 5 years ago

If the R version of glmnet is based on it, I think it sounds like a great idea. What solver does it use?

By the way, if you are working with group lasso, it would be great if you could also suggest an example dataset. Our current example takes ages to finish ... not sure how much a better solver would help.

inferentialist commented 4 years ago

Hi Jas,

Sorry to have sat on this for the last two weeks. I've been caught up in other projects.

So, to get back to your question about solvers, my understanding is that the Meier 2008 paper makes the argument for block coordinate solvers because, for this class of problem, the penalty term is separable across groups, so each block sub-problem is relatively simple (rough sketch below). I think, at the time of publication, this was a departure from the idea of using a homotopy method to get solutions across all lambdas.
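To make that concrete, here is a rough sketch of one block coordinate descent sweep, written for a squared-error loss to keep it short. This is not pyglmnet code; as I recall, Meier et al. handle the logistic case with a quadratic approximation of the log-likelihood for each block.

import numpy as np

def block_soft_threshold(z, threshold):
    # shrink the whole block toward zero; the block becomes exactly zero
    # once its norm drops below the threshold, which is where the group
    # sparsity comes from
    z_norm = np.linalg.norm(z)
    if z_norm <= threshold:
        return np.zeros_like(z)
    return (1.0 - threshold / z_norm) * z

def bcd_sweep(X, y, beta, groups, reg_lambda, step=1.0):
    # one sweep of proximal block coordinate descent for
    # 0.5 * ||y - X @ beta||^2 / n + reg_lambda * sum_g sqrt(p_g) * ||beta_g||
    n = X.shape[0]
    for g in np.unique(groups):
        idx = np.where(np.asarray(groups) == g)[0]
        grad_g = -X[:, idx].T @ (y - X @ beta) / n   # smooth-loss gradient on block g
        beta[idx] = block_soft_threshold(
            beta[idx] - step * grad_g,
            step * reg_lambda * np.sqrt(len(idx)))
    return beta

Because the penalty separates across blocks, each update only touches one group's coefficients, and the unpenalized intercept can be updated on its own within the same sweep.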

In terms of examples, one thing I've found helpful in the past is to start with synthetic data. For these types of problems, I can think of two characteristics that are worth replicating in a simulation. First, the lasso is often applied to categorical data, which leads to sparse design matrices. Second, lasso methods shine when p > n and most of the covariates are noisy relative to the response variable.
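Something along these lines is what I have in mind; the sizes, number of groups, and signal structure below are arbitrary choices, just to get sparse one-hot groups with p > n and only a couple of informative groups:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, n_cat, n_levels = 200, 60, 10   # p = 600 dummy columns > n = 200 rows

# one-hot encoded categorical design matrix (sparse columns)
cats = pd.DataFrame({f"cat{j}": rng.integers(0, n_levels, size=n).astype(str)
                     for j in range(n_cat)})
X = pd.get_dummies(cats, prefix_sep="_").astype(float)
grp_id = [int(c.split("_")[0][3:]) + 1 for c in X.columns]

# only the first two groups carry signal; the rest are pure noise
beta = np.zeros(X.shape[1])
signal = [i for i, g in enumerate(grp_id) if g <= 2]
beta[signal] = rng.normal(scale=1.0, size=len(signal))

logits = X.values @ beta - 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))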

I suppose the last thing that comes to mind is to ask about your current working example. There's the community crime dataset in the repo, so that'd be my default assumption. If it's something else, and you are in a place to share it, let's sync on that. I also feel like I saw somewhere that you were considering a JSS submission. I haven't tracked that journal in a while, so maybe they welcome v2.0 articles, but if we can make some additional progress here, I'd expect to be involved in that publication as a contributing author.

Looking forward to hearing your thoughts!

Dustin

jasmainak commented 4 years ago

hi @inferentialist I had a brief look at the paper you mention and it sounds promising, but it will take some work to integrate into the current codebase. I'd be happy to add you as a contributing author, of course; the criteria will be quite liberal anyway, and if you help integrate these block coordinate descent solvers, it would certainly be a sizable contribution.

I would start small with a WIP pull request so you don't spend too much time going in the wrong direction and we can iterate together. The current group lasso example (originally implemented by @evadyer) takes over an hour to run! I think we use the same data as in the paper you cite, so comparison between packages should be easy.

and before you dig any further, let me also point you to this open PR: https://github.com/glm-tools/pyglmnet/pull/226/files. Feel free to review and/or try it. I see that the paper uses line search; I wonder if it already solves some of these problems.

pavanramkumar commented 4 years ago

Hey @inferentialist thanks for your thoughtful suggestions regarding alternate solvers. We've recently improved our convergence criteria and cut a new release on PyPI along with various other improvements. You can pip install the latest version. Let us know whether it resolves your convergence problems.
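For instance, a quick sanity check that the upgraded release is the one being imported (assuming the installed package exposes __version__, as the PyPI builds do):

import pyglmnet
print(pyglmnet.__version__)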

As far as contributions for future releases go, we'd love to make our coordinate descent solver faster: in theory it has better convergence properties than batch gradient, but that isn't the case in practice simply because our current implementation is pure Python. Translating the solver to Cython is a potential enhancement. We're also happy to explore the block-coordinate direction if that's your interest.