pbreheny / ncvreg

Regularization paths for SCAD- and MCP-penalized regression models
http://pbreheny.github.io/ncvreg

Changed max.iter to 5000 for better chance of convergence #4

Closed · nanxstats closed this 7 years ago

nanxstats commented 7 years ago

In some cases, ncvsurv() can have a harder time converging under the new RMSD criterion.

For example:

library("ncvreg")
library("hdnom")
library("survival")

# first 150 obs
data("smart")
x = as.matrix(smart[, -c(1, 2)])[1:150, ]
time = smart$TEVENT[1:150]
event = smart$EVENT[1:150]
y = Surv(time, event)

# MCP: with the default max.iter (1000), the first fit no longer converges
# under the new criterion; max.iter = 5000 gives it a better chance
fit = ncvsurv(x, y, penalty = "MCP", alpha = 1, gamma = 2)
fit = ncvsurv(x, y, penalty = "MCP", alpha = 1, gamma = 2, max.iter = 5000)

# SCAD: same pattern
fit = ncvsurv(x, y, penalty = "SCAD", alpha = 1, gamma = 3.7)
fit = ncvsurv(x, y, penalty = "SCAD", alpha = 1, gamma = 3.7, max.iter = 5000)

The 1st and 3rd fits used to converge under the old criterion.

This PR increases the default value of max.iter from 1000 to 5000, which gives these fits a better chance of converging under the new criterion. Since the iterations here are really fast, this change is unlikely to have a significant performance impact.

pbreheny commented 7 years ago

Interesting example. I have several thoughts on this.

  1. The biggest difference between the new convergence criterion and the old one is that the old one was relative to the size of the coefficient. In your example, we have two coefficients blowing up to absurd values: LDL is getting a hazard ratio of 300 while CHOL is being estimated with a hazard ratio of 200 in the protective direction. So relatively speaking, going from a coefficient of 6 to 6.02 is a small change, but in absolute terms it's still pretty big.
  2. My initial reaction is that I want ncvreg to fail here. There's a big problem with this model -- CHOL and LDL are basically the same thing, so it makes no sense to adjust one for the other, resulting in huge multicollinearity and clearly unrealistic estimates. The person running the program should absolutely be made aware of this. If I were analyzing this data set, my fix would be to set lambda.min=0.1 rather than increase max.iter (see the sketch after this list). The data simply does not support reliable estimation for lambdas smaller than this.
  3. I'm rather reluctant to increase max.iter; sure, in this case, it's still fast for max.iter=5000, but for high-dimensional data, this is likely to result in enormous amounts of time being wasted simply to obtain meaningless estimates as in (2) above.
  4. That said, there's really just one problematic point in this coefficient path: the point where the model tries to include both LDL and CHOL and adjust one for the other. I've always checked the number of iterations separately for each value of lambda; alternatively, I could do what glmnet does and restrict only the total (cumulative sum) iterations over all lambdas. That would work here, as one lambda value requires ~4,000 iterations, but all the others converge pretty quickly, so it's not like there's a problem with the whole path, just the one value. Hmm, I don't know, I can see pros and cons, but this is something I might change in ncvreg 3.8-0.
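To make (2) and (4) concrete, here is a minimal sketch on the example above. It assumes the same smart data, and that the fitted object's $iter component holds the per-lambda iteration counts:

library("ncvreg")
library("hdnom")
library("survival")

data("smart")
x = as.matrix(smart[, -c(1, 2)])[1:150, ]
y = Surv(smart$TEVENT[1:150], smart$EVENT[1:150])

# Fit the full path with a higher iteration limit just to inspect it:
# the per-lambda iteration counts show that a single lambda value needs
# ~4,000 iterations while all the others converge quickly.
fit_full = ncvsurv(x, y, penalty = "MCP", gamma = 2, max.iter = 5000)
fit_full$iter

# The fix I would actually use: stop the path before the region where
# both CHOL and LDL enter the model and the estimates blow up.
fit = ncvsurv(x, y, penalty = "MCP", gamma = 2, lambda.min = 0.1)
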
nanxstats commented 7 years ago

Hi Patrick, thanks for the concise analysis of this dataset. I agree with your point 2 in particular --- it is important to do "the right thing" and give users appropriate warnings instead of pretending everything is going well, so keeping max.iter = 1000 sounds reasonable to me.

Point 4 also raises a very good question about the warning itself. Changing the checking strategy is one possible solution, or maybe just printing a more informative message about the issue, so users get some clue about what actually happened and can dig further into their data.
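For instance, here is a rough sketch of the kind of message I have in mind, not a proposal for the actual implementation (the helper name is made up, and it assumes the fitted object exposes $iter and $lambda as above):

# Hypothetical post-fit check: report which lambda values hit the iteration
# limit instead of reaching the RMSD tolerance, so the user can see that
# only part of the path is problematic.
report_nonconverged = function(fit, max.iter = 1000) {
  bad = which(fit$iter >= max.iter)
  if (length(bad) > 0) {
    warning("max.iter reached at lambda = ",
            paste(signif(fit$lambda[bad], 3), collapse = ", "),
            "; estimates at these values may be unreliable")
  }
  invisible(fit$lambda[bad])
}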