hazimehh / L0Learn

Efficient Algorithms for L0 Regularized Learning

L0 and L0L2 results from the CDPSI(1) method are confusing #31

vtshen closed this issue 6 years ago

vtshen commented 6 years ago

The L0 and L0L2 results from CDPSI(1) are confusing to me. The problem could be either my usage of the package or my poor understanding of the L0 and L0L2 methods, so I am posting to ask for some help in order to avoid using the package incorrectly.

First, please correct me if I am wrong: I used the arguments algorithm="CDPSI" and maxSwaps=1 to run the CDPSI(1) method mentioned in the paper.

fit_l0 = L0Learn.fit(x, y, penalty="L0", algorithm="CDPSI", nLambda=50, maxIters = 1e4, maxSwaps=1, maxSuppSize = max(50,ncol(x),nrow(x)))

The graphs below compare the numbers of nonzero coefficients and the risk errors of the L0 and L0L2 methods in the package. The tuning parameters are selected on a validation data set.

In this setting, the plots show that L0L2 has a larger risk error than L0, and that L0L2 can sometimes select fewer nonzero coefficients than L0. But why would L0L2 select fewer nonzero coefficients?

(plots updated based on comments)

[Plot: left panel, risk error per replication; right panel, number of nonzero coefficients per replication; L0 in blue circles, L0L2 in red crosses]

R code to generate the plots (updated based on comments):

# https://github.com/hazimehh/L0Learn/issues/31
library(L0Learn)

# Risk of each solution on the path: prediction risk of the slope estimate
# plus the squared intercept (the true model has a zero intercept).
compute_risk = function(betahat, beta_true, Sigma){
  betahat0 = betahat[1,]   # intercept row of the coefficient matrix
  betahat = betahat[-1,]   # slope coefficients

  delta = betahat - beta_true
  risk = diag(t(delta) %*% Sigma %*% delta)
  risk = risk + betahat0^2
  return(risk)
}
l0l2_nonzero_path = l0_nonzero_path = l0l2_risk_path = l0_risk_path = c()

rep = 30

for (i in 1:rep) {
  # simulate x and y
  nval = n = 100
  p = 10
  rho = 0.35
  snr = 3
  s = 5
  x = matrix(rnorm(n*p),n,p)
  xval = matrix(rnorm(n*p),n,p)

  # Introduce autocorrelation
  if (rho != 0) {
    inds = 1:p
    Sigma = rho^abs(outer(inds, inds, "-"))
    obj = svd(Sigma)
    Sigma.half = obj$u %*% (sqrt(diag(obj$d))) %*% t(obj$v)
    x = x %*% Sigma.half
    xval = xval %*% Sigma.half
  }

  beta = rep(0,p)
  s = min(s,p)
  beta[1:s] = 1
  # set snr
  vmu = as.numeric(t(beta) %*% beta)
  sigma = sqrt(vmu/snr)
  y = as.numeric(x %*% beta + rnorm(n)*sigma)
  yval = as.numeric(xval %*% beta + rnorm(nval)*sigma)

  # fit_L0
  fit_l0 = L0Learn.fit(x, y, penalty="L0", algorithm="CDPSI", nLambda=50, maxIters = 1e4, maxSuppSize = min(50,ncol(x),nrow(x)))
  fit_l0$lambda     # inspect the lambda grid used
  fit_l0$converged  # check convergence at each lambda
  betahat_l0 = as.matrix(coef(fit_l0))
  pred = predict(fit_l0, xval)
  error_val_l0 = colMeans(as.matrix((pred - matrix(yval,ncol=1))^2))

  id = which.min(error_val_l0) # selection based on validation error

  nonzero = colSums(betahat_l0!=0)[id]
  l0_nonzero_path = c(l0_nonzero_path, nonzero)

  risk_val_l0 = compute_risk(betahat = betahat_l0, beta_true = beta, Sigma)
  l0_risk_path = c(l0_risk_path, risk_val_l0[id])

  # fit_L0L2
  # nGamma = 10, gammaMax = 10, gammaMin = 1e-04,
  fit_l0l2 = L0Learn.fit(x, y, penalty="L0L2", algorithm="CDPSI", nLambda=50, maxIters = 1e4, maxSuppSize = min(50,ncol(x),nrow(x)))
  betahat_l0l2 = as.matrix(coef(fit_l0l2))
  pred_l0l2 = predict(fit_l0l2, xval)
  error_val_l0l2 = colMeans(as.matrix((pred_l0l2 - matrix(yval,ncol=1))^2))

  id = which.min(error_val_l0l2) # selection based on validation error

  nonzero = colSums(betahat_l0l2!=0)[id]
  l0l2_nonzero_path = c(l0l2_nonzero_path, nonzero)

  risk_val_l0l2 = compute_risk(betahat = betahat_l0l2, beta_true = beta, Sigma)
  l0l2_risk_path = c(l0l2_risk_path, risk_val_l0l2[id])
}

# results
par(mfrow=c(1,2))
plot(l0_risk_path, ylim = c(min(l0_risk_path, l0l2_risk_path), max(l0_risk_path, l0l2_risk_path))
     , ylab = "Risk Error", xlab = "index of replication, L0 in blue circle, L0L2 in red cross"
     , main = "parameter selected based on minimum validation error", col = "blue")
points(l0l2_risk_path, pch = 3, col = "red" )

plot(l0_nonzero_path, ylim = c(min(l0_nonzero_path, l0l2_nonzero_path), max(l0_nonzero_path, l0l2_nonzero_path))
     , ylab = "Nonzero coefficient", xlab = "index of replication, L0 in blue circle, L0L2 in red cross"
     , main = "parameter selected based on minimum validation error", col = "blue")
points(l0l2_nonzero_path, pch = 3, col = "red" )
rahulmaz commented 6 years ago

It seems that you have not specified Gamma (i.e., the regularization parameter for the L2 term) when using the L0L2 penalty in your function call:

fit_l0l2 = L0Learn.fit(x, y, penalty="L0L2", algorithm="CDPSI", nLambda=50, maxSwaps = 1, maxIters = 1e4, maxSuppSize = max(50,ncol(x),nrow(x)))
betahat_l0l2 = as.matrix(coef(fit_l0l2))

Please have a look at the Vignette:

http://www.mit.edu/~hazimeh/L0Learn-vignette.html

on how to specify this or use the function properly.
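For instance, a minimal sketch of passing the gamma grid explicitly, using the same x and y as above and the nGamma/gammaMin/gammaMax parameters that the package exposes for this:

# A sketch of specifying the gamma grid explicitly rather than relying
# on the defaults (same x and y as above):
fit_l0l2 = L0Learn.fit(x, y, penalty = "L0L2", algorithm = "CDPSI",
                       nLambda = 50, nGamma = 10,
                       gammaMin = 1e-4, gammaMax = 10,
                       maxSuppSize = min(50, ncol(x), nrow(x)))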

L0 and L0L2 (with Gamma = 0) should return similar solutions on the same grid of lambda values. Since the lambda values are chosen automatically by default, you may want to override them by specifying your own grid of lambda values (see the Vignette on how to do this), as in the sketch below.
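A hedged sketch of overriding the default lambda values; this assumes a version of L0Learn.fit that accepts a user-supplied lambdaGrid (a list containing one decreasing vector of lambda values per gamma value), so check the Vignette for the exact interface of your installed version:

# Assumption: L0Learn.fit accepts `lambdaGrid`, a list with one decreasing
# vector of lambda values per gamma value (a single vector for penalty="L0").
userLambda = list(exp(seq(log(1e-1), log(1e-4), length.out = 50)))
fit_l0_grid = L0Learn.fit(x, y, penalty = "L0", lambdaGrid = userLambda)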

Usually, a positive regularization weight on the L2 term (in L0L2) leads to additional shrinkage, so the coefficient values will be shrunk compared to those obtained without L2 regularization (assuming both penalties lead to the same support).
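As a rough illustration of this shrinkage effect (a sketch, not part of the original discussion): pick a support size that both the L0 and L0L2 paths happen to visit and compare the l2 norms of the corresponding coefficient vectors.

# Hedged sketch: compare coefficient norms at a common support size.
# Assumes fit_l0 and fit_l0l2 from the code above and that both paths
# visit at least one common support size.
supp_l0   = colSums(as.matrix(coef(fit_l0))[-1, , drop = FALSE] != 0)
supp_l0l2 = colSums(as.matrix(coef(fit_l0l2))[-1, , drop = FALSE] != 0)
k  = intersect(supp_l0, supp_l0l2)[1]
b1 = as.matrix(coef(fit_l0))[-1, which(supp_l0 == k)[1]]
b2 = as.matrix(coef(fit_l0l2))[-1, which(supp_l0l2 == k)[1]]
c(L0 = sqrt(sum(b1^2)), L0L2 = sqrt(sum(b2^2)))  # L0L2 norm is typically smaller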

hazimehh commented 6 years ago

Also, to add to Rahul's comment: L0Learn currently supports only CDPSI(1) (no higher-order swaps are supported). The parameter maxSwaps is an upper bound on the number of iterations performed by the swapping algorithm (specifically, in the paper, this is the maximum number of iterations performed by the for loop in Algorithm 2). The current default value for maxSwaps is 100, and we don't recommend setting it lower, since small values like 1 will typically not lead to CDPSI(1) minima.

vtshen commented 6 years ago

Hi Prof. Rahul Mazumder and Hussein, thanks so much for the quick response. I am now using the code below to run the L0L2 method, and the plots look better.

Please correct me if I am wrong: the reason I did not specify gamma is that I thought I could use the default values, which are nGamma = 10, gammaMax = 10, and gammaMin = 1e-04.

fit_l0l2 = L0Learn.fit(x, y, penalty="L0L2", algorithm="CDPSI", nLambda=50, maxIters = 1e4, maxSuppSize = min(50,ncol(x),nrow(x)))

I have more settings to test, and I will investigate those cases as well.

hazimehh commented 6 years ago

That's great, and thanks for the feedback! The gamma parameter range for L2 regularization is very problem dependent. It might be a good idea to start with a wide range for gamma and plot the cross-validation errors; based on the shape of the plot, you can then select a narrower range and rerun cross-validation on it. The default values of gamma supplied in the package are generally not optimal for a given problem -- I will make this point clearer in the next update to the vignette.
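A sketch of that workflow; this assumes L0Learn.cvfit with nGamma/gammaMin/gammaMax arguments and a cvMeans field containing one vector of CV errors per gamma value (see the Vignette for the exact interface):

# Hedged sketch: cross-validate over a wide gamma range, inspect the CV
# errors, then narrow the range and rerun.
cvfit = L0Learn.cvfit(x, y, penalty = "L0L2", nFolds = 5,
                      nGamma = 20, gammaMin = 1e-6, gammaMax = 100)
best_cv = sapply(cvfit$cvMeans, min)   # best CV error for each gamma value
plot(log10(cvfit$fit$gamma), best_cv, type = "b",
     xlab = "log10(gamma)", ylab = "min CV error")
# Narrow gammaMin/gammaMax around the minimum and rerun L0Learn.cvfit.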

vtshen commented 5 years ago

Another question about the values of gamma: in my case, cross-validation is not a convenient way to determine the range of gamma.

Please feel free to correct me: I am assuming that, inside the package code, the value of gamma is not adjusted or scaled based on the sample size n.

In that case, if I have two candidates for the maximum value of gamma, max(abs(xty))/nrow(x) and max(abs(xty)), which one do you think is more reasonable?
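Written out in code, the two candidates are as follows (assuming xty denotes t(x) %*% y, which the question does not define explicitly):

# The two candidate values for the maximum gamma, assuming xty = t(x) %*% y:
xty = t(x) %*% y
gamma_max_unscaled = max(abs(xty))             # max_j |x_j' y|
gamma_max_scaled   = max(abs(xty)) / nrow(x)   # same quantity scaled by n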

hazimehh commented 5 years ago

Before fitting, the package normalizes both y and the columns of X to have unit l2 norm. The values of lambda/gamma used (i.e., those specified by the user or selected by the package) are with respect to the normalized data.
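For reference, a minimal sketch of that kind of normalization (not the package's internal code):

# Scale y and each column of X to have unit l2 norm (a sketch of the
# normalization described above, not the package's internal implementation):
normalize_unit_l2 = function(X, y) {
  Xn = sweep(X, 2, sqrt(colSums(X^2)), "/")   # each column: unit l2 norm
  yn = y / sqrt(sum(y^2))
  list(X = Xn, y = yn)
}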