osorensen / hdme

R-package containing penalized regression methods for High-Dimensional Measurement Error problems (errors-in-variables)
GNU General Public License v3.0

Corrected lasso solution is poor for data with small measurement error #47

Closed YuxiaoRuoyao closed 1 year ago

YuxiaoRuoyao commented 1 year ago

Hi,

Thanks for developing this package and its methods. I'm trying to use the package to do variable selection with the corrected lasso. For my data, the measurement error is quite small, roughly 1e-4 to 1e-6, but I found that the coefficient estimates from corrected_lasso were far from the true values, and cv_corrected_lasso also could not do a good job of finding the best radius.

For instance, I created example data and analyzed it with the following code:

library(hdme)

create_example_data <- function(n, p, sdX = 0.01, sdU = 1e-4, 
                                sdEpsilon = 0.01, family = "gaussian") {
  # Independent true covariates with mean zero and standard deviation sdX
  X <- matrix(rnorm(n * p, sd = sdX), nrow = n, ncol = p)
  # if Gaussian, response has standard deviation sdEpsilon, and zero intercept
  # if binomial, response is binomial with mean (1 + exp(-X %*% beta))^(-1)
  # True coefficient vector with two non-zero elements
  beta <- c(1, 0.5)

  if(family == "gaussian") {
    y <- X %*% beta + rnorm(n, sd = sdEpsilon)
  } else if (family == "binomial") {
  } else if (family == "binomial") {
    # Need an amplification in the binomial case
    beta <- beta * 3
    y <- rbinom(n, size = 1, prob = (1 + exp(-X %*% beta))**(-1))
  }

  # The measurements W have mean X and standard deviation sdU. 
  # We assume uncorrelated measurement errors
  W <- X + matrix(rnorm(n * p, sd = sdU), nrow = n, ncol = p)

  return(list(X = X, W = W, y = y, beta = beta, sigmaUU = diag(p) * sdU))  
}
n <- 170
p <- 2
set.seed(1000)
ll <- create_example_data(n, p)
corrected_fit <- corrected_lasso(W = ll$W, y = ll$y, sigmaUU = ll$sigmaUU)
plot(corrected_fit, type = "path")
corrected_fit$betaCorr

The results showed that the estimated betas were about 0.0001 at the largest radius, which is definitely not correct, since the true values are 1 and 0.5. The radii in this case ranged from 0.003 to 3.131, and since the L1 norm of the true beta is 1.5, we would expect the optimal solution to be found. So I'm quite confused about why this happens and what the problem is. Does the projected gradient algorithm simply not do a good job in this case?

Besides, I found that the estimates change if I rescale the data, like this:

corrected_lasso(W = 100*ll$W, y = 100*ll$y, sigmaUU = 100*ll$sigmaUU)$betaCorr

After multiplying the data by 100, the estimates look more reasonable. Is that a possible solution? Thanks for any suggestions!
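
For reference, here is a sketch of what I would expect a fully consistent rescaling to look like. I'm assuming sigmaUU is a covariance matrix (as the argument name suggests), so it should pick up the square of the scaling factor rather than the factor itself:

# Multiplying W and y by the same constant leaves the coefficients of y ~ W
# unchanged, while a covariance matrix scales with the square of the constant.
sc <- 100
corrected_lasso(W = sc * ll$W, y = sc * ll$y, sigmaUU = sc^2 * ll$sigmaUU)$betaCorr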

Best, Ruoyao

osorensen commented 1 year ago

Thanks for notifying me about this, @YuxiaoRuoyao. It does indeed look like a possible bug.

As an additional check, I tested standard linear regression with the data you generated, and it gave sensible coefficient estimates:

> # Linear regression, for comparison
> coef(lm(y ~ W, data = ll))
  (Intercept)            W1            W2 
-0.0001733668  1.0844253741  0.4923043563 

The corrected_lasso should not be far from this in such a simple test case. I will look into this over the coming weeks, and keep you updated in this issue.

YuxiaoRuoyao commented 1 year ago

Thank you very much! It would be really helpful for my work if this problem could be solved. Let me know if there are any updates.

osorensen commented 1 year ago

Here is an even simpler example reproducing the issue without any measurement error:

library(hdme)
set.seed(1)
n <- 100
p <- 3
beta <- c(1, 0, 0)
X <- matrix(runif(n * p), nrow = n, ncol = p)
y <- X %*% beta + rnorm(n, sd = .01)

# Large enough radius to avoid any penalization
fit <- corrected_lasso(X, y, matrix(0, nrow = p, ncol = p), radii = 1000)
fit$betaCorr
#>             [,1]
#> [1,] 0.548153232
#> [2,] 0.003790501
#> [3,] 0.052759569

coef(lm(y ~ 0 + X))
#>           X1           X2           X3 
#>  1.000122470 -0.001765039  0.001786653

Created on 2023-05-10 with reprex v2.0.2

osorensen commented 1 year ago

The error is caused by this commit. The fix in the commit itself is correct, but since the tolerance parameter is hard-coded to 1e-4, the algorithm stops while still very far from convergence. With covariates on the scale of sdX = 0.01, the objective is tiny relative to this absolute tolerance, which presumably also explains why multiplying the data by 100 helped.

osorensen commented 1 year ago

@YuxiaoRuoyao, thanks again for discovering this bug! It is now fixed on the develop branch. I will fix the rest and upload a corrected version to CRAN within a few days.

osorensen commented 1 year ago

Sorry, I ran into some issues with my GitHub setup, so I cannot push the fix yet, and I have to run now, but I will fix it tonight.

YuxiaoRuoyao commented 1 year ago

Thank you very much! I'll try the fixed version on my data and share the results with you!

osorensen commented 1 year ago

On the current develop branch, the example by @YuxiaoRuoyao now runs as follows, returning correct values:

library(hdme)
# create_example_data() exactly as defined in the opening comment above
n <- 170
p <- 2
set.seed(1000)
ll <- create_example_data(n, p)
corrected_fit <- corrected_lasso(W = ll$W, y = ll$y, sigmaUU = ll$sigmaUU)
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
#> [1] "Max iterations"
plot(corrected_fit, type = "path")

corrected_fit$betaCorr
#>             [,1]       [,2]       [,3]      [,4]      [,5]      [,6]      [,7]
#> [1,] 0.003131308 0.05575060 0.10842009 0.1611412 0.2139154 0.2667442 0.3196289
#> [2,] 0.000000000 0.02931344 0.05840531 0.0872748 0.1159211 0.1443433 0.1725407
#>           [,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
#> [1,] 0.3725711 0.4255723 0.4786340 0.5317576 0.5849447 0.6381968 0.6915155
#> [2,] 0.2005123 0.2282573 0.2557749 0.2830640 0.3101239 0.3369537 0.3635523
#>          [,15]     [,16]     [,17]     [,18]     [,19]     [,20]
#> [1,] 0.7449024 0.7983589 0.8518866 0.9054872 0.9591623 1.0129134
#> [2,] 0.3899190 0.4160527 0.4419525 0.4676174 0.4930465 0.5182388

Created on 2023-05-10 with reprex v2.0.2

osorensen commented 1 year ago

The "Max iterations" warnings are a bit worrying, and in addition there seems to be something wrong with the non-Gaussian GLM case, so I will keep this issue open and fix these issues before submitting an update to CRAN.

osorensen commented 1 year ago

@YuxiaoRuoyao, you probably know this, but you can test the updated package by installing it with the following command:

# install.packages("devtools")
devtools::install_github("osorensen/hdme", build_vignettes = TRUE)

osorensen commented 1 year ago

Closing this one and opening a new issue for GLM.

YuxiaoRuoyao commented 1 year ago

@osorensen Hello, I'm trying to test the new version on my data, and I find that the cross-validation part still has some issues. Basically, the loss does not converge to the minimum, so the selected lambda is also not correct. The following example is basically the same as the previous one, but with two null variables added.

# create_example_data() as defined in the opening comment above, except with
# two null variables added:
#   beta <- c(1, 0.5, 0, 0)
n <- 170
p <- 4
set.seed(1000)
ll <- create_example_data(n, p)
test_cv <- cv_corrected_lasso(ll$W, ll$y, ll$sigmaUU)
plot(test_cv)
coef(corrected_lasso(ll$W, ll$y, ll$sigmaUU, radii = test_cv$radius_1se))
corrected_lasso(ll$W, ll$y, ll$sigmaUU)$betaCorr

I see that the plot of the cross-validation results is very weird, and the result using radius_1se from CV is also not reasonable, whereas directly using corrected_lasso gives the right answer. So do you think it's reasonable for me to use the corrected_lasso result at the last radius instead of this cross-validation procedure? But then it does not give exact zero coefficients, even though we have two null variables here.

Thank you!

osorensen commented 1 year ago

Hi @YuxiaoRuoyao. Thanks for testing!

On the development version, I get this plot when running your code above, and I think it makes sense:

[Plot: cross-validation loss as a function of radius; the loss decreases toward, and is minimized at, the largest radius]

The loss is minimized at the largest possible radius, which implies minimal constraints. Thinking in terms of lambda for the standard lasso, a high radius is the same as a small lambda.
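
To spell out the analogy, here is a sketch of the optimization problem in the notation of Loh & Wainwright (2012), cited below (assuming the additive measurement error setting used here):

$$\min_{\|\beta\|_1 \le R} \; \frac{1}{2}\beta^\top \hat\Sigma \beta - \hat\rho^\top \beta, \qquad \hat\Sigma = \frac{1}{n} W^\top W - \Sigma_{UU}, \qquad \hat\rho = \frac{1}{n} W^\top y.$$

The Lagrangian form replaces the constraint $\|\beta\|_1 \le R$ with a penalty $\lambda \|\beta\|_1$ added to the objective, so a larger radius $R$ corresponds to a smaller $\lambda$, and the largest radius is closest to an unpenalized fit.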

My guess is that corrected_lasso(ll$W, ll$y, ll$sigmaUU)$betaCorr gives the right answer because it runs through 100 radii, starting from 0, which makes it easier for the algorithm to converge. On the other hand, corrected_lasso(ll$W, ll$y, ll$sigmaUU, radii = test_cv$radius_1se) uses only a single radius, and hence the algorithm does not converge (see also the workaround sketch after the output below).

We can confirm this with the following line, where I manually specify that it should run longer. The results now make sense:

> coef(corrected_lasso(ll$W,ll$y,ll$sigmaUU, radii = test_cv$radius_1se, maxits = 100000))
[1] "Max iterations"
Non-zero coefficients:
 coefficient    estimate
           1  1.07256125
           2  0.48573255
           3 -0.02697419
           4  0.07091974
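
Based on the warm-start guess above, another possible workaround is to pass an increasing grid of radii ending at the CV-selected one, and keep the estimates at the final radius. This is only a sketch, and untested; whether the solver actually warm-starts across the radii vector is my guess:

# Run over an increasing grid of radii ending at the CV-selected radius,
# then keep the coefficient estimates at the final (target) radius.
radii_grid <- seq(0, test_cv$radius_1se, length.out = 100)
grid_fit <- corrected_lasso(ll$W, ll$y, ll$sigmaUU, radii = radii_grid)
beta_hat <- grid_fit$betaCorr[, length(radii_grid)]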

So I think you'll have to test different parameters for the algorithms, to make sure they converge.
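
For instance, something along these lines, where the growth schedule and the tolerance are arbitrary illustrative choices (a sketch, not a definitive recipe):

# Sketch of a simple convergence check: refit with a growing iteration cap
# until the coefficient estimates stop changing between refits.
fit_until_stable <- function(W, y, sigmaUU, radius,
                             maxits = 5000, tol = 1e-6, tries = 5) {
  fit <- corrected_lasso(W, y, sigmaUU, radii = radius, maxits = maxits)
  for (i in seq_len(tries)) {
    maxits <- 10 * maxits
    refit <- corrected_lasso(W, y, sigmaUU, radii = radius, maxits = maxits)
    if (max(abs(refit$betaCorr - fit$betaCorr)) < tol) return(refit)
    fit <- refit
  }
  warning("Estimates did not stabilize; consider rescaling the data.")
  fit
}

stable_fit <- fit_until_stable(ll$W, ll$y, ll$sigmaUU, test_cv$radius_1se)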

Unfortunately, it is very hard to automate the choice of these algorithm parameters, so for now this is something the user has to set. I recommend the following papers for understanding how the algorithm works:

Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). Efficient projections onto the ℓ1-ball for learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning, 272-279.

Loh, P.-L., & Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3), 1637-1664. http://www.jstor.org/stable/41713688

Sørensen, Ø., Frigessi, A., & Thoresen, M. (2015). Measurement error in lasso: Impact and likelihood bias correction. Statistica Sinica, 25(2), 809-829. http://www.jstor.org/stable/24311046

YuxiaoRuoyao commented 1 year ago

@osorensen Thanks for replying. Basically, the problem is that the algorithm sometimes does not converge, so we need to input a pretty large iteration number, at the cost of a longer running time. I also see that the "Max iterations" warning appears because the default number of iterations is 5000, but apparently the algorithm does not converge within 5000 iterations under the smaller tolerance. So for my problem, I just need to set a large enough maxits parameter to let the algorithm converge.

Is my understanding correct? If not, I'd appreciate any other suggestions. Thank you!

osorensen commented 1 year ago

Yes, @YuxiaoRuoyao, I think your understanding is correct.

osorensen commented 1 year ago

@YuxiaoRuoyao, the updated version of the package is now on CRAN. I also wanted to add that if you find any good ways of solving these convergence issues, that would be very valuable, and you are welcome to create a pull request.

Thanks again for raising these issues!

YuxiaoRuoyao commented 1 year ago

@osorensen, thanks for notifying me. Is the updated CRAN version the same as the previous development version?

Since I need to pick a proper radius, cross-validation is necessary. I'm still wondering whether cv_corrected_lasso can give reliable loss results when the algorithm does not converge within each fold (I see it reports "Max iterations" even when I input a very large maxits number). Also, the running time for this step is very long; is there any way to speed it up? I'm trying 5 folds and 20 radii, but I'm worried this will bias the results.

Thank you!!

Best, Ruoyao