tdhock / necromass


fusion reading list #4

Open tdhock opened 3 months ago

tdhock commented 3 months ago

https://arxiv.org/pdf/1611.00953.pdf L1 on weights + L2 squared fusion between all groups https://cloud.r-project.org/web/packages/fuser/vignettes/subgroup_fusion.html

https://rdrr.io/cran/genlasso/man/fusedlasso.html possible to implement L1 on weights + L1 fusion between pairs of weights in different groups, if we create large matrix X with lots of 0 (maybe tricky to code the graph correctly)
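For reference, a minimal sketch of that "large X with lots of 0" construction (design sizes, coefficient values, and the two-group setup are made up for illustration; the graph edges are the tricky part mentioned above):

```r
# Sketch of L1-on-weights + L1 fusion via genlasso::fusedlasso:
# each group gets its own copy of the coefficients (block-diagonal X),
# and graph edges fuse corresponding coefficients across groups.
library(genlasso)
library(igraph)
library(Matrix)
set.seed(1)
n <- 50; p <- 3                       # observations per group, features
X1 <- matrix(rnorm(n * p), ncol = p)  # group 1 design
X2 <- matrix(rnorm(n * p), ncol = p)  # group 2 design
X.big <- as.matrix(bdiag(X1, X2))     # coefs 1:3 = group 1, 4:6 = group 2
y <- c(X1 %*% c(1, 0, -1), X2 %*% c(1.2, 0, -0.8)) + rnorm(2 * n, sd = 0.1)
# edge j -- (j + p) gives L1 fusion between feature j's weights in the two groups
fuse.graph <- make_graph(as.vector(rbind(1:p, 1:p + p)), n = 2 * p, directed = FALSE)
fit <- fusedlasso(y = y, X = X.big, graph = fuse.graph, gamma = 1)  # gamma > 0 adds L1 on weights
coef(fit, lambda = 0.1)$beta  # all 2*p coefficients at a given lambda
```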

?? Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010 ??

try implementing new learner in mlr3 like this https://github.com/mlr-org/mlr3learners/blob/main/R/LearnerRegrKKNN.R with auto_tuner https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html#sec-autotuner
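A rough sketch of what such a learner could look like, modeled on the LearnerRegrKKNN pattern above (the fuser::fusedLassoProximal call, the G matrix handling, and the model list structure are assumptions that would need to be checked against the fuser docs):

```r
# Hypothetical skeleton of an mlr3 learner wrapping fuser; the fuser call
# signature and group handling are assumptions, not a tested implementation.
library(R6)
library(mlr3)
library(paradox)

LearnerRegrFuser = R6Class("LearnerRegrFuser",
  inherit = LearnerRegr,
  public = list(
    initialize = function() {
      super$initialize(
        id = "regr.fuser",
        param_set = ps(
          lambda = p_dbl(lower = 0, default = 0.01, tags = "train"),
          gamma = p_dbl(lower = 0, default = 0.01, tags = "train")
        ),
        feature_types = c("logical", "integer", "numeric"),
        predict_types = "response",
        packages = c("mlr3", "fuser")
      )
    }
  ),
  private = list(
    .train = function(task) {
      pv = self$param_set$get_values(tags = "train")
      X = as.matrix(task$data(cols = task$feature_names))
      y = task$data(cols = task$target_names)[[1L]]
      groups = task$groups$group  # assumes a column with the "group" role
      k = length(unique(groups))
      G = matrix(1, k, k)  # full information sharing between all groups
      beta = fuser::fusedLassoProximal(
        X, y, groups, lambda = pv$lambda, gamma = pv$gamma, G = G)
      list(beta = beta, group.levels = unique(groups))
    },
    .predict = function(task) {
      X = as.matrix(task$data(cols = task$feature_names))
      groups = task$groups$group
      j = match(groups, self$model$group.levels)  # beta column per test row
      response = rowSums(X * t(self$model$beta[seq_len(ncol(X)), j, drop = FALSE]))
      list(response = response)
    }
  )
)
```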

@EngineerDanny

tdhock commented 3 months ago

please read my blog about mlr3 hyper-parameter auto_tuner https://tdhock.github.io/blog/2024/hyper-parameter-tuning/

EngineerDanny commented 2 months ago

I have created a prototype fuser learner here -> https://github.com/EngineerDanny/necromass/blob/main/LearnerRegrFuser.Rmd There are some questions that need to be answered for it to be fully compatible with the framework.

EngineerDanny commented 2 months ago

Other public microbiome datasets with different groups -> https://github.com/twbattaglia/MicrobeDS

tdhock commented 2 months ago

great start for LearnerRegrFuser. I would suggest sending an issue to the fuser package authors, to tell them you are working on an mlr3 interface, and maybe to ask whether one is already implemented elsewhere. Eventually it would be good to move that code from Rmd to an R package that you can submit to CRAN, maybe named mlr3fuser, similar to https://github.com/tdhock/mlr3resampling

EngineerDanny commented 1 month ago

@tdhock I have opened the issue about the mlr3 interface for fuser here. I have been trying to fix the error below, but so far no success. Maybe you can help with the auto-tuner?

Error: <LearnerRegrFuser:regr.fuser> cannot be trained with TuneToken 
present in hyperparameter: lambda

When I run this:

if(require(future))plan("multisession")
bench.result <- mlr3::benchmark(bench.grid, store_models = TRUE)

This is the instance of the class after applying the tuner on lambda.

<LearnerRegrFuser:regr.fuser>: Fuser
* Model: -
* Parameters: lambda=<RangeTuneToken>, gamma=0.01, tol=9e-05, num.it=5000, intercept=TRUE, scaling=FALSE
* Packages: mlr3, mlr3learners, fuser
* Predict Types:  [response]
* Feature Types: logical, integer, numeric
* Properties: -

UPDATE: I have been able to fix it. To apply the auto-tuner, the entry in the learner list has to be an mlr3tuning::auto_tuner object wrapping the fuser learner, rather than the plain fuser learner itself.

EngineerDanny commented 1 month ago

@tdhock I have this issue, could you help?

Error: Cannot combine stratification with grouping

This is the R code:

N <- 300
abs.x <- 20
set.seed(1)
x.mat <- matrix(runif(N * 3, -abs.x, abs.x), ncol = 3)  # Ensure X has more than two features
colnames(x.mat) <- paste0("feature", 1:3)

library(data.table)
(task.dt <- data.table(
  x = x.mat,
  y = sin(rowSums(x.mat)) + rnorm(N, sd = 0.5)
))

# Create a grouping variable
task.dt[, sample_group := rep(1:3, length.out = .N)]

# Check the distribution of groups
table(group.tab <- task.dt$sample_group)

# Create a regression task with the grouping variable
reg.task <- mlr3::TaskRegr$new("sin", task.dt, target = "y")
group.task <- reg.task$set_col_roles("sample_group", c("group", "stratum"))

same_other_cv <- mlr3resampling::ResamplingSameOtherCV$new()
same_other_cv$param_set$values$folds <- 2

fuser.learner = lrn("regr.fuser")
#fuser.learner$param_set$values$num.it <- paradox::to_tune(1, 100)
fuser.learner$param_set$values$lambda <- paradox::to_tune(0.001, 1, log=TRUE)
#fuser.learner$param_set$values$gamma <- paradox::to_tune(0.001, 1, log=TRUE)
subtrain.valid.cv <- mlr3::ResamplingCV$new()
subtrain.valid.cv$param_set$values$folds <- 2
grid.search.5 <- mlr3tuning::TunerGridSearch$new()
grid.search.5$param_set$values$resolution <- 5
fuser.learner.tuned = mlr3tuning::auto_tuner(
  tuner = grid.search.5,
  learner = fuser.learner,
  resampling = subtrain.valid.cv,
  measure = mlr3::msr("regr.mse"))
reg.learner.list <- list(
  mlr3::LearnerRegrFeatureless$new(), fuser.learner.tuned)

(same.other.grid <- mlr3::benchmark_grid(
  group.task,
  reg.learner.list,
  same_other_cv))

if(require(future))plan("multisession") 
bench.result <- mlr3::benchmark(same.other.grid, store_models = TRUE)

EngineerDanny commented 1 month ago

@tdhock Another issue I faced while isolating just the lrn("regr.fuser") class: all and other work fine, but same does not, because fuser cannot train when there is only one group. The exact error was Error in G[i, j] : subscript out of bounds. In the library, G is a k by k (number of groups) matrix which controls the amount of information sharing between the groups. Essentially I think only all is useful in the fuser package, because of the way it works.

My question is: how do I specify in mlr3resampling::ResamplingSameOtherCV$new() to run only, say, all and other?

tdhock commented 1 month ago

"cannot combine stratification with grouping" comes from using mlr3::ResamplingCV which does not support both, even though your task defines both, so to work-around that I had to fork that code and remove the error message, so please try the code in this branch https://github.com/tdhock/mlr3resampling/pull/8

EngineerDanny commented 1 month ago

There is an issue here https://github.com/FrankD/fuser/issues/1 about defaulting to the normal LASSO when there is no information sharing, because there is only one group.
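Until that is fixed upstream, one possible work-around is to branch on the number of groups before calling fuser (a sketch under assumptions: the fuser::fusedLassoProximal argument names and the full-sharing G matrix would need to be checked against the fuser docs; `train_fused_or_lasso` is a made-up helper name):

```r
# Hypothetical fallback: with a single group there is nothing to fuse,
# so fit an ordinary lasso via glmnet instead of calling fuser.
library(glmnet)
train_fused_or_lasso <- function(X, y, groups, lambda, gamma) {
  if (length(unique(groups)) < 2) {
    # no information sharing possible: plain L1-penalized regression
    fit <- glmnet(X, y, alpha = 1, lambda = lambda)
    return(as.matrix(coef(fit))[-1, , drop = FALSE])  # drop intercept row
  }
  k <- length(unique(groups))
  G <- matrix(1, k, k)  # full information sharing between groups
  fuser::fusedLassoProximal(X, y, groups, lambda = lambda, gamma = gamma, G = G)
}
```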

EngineerDanny commented 1 month ago

I get the error below when I use the mlr3tuning::auto_tuner. It works fine when I use the normal LearnerRegrFuser class.

Error in benchmark_grid(self$task, self$learner, resampling, param_values = list(xss)) : 
  A Resampling is instantiated for a task with a different number of observations
tdhock commented 1 month ago

remotes::install_github("tdhock/mlr3resampling@cv-ignore-group")
EngineerDanny commented 1 month ago

This is the first run of fuser on the necromass data. fuser does not perform better than featureless in some cases. I think there is a bug in the code, maybe in the implementation of fuser? I have yet to find the issue.

(figure: fuser_results)

EngineerDanny commented 4 weeks ago

To address the consistent error issue above, here are results on two data sets:

necromass: (figure: fuser_necromass_results_1)

moving_pictures (publicly available data set): (figure: fuser_moving_pictures_results_1)

tdhock commented 4 weeks ago

by the way I updated mlr3resampling on CRAN, you may want to update and read https://cloud.r-project.org/web/packages/mlr3resampling/vignettes/ResamplingSameOtherSizesCV.html

EngineerDanny commented 2 weeks ago

@tdhock I can't seem to find my way around this fuser algorithm. I have fixed the issue with the indexing, but it still doesn't seem to do better than featureless in most cases. Sometimes it does much better (second figure). I used auto_tuner with fuser, specifically with RandomSearch, because GridSearch was taking very long:

fuser.learner =  LearnerRegrFuser$new()
fuser.learner$param_set$values$lambda <- paradox::to_tune(0.001, 1, log=TRUE)
fuser.learner$param_set$values$gamma <- paradox::to_tune(0.001, 1, log=TRUE)
fuser.learner$param_set$values$tol <- paradox::to_tune(1e-10, 1e-2, log=TRUE)

These are the results on three public datasets. What do you think about them?

moving_pictures: (figure: tuned_moving_pictures_results_6)

hmpv13: (figure: hmpv13_results)

hmpv35: (figure: hmpv35_results)

Still investigating this issue; it could be that there is a problem with the actual fuser implementation, or maybe I am not using a large enough range of hyper-parameters for the cross-validation.

tdhock commented 2 weeks ago

you should take the default value for tol (not tuned). lambda and gamma ranges look reasonable, but you should check to see whether you are selecting the largest or smallest values, and maybe compare to the lambda/penalty value that glmnet selects.
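That check could look something like the sketch below (made-up data for illustration; the idea is to see whether the tuned lambda sits at a boundary of its search range, and how it compares to cv.glmnet's selection):

```r
# Sketch: compare the auto_tuner-selected lambda against cv.glmnet's choice.
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)
y <- X %*% c(1, -1, 0, 0, 0.5) + rnorm(100)
cv.fit <- cv.glmnet(X, y)
cv.fit$lambda.min  # compare this to the lambda that auto_tuner selects
# if the tuned value equals 0.001 or 1 (the search range endpoints),
# the range is probably too narrow and should be widened
```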

EngineerDanny commented 2 weeks ago

This is the truth response graph for the boston housing dataset.

Fuser: (figure: 000024)

CVGlmnet: (figure: 00001a)