privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

Could I use `big_spLogReg()` for multi-class L1-regularized logistic regression? #153

Closed antonioggsousa closed 2 years ago

antonioggsousa commented 2 years ago

Hi!

I came across this package today. It seems a great package! Thank you for developing it.

I'm using the LiblineaR package for a multi-class L1-regularized logistic regression task, and I would like to test whether big_spLogReg() could make it faster and more memory-efficient.

If I understood correctly, though, big_spLogReg() performs L1-regularized logistic regression but is not adapted to multi-class classification. Would this be easy to implement/adapt?

I tested it with the toy iris data set and it works really well, but as far as I understand it only handles binary classification (0 or 1).

# Packages
library("LiblineaR")
library("bigstatsr")

# Data
set.seed(1024)
rows.select <- sample(1:nrow(iris), round(0.7*nrow(iris)))
train <- iris[rows.select,]
test <- iris[-rows.select,]

# LiblinearR
set.seed(1024)
model <- LiblineaR(data=train[,-ncol(train)], target=train$Species, type=6)
test[,"predictions"] <- predict(model,test[,-ncol(test)])$predictions
head(test,15)
table(test$Species, test$predictions)

# bigstatsr
set.seed(1024)
X <- as_FBM(as.matrix(iris[, -ncol(iris)]))    # FBM needs a numeric matrix
y01 <- ifelse(train$Species == "setosa", 1, 0) # binary target: setosa vs the rest
res <- big_spLogReg(X, y01, ind.train = as.numeric(row.names(train)),
                    covar.train = NULL,
                    alphas = 1, warn = FALSE, K = 2,
                    ncores = 7)
test[, "new.predictions"] <- predict(res, X, ind.row = as.numeric(row.names(test)),
                                     covar.row = NULL)

Thank you!

Best regards,

António

privefl commented 2 years ago

I guess there are two strategies for multi-class prediction:

- one-vs-rest: fit one binary model per class and predict the class whose model gives the highest score;
- one-vs-one: fit one binary model per pair of classes and take a majority vote.

I don't know which one is used by default in {LiblineaR}, but that shouldn't be too hard to implement with a for loop, I guess.

antonioggsousa commented 2 years ago

Hi @privefl,

Thank you for your answer. I can give it a try.

I was just afraid that I'd make things slower that way.

Can I disable cross-validation? It seems possible in LiblineaR at least, but here I tried setting K = 1 and was unable to run it that way.

Again, thank you for your answer and great package.

António

privefl commented 2 years ago

No, it's not currently possible to disable the crossval.

antonioggsousa commented 2 years ago

Thank you.

This is what I tried (one-vs-rest), and it even manages to predict versicolor, where LiblineaR fails, though it takes longer to run.

# One-vs-rest: fit one binary big_spLogReg() model per class
multi_LR <- function(data, r.train, r.test, preds.test) {
  set.seed(1024)
  X <- as_FBM(as.matrix(data))
  # one 0/1 indicator column per class (preds.test holds the training labels)
  cm <- model.matrix(~ 0 + target, data.frame(target = preds.test))
  colnames(cm) <- levels(factor(preds.test))
  out <- lapply(seq_len(ncol(cm)), function(x) {
    y01 <- cm[, x]
    res <- big_spLogReg(X, y01, ind.train = r.train,
                        covar.train = NULL,
                        alphas = 1, warn = FALSE,
                        K = 2, ncores = 3)
    p <- predict(res, X, ind.row = r.test, covar.row = NULL)
    list(prediction = p, model = res)
  })
  names(out) <- colnames(cm)
  out
}

# test 
set.seed(1024)
multi.test <- multi_LR(data=iris[,-ncol(iris)], r.train=as.numeric(row.names(train)), 
                       r.test=as.numeric(row.names(test)), preds.test=train[,ncol(train)])

preds <- lapply(seq_along(multi.test), function(x) multi.test[[x]]$prediction)
comb.preds <- do.call("cbind", preds)
colnames(comb.preds) <- names(multi.test)
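To turn the combined per-class scores into a single predicted label, one can take, for each row, the class whose model gives the highest score (argmax). A minimal base-R sketch, using a made-up score matrix in place of `comb.preds`:

```r
# Each column holds one one-vs-rest model's scores; the predicted class
# for a row is the column with the highest score (argmax per row).
scores <- matrix(c(0.9, 0.2, 0.1,
                   0.1, 0.7, 0.3,
                   0.0, 0.1, 0.6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("setosa", "versicolor", "virginica")))
pred_class <- colnames(scores)[max.col(scores)]
pred_class
#> [1] "setosa"     "versicolor" "virginica"
```

The same `max.col()` call applied to `comb.preds` should give one label per test row.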

privefl commented 2 years ago

Sounds interesting.

What do you expect from me exactly?

antonioggsousa commented 2 years ago

I was just wondering if you had any suggestions for making it faster, in case I was using it wrong.

I guess you can close the issue.

Thank you for your answers.

privefl commented 2 years ago

Is it really slow? What is the size of your data?

privefl commented 2 years ago

And how many models do you have to run? (i.e. how many classes do you have)

antonioggsousa commented 2 years ago

I haven't tried big_spLogReg() with my data yet. I was just testing it to see how it compares with LiblineaR, which is what we're using.

We're running multi-class L1-regularized LR (using LiblineaR) for around 200 epochs on matrices with several thousand rows and columns, e.g. 13,000 x 10,000. Each epoch involves around 15 classes.

I was just trying to find a way to speed this up. I don't know whether using big_spLogReg() instead of LiblineaR would require fewer epochs and lower the overall running time.

Anyway thank you for your time and interest.

P.S.: when I said it was slow, I was comparing it with LiblineaR on the toy iris data, and that is probably also due to the cross-validation step. I haven't tested it on bigger/more realistic data sets, where it may scale better and outperform LiblineaR. I was just trying to get some advice before implementing this.

privefl commented 2 years ago

What are you calling "epochs" here exactly?

To make this fast enough, I would parallelize the loop over the 15 classes (e.g. with {foreach}). You can also enable parallelization within the function, over the K folds.
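For instance, the per-class loop can run in parallel with base R's parallel package. A minimal one-vs-rest sketch on iris, with glm() standing in for big_spLogReg() so the example is self-contained (with bigstatsr you would call big_spLogReg() on an FBM inside the worker function instead):

```r
library(parallel)

# One-vs-rest on iris: one binary logistic fit per class, run in parallel.
classes <- levels(iris$Species)
fit_one <- function(cl) {
  y01 <- as.integer(iris$Species == cl)  # 1 for this class, 0 for the rest
  # suppressWarnings(): setosa is perfectly separable, which makes glm warn
  suppressWarnings(glm(y01 ~ ., data = iris[, 1:4], family = binomial))
}
# mclapply() forks on Unix; fall back to one core elsewhere
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L
models <- mclapply(classes, fit_one, mc.cores = n_cores)
names(models) <- classes

# Combine per-class scores and take the argmax as the predicted label
scores <- sapply(models, predict, newdata = iris[, 1:4], type = "response")
pred <- classes[max.col(scores)]
acc <- mean(pred == as.character(iris$Species))  # in-sample accuracy
```

When nesting both levels of parallelism (classes outside, K folds inside via ncores), be careful not to oversubscribe the available cores.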

Depending on the exact size of your data, I would go for something like

antonioggsousa commented 2 years ago

By epoch I mean each time the LR model is run again. We want to use LR's ability to learn and find important properties of the data, but each run uses a slightly different training set.

Thank you for your valuable advice and recommendations.

I'll follow them.