jaredhuling / fastglm

Fast glm fitting via RcppEigen
https://jaredhuling.org/fastglm/

Support for sparse covariate matrices & Eigen solvers for sparse systems? #13

Open tomwenseleers opened 2 years ago

tomwenseleers commented 2 years ago

Thanks for the great package! I was wondering if it would also be possible to support sparse covariate matrices, i.e. matrices of class dgCMatrix (where x is first made sparse using

library(Matrix)
x <- Matrix(x, sparse=TRUE)

and if the matching Eigen solvers for sparse systems (https://eigen.tuxfamily.org/dox/group__TopicSparseSystems.html) could be supported, e.g. LeastSquaresConjugateGradient (which works for both dense and sparse matrices and also allows warm starts) and SparseQR?
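For context, a quick standalone illustration (with made-up data) of the storage savings a dgCMatrix gives on a mostly-zero design matrix:

```r
library(Matrix)

set.seed(42)
n <- 1000; p <- 200
x_dense <- matrix(0, n, p)
x_dense[sample(length(x_dense), 2000)] <- rnorm(2000)  # ~1% nonzero entries

x_sparse <- Matrix(x_dense, sparse = TRUE)
class(x_sparse)        # "dgCMatrix": compressed sparse column format
object.size(x_dense)   # dense storage holds all n * p doubles
object.size(x_sparse)  # sparse storage holds only the nonzeros + indices
```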

jaredhuling commented 2 years ago

This should definitely be a possibility -- this would likely be a good bit of work, but I see no particular reason why it would be infeasible.

tomwenseleers commented 2 years ago

Great to hear! I was asking in the context of a package I am developing for L0-norm penalized GLMs, i.e. best subset selection, where I approximate the L0 norm via an iterative adaptive ridge procedure. All this requires is a fast least squares solver for the IRLS weighted least squares step: I obtain a ridge penalty by row-augmenting the covariate matrix with a diagonal matrix with sqrt(lambdas) along the diagonal and augmenting y with p zeros (or by using the Woodbury identity when p > n). There I tried Eigen's least squares conjugate gradient solver on the sparse system, which gave roughly a 200-fold speed increase if I recall correctly: https://github.com/tomwenseleers/L0adri/blob/main/L0adri/R/linearSolver.R. I have since moved development to https://github.com/tomwenseleers/L0glm, but still have to implement a choice of solvers there; if I could rely on a fast least squares solver from your package, that might be more elegant.

We also used the osqp quadratic programming solver for cases where we want to impose nonnegativity or box constraints on the coefficients (e.g. to fit identity-link Poisson GLMs), though with the ridge penalties, simply clipping negative values also appeared to work well for enforcing nonnegativity.

Let me know if you might be interested in working together on this project. I now have a bioinformatics MSc student working on it this coming semester, and we were getting quite good performance compared to glmnet, the LASSO, and so on (far fewer false positives, i.e. much better specificity, similar predictive performance, and reasonable speed). Best subset selection for large GLMs, including high-dimensional settings with p > n, with optional parameter constraints and inference via optional bootstrapping, should be of general interest, I think...
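The row-augmentation trick for the ridge penalty can be sketched in R as follows (a minimal illustration with made-up toy data and an arbitrary `lambda`, not code from either package; it uses a plain normal-equations solve where the packages would use a faster Eigen solver):

```r
library(Matrix)

set.seed(1)
n <- 100; p <- 10
x <- Matrix(rnorm(n * p), n, p, sparse = TRUE)
beta_true <- c(rep(1, 3), rep(0, p - 3))
y <- as.numeric(x %*% beta_true + rnorm(n))

# Ridge via row augmentation: append diag(sqrt(lambda)) below x
# and p zeros below y, then solve ordinary least squares on the
# augmented system. This is equivalent to minimizing
#   ||y - x b||^2 + sum(lambda * b^2).
lambda <- rep(0.5, p)
x_aug <- rbind(x, Diagonal(p, sqrt(lambda)))
y_aug <- c(y, rep(0, p))

# Normal-equations solve on the sparse augmented system; x_aug stays
# sparse, so a sparse least squares solver could be dropped in here.
beta_ridge <- solve(crossprod(x_aug), crossprod(x_aug, y_aug))
```

The same augmented system is what gets passed to the IRLS weighted least squares step in the adaptive ridge iterations, with lambda updated each round.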