longhaiSK / HTLR

Bayesian Logistic Regression with Hyper-LASSO priors
https://longhaisk.github.io/HTLR
GNU General Public License v3.0

Explanation of initial state parameter? #4

Open GabeAl opened 4 years ago

GabeAl commented 4 years ago

Hello,

I'd like to use this package with extremely high-dimensional datasets, which aren't supported by glmnet because of its 4 GB integer/array size limitation. Therefore I want to know how I can specify the initial state (the "init" parameter of htlr()) for the Markov chain, and what format this variable can take.

For example, I would like to use the biglasso package on millions of features and tens of thousands of samples, which is trivial for my system with its terabytes of RAM and hundreds of threads, but of course impossible for glmnet, which uses very old Fortran .Call bindings!

I also want to experimentally select subsets of variables to start with using other variable selection techniques like partial SIS, etc -- but not to restrict analysis to those features, just to initialize the chain. Random initialization takes a very long time and sometimes does not lead to a stable result, but with lasso's initialization it takes a few seconds, and each iteration happens in less than one second.

Thanks!

longhaiSK commented 4 years ago

We did not design HTLR to be superior to glmnet in handling larger datasets. One way to start the MCMC in HTLR is to use the lasso estimate. The MCMC method will indeed require more memory than glmnet. However, this is a direction that we will work on in the future. You can try biglasso for handling very big datasets.

Using SIS to downsize the dataset is a good idea, but be mindful of the feature selection bias problem. To avoid false discoveries, one needs to re-select features using SIS within each fold of cross-validation, using only that fold's training data. More details are given in this thesis: https://math.usask.ca/~longhai/researchteam/theses/DONG-THESIS-2019.pdf
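To make the per-fold re-selection concrete, here is a minimal sketch in plain Python. It is not HTLR code and not the procedure from the thesis: the function names (`marginal_scores`, `top_k_features`, `cv_folds`, `screen_within_folds`) are hypothetical, and ranking features by absolute marginal correlation is used only as a simple stand-in for SIS. The point it illustrates is that the screening step sees the training rows of each fold and never the held-out rows.

```python
import random


def marginal_scores(X, y):
    """Absolute Pearson correlation of each column of X with y.

    Used here as a simple SIS-style screening score (an assumption for
    illustration, not the scoring rule of any particular package).
    """
    n = len(y)
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_ss = sum(d * d for d in y_dev)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        c_ss = sum(d * d for d in c_dev)
        denom = (c_ss * y_ss) ** 0.5
        cov = sum(a * b for a, b in zip(c_dev, y_dev))
        # Constant columns (or constant y) get a score of zero.
        scores.append(abs(cov / denom) if denom > 0 else 0.0)
    return scores


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features, in ascending order."""
    scores = marginal_scores(X, y)
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])


def cv_folds(n, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds held-out sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def screen_within_folds(X, y, n_folds, k, seed=0):
    """Re-select features inside every fold, using training rows only."""
    results = []
    for test_idx in cv_folds(len(y), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        selected = top_k_features(X_train, y_train, k)
        results.append((train_idx, test_idx, selected))
    return results
```

Each tuple returned by `screen_within_folds` holds the training rows, the held-out rows, and the features chosen for that fold; a model would then be fitted on the training rows restricted to those features and evaluated on the held-out rows.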

GabeAl commented 4 years ago

Thanks @longhaiSK !

This is all great advice. I'll avoid using it for ultra-high dimensional datasets, then.

I am still wondering if you might be able to provide a description of what the "init" parameter wants to see, other than the string. Can I provide other initial states?

Thanks again!

syumet commented 4 years ago

Yes, you can. For reference, you can first take a look at our code that generates the Lasso initial states. We will come up with a clear description in the next release.

About the dimensionality: To be honest, we haven't tried datasets of that size, but technically the memory allocation limit should not be a problem, as our module is written in C++. You can give it a try once you have your initial state ready. We look forward to your feedback!

Best regards.

longhaiSK commented 4 years ago

Hi GabeAl,

You can simply supply init with a matrix of regression coefficients found with another function. We expect a (p+1)-by-K matrix, where p is the number of features and K is the number of classes in y minus 1. When K = 1, you can also supply a vector of length p+1. We will update the documentation about this.
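As a rough illustration of the shape described above, the following plain-Python sketch assembles a (p+1)-by-K table from coefficients produced elsewhere. The helper names (`init_shape`, `build_init`, `as_vector`) are hypothetical and are not part of HTLR. Placing the intercepts in the first row is an assumption, since the thread only gives the dimensions; please check the package documentation for the expected row order before passing anything to htlr().

```python
def init_shape(n_features, n_classes):
    """Expected shape of the initial state: (p + 1, K) with K = classes - 1."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    return (n_features + 1, n_classes - 1)


def build_init(intercepts, coefs):
    """Stack K intercepts on top of a p-by-K coefficient table.

    ASSUMPTION: the intercept occupies the first row. The thread does not
    state the row order, only the (p + 1)-by-K dimensions.
    """
    k = len(intercepts)
    for j, row in enumerate(coefs):
        if len(row) != k:
            raise ValueError(f"feature {j} has {len(row)} values, expected {k}")
    return [list(intercepts)] + [list(row) for row in coefs]


def as_vector(init):
    """For K = 1, flatten the single-column table to a vector of length p + 1."""
    if any(len(row) != 1 for row in init):
        raise ValueError("a vector form only exists when K = 1")
    return [row[0] for row in init]


# Example: 3 features and 3 classes, so K = 2 and the table is 4-by-2.
init = build_init(
    intercepts=[0.1, -0.2],
    coefs=[[1.5, 0.0], [0.0, -0.7], [0.0, 0.0]],
)
```

Coefficients from any sparse fit (for example, one produced by biglasso) could be laid out this way, with zeros for features the fit did not select, so the chain starts from that estimate without restricting the analysis to those features.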