Stabl's performance on low-dimensional datasets

kelvinmo0513 commented 5 months ago

Hi - Is there a recommended range of number of input features for processing with Stabl? I am working with a clinical dataset that includes around 200 features after one-hot encoding the categorical variables. Finding an optimal FDR threshold for this classifier has proven challenging. Do you have any suggestions on how to streamline the feature selection process in this scenario?

[Attached image: Screenshot 2024-06-22 at 5 52 55 PM]

xavdurand commented 5 months ago

Hello @kelvinmo0513 ,

An FDR estimate greater than 1 means that you are selecting more artificial features than real features during the selection process. This is not related to the number of input variables: Stabl can handle an arbitrarily high number of features with only a few samples. In your case, I think the problem comes from the high number of binary features (one-hot encoding of categorical variables). A few possible solutions:

1) How do you generate the artificial features? Do you use knockoff? If so, try changing it to random_permutation, which might work better. The knockoff type assumes that features are Gaussian, which is not really the case for binary features.

2) A categorical feature can carry information as a whole that is diluted in its one-hot encoded version. Is there an order in the values of your categorical variable? Can the categories be separated into a few ordered groups? It could be worth thinking about transforming the categorical variable into an ordinal variable.
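To make the two suggestions concrete, here is a minimal NumPy sketch. It is not Stabl's internal code: the helper make_permuted_artificial_features and the tumor-stage variable are hypothetical and only illustrate the ideas. A permuted artificial feature is a shuffled copy of a real column, so it stays binary and keeps the same proportion of ones, while its link to the outcome is broken. An ordinal encoding replaces several one-hot columns with a single ranked column.

```python
import numpy as np


def make_permuted_artificial_features(X, seed=0):
    """Return a copy of X in which each column is shuffled independently.

    Hypothetical helper for illustration; not part of the Stabl API.
    """
    rng = np.random.default_rng(seed)
    X_art = np.empty_like(X)
    for j in range(X.shape[1]):
        # Shuffling keeps the column's values (still 0/1 for binary data)
        # but destroys any association with the outcome.
        X_art[:, j] = rng.permutation(X[:, j])
    return X_art


# Toy binary design matrix standing in for one-hot encoded clinical features.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(100, 5))

X_art = make_permuted_artificial_features(X)
X_augmented = np.hstack([X, X_art])  # real features followed by artificial ones

# Suggestion 2: encode an ordered categorical variable as a single ordinal column.
stage_order = ["I", "II", "III", "IV"]  # hypothetical ordered categories
stage_to_rank = {level: rank for rank, level in enumerate(stage_order)}
stages = ["II", "I", "IV", "III", "II"]  # hypothetical raw values
stage_ordinal = np.array([stage_to_rank[s] for s in stages])
```

With an ordinal encoding, the four one-hot columns for the hypothetical stage variable collapse into one column, which reduces the number of binary features competing in the selection.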

I hope this helps. Kind regards,

kelvinmo0513 commented 5 months ago

Hi @xavdurand ,

Thank you for your response. It's very helpful to know that!

On another note, I was wondering if there is a function/attribute of the Stabl class that returns all the features and their associated coefficients in a dictionary format, so I can see the direction in which each feature affects the outcome?

Thank you so much!

Best, Kelvin

xavdurand commented 5 months ago

Hi @kelvinmo0513 ,

Stabl is used to select features. You can extract the selected features using the get_support function. Because Stabl only selects variables and is not designed to predict the outcome, you cannot see the direction in which each feature affects the outcome from Stabl alone. To do this, you need to fit a final model on the selected features to predict the outcome (for example, a linear regression model). You can then extract each variable's coefficient from this final model. Alternatively, you can visualize the univariate relationship between each variable and the outcome by plotting each variable against the outcome.
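The final-model step could look roughly like the sketch below. It uses scikit-learn's LinearRegression on synthetic data. The support array is a hand-written stand-in for the result of get_support on a fitted Stabl object, under the assumption that it returns a boolean mask as scikit-learn selectors do by default. The feature names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: only "age" and "marker_a" actually drive the outcome.
rng = np.random.default_rng(0)
feature_names = np.array(["age", "bmi", "marker_a", "marker_b", "noise"])
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Stand-in for the mask of selected features (assumption: in practice this
# would come from calling get_support() on the fitted selector).
support = np.array([True, False, True, False, False])

# Fit the final model on the selected columns only.
final_model = LinearRegression().fit(X[:, support], y)

# Map each selected feature to its coefficient; the sign gives the direction.
coefficients = {
    str(name): float(coef)
    for name, coef in zip(feature_names[support], final_model.coef_)
}
for name, coef in coefficients.items():
    print(f"{name}: {coef:+.3f}")
```

For a binary outcome, scikit-learn's LogisticRegression can play the same role, and its coef_ attribute is read in the same way.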

Hope this helps, Xavier

gregbellan / Stabl

Stabl's performance on low-dimensional datasets #12