mmahmoudian / sivs

An iterative feature selection method that internally utilizes various machine learning methods that have embedded feature reduction in order to shrink the feature space down to a small yet robust set.
GNU General Public License v3.0

If there is only one converged run, `sivs()` will break #2

Open mmahmoudian opened 1 year ago

mmahmoudian commented 1 year ago

There is one very rare condition that sivs() does not catch, and when it happens, the function breaks without a proper error. In this post/issue I will first dissect how this issue can happen, then list the ways it can be addressed, and ultimately elaborate on my preference for how it should be addressed.

The reasons I'm writing this are:

Dissection of the problem

In the following line we remove the runs that have failed to converge (for whatever reason is relevant to the chosen method):

https://github.com/mmahmoudian/sivs/blob/4a57625f2e430e7cb785dec5b9f9bd8b3a75c41d/R/sivs.R#L1010-L1011
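For readers who do not want to open the source, here is a minimal sketch of what this filtering amounts to; the object name iterative.res and the NULL/NA convention are illustrative assumptions, not necessarily what sivs.R does internally:

# keep only the runs that produced a usable fit; failed/non-converged runs
# are assumed here to be NULL or all-NA placeholders
clean.iterative.res <- Filter(
    f = function(run) !is.null(run) && !all(is.na(run)),
    x = iterative.res
)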

In the following lines we make sure that we have at least something to proceed with:

https://github.com/mmahmoudian/sivs/blob/4a57625f2e430e7cb785dec5b9f9bd8b3a75c41d/R/sivs.R#L1013-L1019
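Conceptually, that guard is nothing more than the following (again only a sketch, not the exact code in sivs.R):

# abort early if not a single run converged
if (length(clean.iterative.res) < 1) {
    stop("None of the iterative runs converged; there is nothing to proceed with.")
}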

If we have at least one good run, we proceed with this beautiful and compact part that extracts the coefficients into a data.frame format:

https://github.com/mmahmoudian/sivs/blob/4a57625f2e430e7cb785dec5b9f9bd8b3a75c41d/R/sivs.R#L1021-L1030
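Roughly, that part collects the coefficients of every converged run into one data.frame, with the feature names in the first column and one column per run. A hedged sketch of that idea (assuming the runs are e.g. cv.glmnet fits, as in the default sivs setup; this is not the actual sivs code):

# first column: feature names; remaining columns: coefficients of each run
coef.df <- data.frame(
    feature = rownames(as.matrix(coef(clean.iterative.res[[1]]))),
    sapply(clean.iterative.res, function(run) as.matrix(coef(run))[, 1])
)

Note that with a single converged run this construction yields exactly the two-column data.frame described below.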

The issue arises when clean.iterative.res has a length of 1, which causes coef.df to be a data.frame with only two columns. When we move the feature names to row.names, the data.frame is left with only one column, which consequently gets coerced into a vector:

https://github.com/mmahmoudian/sivs/blob/4a57625f2e430e7cb785dec5b9f9bd8b3a75c41d/R/sivs.R#L1032-L1034

This can be reproduced with this small demo:

# only keeping two columns
a <- iris[, 1:2]
# convert the first column to some unique character values
a[, 1] <- paste0("a", 1:nrow(a))
# the following two lines are exactly what we do in the code chunk above (sivs.R#L1032-L1034)
row.names(a) <- a[, 1]
a <- a[, -1]
> a
  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1
 [36] 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5
 [71] 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0
[106] 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1
[141] 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0
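As a side note, the coercion itself is just R's default drop = TRUE behaviour when subsetting leaves a data.frame with a single column; had the last line of the demo been written with drop = FALSE, the object would have stayed tabular:

# keeps 'a' a single-column data.frame instead of collapsing it to a vector
a <- a[, -1, drop = FALSE]
class(a)  # "data.frame"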

And this is the problem: the following apply() expects a tabular object, but what it receives is just a vector:

https://github.com/mmahmoudian/sivs/blob/4a57625f2e430e7cb785dec5b9f9bd8b3a75c41d/R/sivs.R#L1036-L1039
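To make the failure concrete: once a has collapsed to a vector as in the original demo, any row-wise apply() call fails, because apply() requires an object with dimensions (the anonymous function here is only a placeholder, not the one used in sivs.R):

# 'a' is the vector produced by the demo above
apply(a, 1, function(x) sum(x != 0))
#> Error in apply(a, 1, function(x) sum(x != 0)) : dim(X) must have a positive length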


Possible solutions and my thoughts

There are many ways to solve this programmatically, like converting coef.df to a matrix or even using the tibble package for all the tabular data we have in the SIVS package; but in my opinion the question is more fundamental (when nfolds > 1):

I would argue that the RFE step should not be run, because relying on one single lucky run goes against reproducible research and against what "Stable Iterative Variable Selection" is all about.

If we decide not to move on to the next step in SIVS (i.e., recursive feature elimination), then the question is how this should be reported to the user and what we should return. Possible options are:

  1. Throw an error and return nothing to the user (not even that one lucky run)
  2. Throw a warning and return whatever we have so far

I would argue that we should throw an error and not return anything to the user. This is because, despite the article stating that SIVS should not be treated as a black-box method, many would still naively run the code and expect magic, and on the off chance of hitting this "lucky" run, they would simply go with what SIVS provides and move forward without thinking twice about what has happened and why. Perhaps when #1 is implemented, the user will have access to the list of coefficients and can do some sort of RFE themselves, at their own risk, if they want to.

Reports

This issue has been reported on Stackoverflow as well:

Acknowledgment

I would like to thank Alexander Biehl, who reported this issue. I will edit this post when/if I get his consent to put his real name or GitHub ID here. As I mentioned, this is a very rare case, and Alex responsibly and kindly reported the issue to me via email. The time he invested in writing that email should also be factored into the resolution of this issue. Alex also pointed me to the Stackoverflow post which I mentioned above.

mmahmoudian commented 1 year ago

Alex and I exchanged a few emails, but I thought this part of his last email should be posted here, as it directly contributes towards addressing this issue:

I think I agree with your conclusion that it should probably not proceed to RFE, as there was only one successful run. However, on the other hand, that also raises the question of how many runs should be successful before moving on to the RFE step, since even with only 2 successful runs it also falls under "against having a reproducible research and what 'Stable Iterative Variable Selection' is all about", as you wrote. But that would mean deciding on either a hard number or a percentage of overall runs that should always complete before moving on to RFE. This could also be an argument option (with a default of e.g. 50% or similar); then users can still change it (and hopefully are aware of its meaning). As with my suggestion below, this is however the most work-intensive option for you regarding implementation.

Either way, regarding how you report this: arguments can be made for both sides. Assuming a "naive user", an error might lead them to assume SIVS does not work and to move to a different feature elimination method, not solving their problem, while a warning might just be ignored (or even missed, depending on the implementation - I for one used SIVS in leave-pair-out cross-validation with many thousand runs of sivs; I would not even notice the warnings then).

One approach that might be a good middle ground, but is also possibly the most work for you to implement, would be to throw a custom error explaining the problem, while also offering an argument option to instead just throw a warning. The error could reference that argument option, so people could still run SIVS if they insist on it - and you would at least have done everything you could to incentivize them to acknowledge the problem, without necessarily driving them to other feature selection methods that are generally not as thorough as SIVS in their design towards reproducible research.

Both these suggestions are of course questionable regarding the implementation-work/usefulness balance, as this could very well be a rare edge case as you mentioned; plus, at least the first suggestion might change previous SIVS results (where, unknowingly, some results might have had only 2 successful runs, although I would expect this to also happen quite rarely - I of course don't have data to back that up though).

Example pseudo-code would be (including both suggestions):

sivs::sivs(x = data, [...], converged_run_perc = 0.5, error_when_too_few_runs = TRUE)

[...]
if ((converged_runs / attempted_runs) < converged_run_perc):
    if (error_when_too_few_runs):
        raise_error() # explains source and the implications, but mentions error_when_too_few_runs option still
    else:
        raise_warning() # explains source and the implications
else:
    continue with RFE step
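For completeness, here is a hedged R sketch of what this combined suggestion could look like; the argument names converged_run_perc and error_when_too_few_runs are Alex's proposals rather than an existing sivs API, and the helper name check_converged_runs is made up for illustration:

# hypothetical helper, not part of the current sivs API
check_converged_runs <- function(converged_runs,
                                 attempted_runs,
                                 converged_run_perc = 0.5,
                                 error_when_too_few_runs = TRUE) {
    if ((converged_runs / attempted_runs) < converged_run_perc) {
        msg <- paste0("Only ", converged_runs, " out of ", attempted_runs,
                      " iterative runs converged (threshold: ",
                      converged_run_perc * 100, "%). Proceeding to the RFE step ",
                      "would rest on too few runs; see 'error_when_too_few_runs'.")
        if (error_when_too_few_runs) {
            stop(msg, call. = FALSE)
        } else {
            warning(msg, call. = FALSE)
        }
    }
    # otherwise the caller continues with the RFE step
    invisible(TRUE)
}

# it would be called right before the RFE step, e.g. (assuming iter.count is
# the total number of attempted runs):
# check_converged_runs(length(clean.iterative.res), attempted_runs = iter.count)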