rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

Exhaustive Feature Selector Performance is poor #971

Closed rasbt closed 1 year ago

rasbt commented 1 year ago

I recently noticed that running the EFS on larger datasets now has very poor performance and ends up blocking the CPU. I think this has something to do with the joblib rewrite and how the individual tasks are sent out.

On one hand, it's to be expected that large feature sets can't be exhaustively sampled. On the other hand, EFS now supports KeyboardInterrupt, so it should in theory be possible to just evaluate an unreasonably large number of combinations and then interrupt the run.

A potential way to make this work would be to have an iterative algorithm for feature subset sampling when n_jobs=0. Right now, this setting is not supported anyway, so we can take advantage of that.

NimaSarajpoor commented 1 year ago

I am planning to work on this as I see it as a good opportunity to learn something new. This time, I need to handle an issue that is different from the ones I resolved before.

I might be slow on this one as I probably need to study a few things first.

rasbt commented 1 year ago

Thanks for offering your help once more @NimaSarajpoor! And please don't worry about the timeline!

Here is some more context: I noticed that the performance becomes poor / gets stuck for even moderately sized problems.

E.g., for 10 features like

import numpy as np
X = np.random.rand(10_000, 10)

examining all feature subsets would only amount to 1023 candidates (2^10 - 1), which shouldn't cause the algorithm to get stuck if you use something like logistic regression.
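
For reference, here is a minimal sketch of such a run (random labels, so the scores themselves are meaningless; the point is only to observe the runtime, and the parameter names assume the current ExhaustiveFeatureSelector API):

import numpy as np
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import ExhaustiveFeatureSelector

X = np.random.rand(10_000, 10)
y = np.random.randint(0, 2, size=10_000)  # random labels, just to exercise the selector

efs = ExhaustiveFeatureSelector(LogisticRegression(),
                                min_features=1,
                                max_features=10,
                                cv=3,
                                n_jobs=1)
efs.fit(X, y)  # 10 features -> only 1023 candidate subsets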

Here are some thoughts on how to approach this.

My bet is that it has something to do with joblib, and that the feature subset generator gets fully evaluated so that it blows up.

I was just looking into the code again (which has become quite complicated over the years, sorry!) ... and I currently can't see another culprit. E.g., I just double-checked in line 406 that it's an iterable / generator, not a list, so that should not be the culprit:

from itertools import chain, combinations

candidates = chain.from_iterable(combinations(range(X.shape[1]), r=i) for i in range(X.shape[1]))
for candidate in candidates:
    pass

It's basically just a compact way of writing

for length in range(X.shape[1]):
    for candidate in combinations(range(X.shape[1]), length):
        pass

I think the easiest way to address this issue is to introduce an if-else clause in line 441

if not n_jobs:  # i.e., if n_jobs is set to 0
    for candidate in candidates:
        ...  # do feature selection procedure
else:
    ...  # current (joblib-based) behavior

And then if the simple version of that works (I wouldn't worry about progress bars etc), I would compare the runtime for different small datasets between n_jobs=0 and n_jobs=1
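
For illustration, here is a rough sketch of what that serial n_jobs=0 branch could look like, with sklearn's cross_val_score standing in for EFS's internal scoring helper (the function name and signature below are illustrative, not the actual EFS internals):

from itertools import chain, combinations

from sklearn.model_selection import cross_val_score


def serial_candidate_search(estimator, X, y, min_features, max_features, cv=5):
    # Plain Python loop over all candidate subsets, no joblib involved.
    candidates = chain.from_iterable(
        combinations(range(X.shape[1]), r=i)
        for i in range(min_features, max_features + 1)
    )
    best_score, best_subset = float("-inf"), None
    for candidate in candidates:
        score = cross_val_score(estimator, X[:, list(candidate)], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, candidate
    return best_subset, best_score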

NimaSarajpoor commented 1 year ago

@rasbt Thanks for providing more info. That should help me a lot for sure. I will work on this, and if I get into trouble or have a question, I will let you know :)

NimaSarajpoor commented 1 year ago

@rasbt Hi Sebastian... I just wanted to let you know that my mind is occupied with something else and I cannot focus well right now. I will start working on this in about two or three weeks.

rasbt commented 1 year ago

Hey @NimaSarajpoor no worries at all and thanks for the note! 😊

NimaSarajpoor commented 1 year ago

@rasbt I tried to test the existing code regarding this issue. I have a question for you:

I think the easiest way to address this issue is to introduce an if-else clause in line 441

if not n_jobs:  # i.e., if n_jobs is set to 0
    for candidate in candidates:
        ...  # do feature selection procedure
else:
    ...  # current (joblib-based) behavior

And then if the simple version of that works (I wouldn't worry about progress bars etc), I would compare the runtime for different small datasets between n_jobs=0 and n_jobs=1

According to the joblib documentation, the param n_jobs does not accept the value 0. In fact, I tried it and got an error about that.
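
For example (the exact error message may differ across joblib versions):

from joblib import Parallel, delayed

# joblib rejects n_jobs=0; with recent versions, calling the Parallel object
# raises something like "ValueError: n_jobs == 0 in Parallel has no meaning"
Parallel(n_jobs=0)(delayed(abs)(i) for i in range(3))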

So, I was wondering whether what you actually meant was n_jobs=1 (1 CPU) and n_jobs=-1 (all CPUs). If that is what you meant, then I can start comparing the computing time of the existing code with the new code (see below) for n_jobs == 1:

# according to what you suggested
if n_jobs == 1:
    for candidate in candidates:  # avoiding joblib
        ...  # do feature selection procedure
else:
    ...  # current behavior

rasbt commented 1 year ago

Hi @NimaSarajpoor, sorry about the confusion. That's actually intentional (but maybe confusing if people are too familiar with joblib, haha).

So, I was thinking that since joblib doesn't accept n_jobs=0/False anyway, we can use that value to run it without joblib. I think joblib can be a problem here because it spawns too many jobs. So instead of using joblib when n_jobs=0, we just use a plain old Python for-loop. I think that's how I originally started this implementation many years ago.

Or do you think that's too confusing and we should not do this? I was thinking this is maybe a good way to avoid introducing an additional parameter ...
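
As a side note on the "spawns too many jobs" point above: joblib's Parallel takes a pre_dispatch argument (default '2 * n_jobs') that is supposed to bound how much of the candidate generator is consumed ahead of the workers, so one thing worth checking is whether that path behaves as expected here. A toy illustration of the intended behavior:

import time

from joblib import Parallel, delayed


def slow_square(i):
    time.sleep(0.01)  # stand-in for evaluating one feature subset
    return i * i


# pre_dispatch limits how many tasks joblib pulls from the generator
# ahead of the workers, instead of materializing the whole generator.
results = Parallel(n_jobs=2, pre_dispatch="2 * n_jobs")(
    delayed(slow_square)(i) for i in range(100)
)
print(results[:5])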

NimaSarajpoor commented 1 year ago

Sorry for the questions...I just want to make sure I am not missing anything here.

If I understand correctly, you want to use the case n_jobs=0 to turn off joblib, which means the program will be run on 1 CPU, which is similar to n_jobs=1. However, based on what you explained, we prefer to avoid using joblib in such a case. I assume you do not care about the size of the input, and you want to turn off joblib whenever we have 1 CPU.

Is that correct? If yes, then I think:

if n_jobs == 1:  # turning off joblib
    for candidate in candidates:
        ...  # do feature selection procedure
else:
    ...  # current behavior

is good.

And we avoid letting the param n_jobs be 0. I mean... if I set n_jobs to 0, I'd prefer to get an error. And when I provide 1, we just turn off joblib.


Do you think joblib with n_jobs=1 performs faster than "no joblib" for some inputs? (I think they each run on 1 CPU.) If yes, then what I proposed above should suffice, right? (We would always avoid joblib when we have n_jobs=1.) See the quick timing sketch below.
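
One way to check that empirically could be a quick timing comparison like this (using a trivial scoring function as a stand-in for the real cross-validation of one feature subset):

import time

from joblib import Parallel, delayed


def score(candidate):
    return sum(candidate)  # stand-in for scoring one feature subset


candidates = [tuple(range(i)) for i in range(1, 2000)]

t0 = time.time()
plain = [score(c) for c in candidates]  # no joblib
t1 = time.time()
with_joblib = Parallel(n_jobs=1)(delayed(score)(c) for c in candidates)
t2 = time.time()

print(f"plain loop: {t1 - t0:.4f} s   joblib n_jobs=1: {t2 - t1:.4f} s")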

Please feel free to let me know if I am missing anything here.

rasbt commented 1 year ago

If I understand correctly, you want to use the case n_jobs=0 to turn off joblib, which means the program will be run on 1 CPU, which is similar to n_jobs=1

Yeah, that's what I had in mind. I do think that n_jobs=1 does not turn off joblib, though, but rather runs joblib in 1 main process. I think there's something about joblib that doesn't work well with large candidate sets (it may be hogging too much memory in this particular case due to how it's implemented, and then it gets stuck).

But yeah, instead of n_jobs=0, your suggested version is maybe clearer:

if n_jobs == 1:  # turning off joblib
    for candidate in candidates:
        ...  # do feature selection procedure
else:
    ...  # current behavior