rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

About Saving SFS Intermediate State #666

Open sskarkhanis opened 4 years ago

sskarkhanis commented 4 years ago

Hello

Question on using SFS. I have 275 features in my data and have been experimenting with SFS to find the "best" feature set. I've tried combinations of forward, backward, and the floating variants. SFS runs almost to the end; e.g., using backward floating selection, it gets down to 25 features and then crashes. It has taken me 4-5 days to reach this point (even running on 64 cores).

I was wondering if there is a way to extend SFS to save the intermediate state and restart from there?

rasbt commented 4 years ago

Arg, that sounds frustrating. Sorry to hear about the crashing. Do you know if this is due to some multiprocessing/joblib-related issue? And do you have the error message, by chance?

Right now, there is no way to save an intermediate state. The only thing I can imagine doing right now is running the SFS (via backward selection) down to, say, 26 features, then save/print the feature subset and use it in another round of SFS:

    import yaml

    sfs1 = SequentialFeatureSelector(..., k_features=26)
    sfs1.fit(X, y)
    print(sfs1.subsets_)

    # yaml.dump writes to a file handle, not a filename; numpy values in
    # subsets_ (e.g., cv_scores) may need converting to plain Python types first
    with open('savefile.yaml', 'w') as f:
        yaml.dump(sfs1.subsets_, f)

Then in a new session

    import yaml

    with open('savefile.yaml', 'r') as f:
        previous_subsets = yaml.safe_load(f)

    # assuming X is a DataFrame, select the 26-feature subset found previously
    sfs2 = SequentialFeatureSelector(..., k_features=1)
    sfs2.fit(X[list(previous_subsets[26]['feature_names'])], y)
    print(sfs2.subsets_)

Another way would be to add an optional checkpointing parameter that saves a pickle file of the current object in each iteration.
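
Until something like that exists, a rough manual version of the same idea is to pickle the whole (partially) fitted SFS object yourself and reload it later. Just a sketch; estimator, X, and y are placeholders:

    import pickle
    from mlxtend.feature_selection import SequentialFeatureSelector

    # run backward selection part of the way, then snapshot the fitted object
    sfs_partial = SequentialFeatureSelector(estimator, k_features=26,
                                            forward=False, floating=False)
    sfs_partial.fit(X, y)

    with open('sfs_checkpoint.pkl', 'wb') as f:
        pickle.dump(sfs_partial, f)

    # later, in a new session: restore and inspect the saved subsets
    with open('sfs_checkpoint.pkl', 'rb') as f:
        sfs_restored = pickle.load(f)
    print(sfs_restored.subsets_)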

Saniamos commented 4 years ago

I know this is an old issue, but I had a similar problem: running out of memory and running into NaNs with the forward selection (two separate issues). One way to at least get partial results could be to have the library print the currently selected features alongside the current score. This could be problematic for backward selection, though...

rasbt commented 3 years ago

Just organizing the issues for future enhancement, and I think this may be interesting & related: #239

ecod3r commented 2 years ago

Following up on this, as I think checkpointing functionality would be very useful. It would help not just when computations unexpectedly fail, but also when time limits or random stops apply, which is common on Colab, on pre-emptible cloud instances, and in some cluster environments.

It seems like saving the parameters of the SFS as well as the set of selected/excluded features would be enough to resume the computation from a specific round (rough sketch below). That would already be a very useful first pass, and would be easier than resuming from the exact point within the latest round. Any chance this will be worked on imminently?

For me, this makes the difference between being able to use the class and having to roll my own.
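
To make the idea concrete, the saved state could be as small as this (hand-written illustration only; nothing like it exists in mlxtend today, and the keys are made up):

    import json

    # hypothetical resume state: the selector's parameters plus the
    # features chosen so far (assumes `sfs` is a partially fitted selector)
    state = {
        'params': {k: v for k, v in sfs.get_params().items()
                   if isinstance(v, (int, float, str, bool, type(None)))},
        'selected_features': [int(i) for i in sfs.k_feature_idx_],
    }
    with open('sfs_state.json', 'w') as f:
        json.dump(state, f)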

rasbt commented 2 years ago

Yeah, I can see how this would be very useful when running on spot instances etc. I also agree that using the parameters plus the already-selected features would probably be enough. Just thinking about what the most elegant way to achieve this would look like:

a. I agree, one could just save the latest feature set (self.k_feature_idx_) and use it to subselect from the feature array, but then we would not have the dictionary for plotting, etc.

b. Maybe the most scikit-learn-like API would be to have a new method (e.g., from_subsets_(dict)) that reads in a saved self.subsets_ dictionary and updates the SFS. And then during fitting, we use warm_start=True to continue the training.
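
From the user's side, option (b) might look roughly like this (purely illustrative; neither from_subsets_ nor warm_start exists in the current API):

    # hypothetical API sketch -- not implemented in mlxtend
    sfs2 = SequentialFeatureSelector(estimator, k_features=1,
                                     forward=False, floating=True)

    # restore the subsets_ dictionary saved from an interrupted run
    sfs2.from_subsets_(previous_subsets)   # hypothetical method

    # continue from the restored state instead of starting over
    sfs2.fit(X, y, warm_start=True)        # hypothetical flag
    print(sfs2.k_feature_idx_)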

> Any chance this will be worked on imminently?

Unfortunately, I don't have the capacity to work on this at the moment. But in case you find a good solution that works well, I would appreciate a PR.

deysn commented 1 year ago

Hi, I am also having the same issue while using it on a high-dimensional dataset. I was wondering if a parallel/distributed implementation already exists? Then it could be run with Spark on HPC clusters. Thanks.

rasbt commented 1 year ago

Currently, it is using joblib to handle the parallelism. Unfortunately, a parallel/distributed strategy for Spark is not implemented or planned at the moment.
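
For reference, that joblib-based parallelism is controlled through the n_jobs parameter; a typical setup looks something like this (estimator and data are placeholders):

    from mlxtend.feature_selection import SequentialFeatureSelector
    from sklearn.ensemble import RandomForestClassifier

    # n_jobs=-1 evaluates candidate feature subsets on all available cores via joblib
    sfs = SequentialFeatureSelector(RandomForestClassifier(n_estimators=100),
                                    k_features=25,
                                    forward=False,
                                    floating=True,
                                    cv=5,
                                    n_jobs=-1)
    sfs = sfs.fit(X, y)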