sktime / sktime

A unified framework for machine learning with time series
https://www.sktime.net
BSD 3-Clause "New" or "Revised" License
7.74k stars · 1.32k forks

[ENH] mRMR, ReliefF and RReliefF feature selector #6972

Open ncooder opened 3 weeks ago

ncooder commented 3 weeks ago

Add the mRMR, ReliefF, and RReliefF feature selectors, as in MATLAB: mRMR, ReliefF and RReliefF
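For reference, a greedy mRMR sketch on tabular data. This is a simplified correlation-based variant, not MATLAB's implementation; `mrmr_select` and the toy data are illustrative only:

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR: pick the feature maximizing relevance to y
    minus mean redundancy with the already-selected features."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]

    def abs_corr(a, b):
        return abs(np.corrcoef(a, b)[0, 1])

    relevance = np.array([abs_corr(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        scores = {
            j: relevance[j]
            - np.mean([abs_corr(X[:, j], X[:, s]) for s in selected])
            for j in range(n_features)
            if j not in selected
        }
        selected.append(max(scores, key=scores.get))
    return selected

# an exact duplicate of the best feature is skipped in favor of a
# weaker but non-redundant one
y = np.array([0, 1, 2, 3, 4, 5], dtype=float)
X = np.column_stack([
    [0, 1, 2, 3, 5, 4],  # highly relevant
    [0, 1, 2, 3, 5, 4],  # exact duplicate -> fully redundant
    [0, 0, 0, 0, 0, 1],  # weaker signal, but not redundant
])
print(mrmr_select(X, y, 2))  # [0, 2]
```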

fkiraly commented 3 weeks ago

Quick question, what is the type of the resulting algorithm?

First I thought this is a series-to-series transformer, but it looks like two tabular algorithms. They also return weights and indices, so I'm not sure about the correct categorization.

Have you thought about the type of algorithm? Is it operating on time index, or is it operating on collections of time series?

ncooder commented 3 weeks ago

@fkiraly They are tabular feature selection algorithms. In principle, those algorithms are not series-to-series transformers and do not handle sequential time series data directly. Actually, I found a nice implementation in Python here. Maybe we can add them to the time series clustering module, since they do something similar.
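To make the tabular nature concrete, here is a minimal ReliefF weight-update sketch; it is a simplification (Manhattan distances, equal neighbor weights) and `relieff_weights` is an illustrative name, not an existing API:

```python
import numpy as np

def relieff_weights(X, y, n_neighbors=3):
    """Simplified ReliefF: reward features that differ across nearest
    other-class neighbors ("misses") and agree with nearest same-class
    neighbors ("hits")."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # scale each feature to [0, 1] so distances are comparable
    span = np.ptp(X, axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    n_samples, n_features = Xs.shape
    w = np.zeros(n_features)
    for i in range(n_samples):
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to all rows
        dist[i] = np.inf                       # exclude the sample itself
        same = y == y[i]
        hits = np.argsort(np.where(same, dist, np.inf))[:n_neighbors]
        misses = np.argsort(np.where(~same, dist, np.inf))[:n_neighbors]
        w -= np.abs(Xs[hits] - Xs[i]).mean(axis=0) / n_samples
        w += np.abs(Xs[misses] - Xs[i]).mean(axis=0) / n_samples
    return w

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([
    y + 0.05 * rng.standard_normal(100),  # informative feature
    rng.standard_normal(100),             # pure noise
])
w = relieff_weights(X, y)
print(w[0] > w[1])  # the informative feature gets the higher weight
```

Note that the whole procedure operates on rows as i.i.d. samples, with no time index anywhere, which is why the categorization as a tabular (sklearn-style) transformation fits.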

fkiraly commented 3 weeks ago

Hm, yes, I would also agree they look like tabular algorithms, specifically transformations in the sklearn API with required y.

The package you linked looks like what you'd need: an sklearn interface!

However, upon cursory review, there are a few problems:

A fully conformant sklearn estimator can be used in sktime pipelines, via TabularToSeriesAdaptor.

If you would like, I can advise you on how to return sklearn-relief to a maintained state. This would need transferring the code base, adding sklearn conformance checks plus continuous integration, testing and updating compatibility with newer Python versions and newer package versions such as numpy 2, and releasing new versions on PyPI. (We can do this under the sktime umbrella if you, or someone else, would like to invest some time into it. It would be a mini-project; I estimate a few days of work, but not more, followed by continuous maintenance efforts of perhaps a few hours per month.)

fkiraly commented 3 weeks ago

Adding the "good first issue" tag since it's now clear what we need to do, see above.

ncooder commented 3 weeks ago

@fkiraly Thank you for your response. I can contribute to sktime and add those algorithms; however, I would need some guidance. I have used this sklearn-relief package before, but I think these algorithms should be available in more well-established libraries like sktime.

fkiraly commented 3 weeks ago

Ok, great! If you are willing to invest a few days, happy to guide.

Some additional research has uncovered these:

- https://github.com/EpistasisLab/scikit-rebate
- https://github.com/EpistasisLab/ReBATE

These also look abandoned, with no maintenance updates for four years.

This is, interestingly, from the same lab that created tpot. FYI @simonblanke due to that connection.

Another question would be whether anyone from the above packages would be interested in contributing to a reactivation/revival. I would consider this question separately from the current project, as no answer may be forthcoming; I am just going to ping everyone who seems relevant.

For EpistasisLab: @rhiever, @ryanurbs, @alexmxu, @sauravbose, @robertfrankzhang, @pschmitt52. Alfredo Mungo (@moongoal) is now at Amazon; he seems to have disappeared from open source in April 2024.

fkiraly commented 3 weeks ago

Here's my suggestion on how to start with the project:

  1. We create a separate repo in the sktime organization - currently, there is no dedicated place for sklearn tabular estimators; we have them in various places. In this case, there are multiple existing repos that one can "revive", which would make this easiest.
  2. I would base this on sklearn-relief, because it has a more robust test setup. I would give write access to you, @ncooder, and to anyone else who would like to contribute, so you can work on it easily (I would ask you to show a legal name and org email on your account, though - alternatively, you can make PRs anonymously without write access).
  3. We update the dependency file to a modern pyproject.toml based requirement spec.
  4. On this clone, we add sklearn API conformance tests along the pattern described here: https://scikit-learn.org/stable/developers/develop.html - this is a one-liner test, but it will probably lead to a number of failures, which we will need to debug and fix.
  5. At this point, we can add sktime integration tests, i.e., a test-import of sktime, checking that the intended pipeline usage for time series runs.
  6. Once everything runs, we can push a release. If @moongoal is so kind as to hand over the PyPI project, this can be put under sktime governance; otherwise, we take another release name.
  7. After this, we can integrate the estimators from scikit-rebate. Because the testing framework will already be established, we just plug the new estimators in and repeat the test/bugfix cycle per estimator.

How does this sound? I estimate perhaps a few days of work for someone with good python experience; a newcomer might need a few weeks since it would also mean getting to grips with package development, testing, etc.
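The conformance test in step 4 is essentially this one call (using sklearn's own `StandardScaler` as a stand-in for the revived estimators):

```python
# sklearn's API conformance suite: one call runs dozens of checks and
# raises on the first failure
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

check_estimator(StandardScaler())
print("all sklearn conformance checks passed")
```

In a pytest suite, the `parametrize_with_checks` decorator from the same module can expose each individual check as its own test case, which makes the expected debug/fix cycle easier to track.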

fkiraly commented 3 weeks ago

(also happy to personally help with setting up stuff like testing, if there's someone else also committing to contribute)

ryanurbs commented 3 weeks ago

Hello, I led the creation of the above-mentioned REBATE and scikit-rebate packages while in the lab of Dr. Moore (GitHub EpistasisLab). I've since started my own lab and some new projects; our GitHub page is: https://github.com/UrbsLab.

We haven't worked on REBATE in a while, but we actually just laid out plans to update and expand the scikit-rebate package further over the upcoming months. We were planning to release these updates through my own lab's GitHub page at the cloned repository https://github.com/EpistasisLab/scikit-rebate, and to sync these with the version on EpistasisLab (with whom we still collaborate). Presently, we don't have any specific plans to adapt the Relief-based algorithms included in REBATE to time series data (which seems to be the focus of your package), as these Relief algorithms are designed as feature importance estimation and/or feature selection algorithms for structured/tabular data.

If you have questions about these algorithms or want to follow up on some type of collaboration feel free to message me or email me at ryan.urbanowicz@cshs.org.

fkiraly commented 3 weeks ago

That's good to hear!

From a software engineering perspective, though, I wonder whether it is the best idea to have one repository per algorithm. That must be a maintenance nightmare, and in general it seems impossible to keep all the repos up to date as packages like sklearn and numpy get updated - this is probably also why some of these packages (not scikit-rebate, but the others cited) ended up abandoned or unmaintained.

May I hence suggest a single index and test framework for feature importance algorithms, similar to what sktime has? See here for the index: https://www.sktime.net/en/latest/estimator_overview.html

Behind the scenes, there is a single test framework that always highlights when one of the individual algorithms or repositories drops out of the compatibility horizon, and the algorithms can be owned by research groups, companies, and don't even need to be in the main repo. (note the "authors" column in the index, these are mostly third parties)

See also this issue: https://github.com/sktime/sktime/issues/6639

Would you - and other people in this issue - be interested in creating this for the variety of feature selection algorithms mentioned here? They would be useful for time series, and are directly applicable through tabularization.

The next steps would still be the same, although we would also add an indexing and testing framework to the new repo, and perhaps name it slightly more generally to indicate that it is a collaboration.

fkiraly commented 3 weeks ago

PS @ryanurbs, I gather from your profile that you might be interested in skpro, a package for tabular probabilistic regression and individual survival predictions? https://github.com/sktime/skpro

Same principle, algorithms can be first party or third party.

ryanurbs commented 3 weeks ago

My personal specialty is ML and AI algorithm design and development (for biomedicine), not so much software development, so I'll point your suggestions/comments to my team as we consider how best to update scikit-rebate and our planned algorithmic expansions. We are also interested in, and actively developing, methods for survival prediction in right-censored survival data, so I'll also check out your skpro package. Thanks!

fkiraly commented 3 weeks ago

So, interested in collaborating on the software infrastructure? Sounds like a great synergy!

Re survival prediction, as mentioned, the packages in the sktime ecosystem (such as skpro) follow the principle that algorithms can be added to the index, where lots of people can find them when they search for them, while the authors retain ownership, possibly even in separate packages.

ryanurbs commented 3 weeks ago

Potentially, yeah - I guess it depends more specifically on what you have in mind. We are mainly focused on further developing and improving the algorithms in the REBATE framework, and plan to keep working on them within the current repositories on EpistasisLab and UrbsLab for the time being, since that's where we cite the code in our related publications.

fkiraly commented 3 weeks ago

The same concept as with skpro and other repos: one "index" repo with systematic testing, while the original algorithms remain in their "home" repositories, where they can be cited and versioned via papers or Zenodo. This makes it easier for users such as @ncooder to find an algorithm based on type, tags, properties, date of release, etc.

So, the algorithm repositories would be separate if they are maintained, and the algorithms from repositories that are no longer maintained can be moved, or the repositories "revived" along the above lines. The upstream repositories would also get notifications, from continuous integration, if their algorithm fails with the newest versions of numpy, python or the like.

Interested in contributing? I can make a start with this more general concept in mind.

ncooder commented 3 weeks ago

@fkiraly I may not fully understand the process we want to follow. In principle, my idea was to implement an improved version of those feature selectors that is independent of external packages. I see there is a way to include them as third-party software. So if you could please create a place in sktime where I can commit these selectors, that would be great. Thanks.

fkiraly commented 3 weeks ago

yes, good point. Let's start with a simplified plan.

How about you start adding the transformers in a new folder, sktime.transformations.tabular? Just put them in one or multiple files in there, e.g., relief.py, for now.

I can see how these would be useful in a time series setting. And I am happy to add tests for sklearn transformers there, since it's the first such instance.

What do you think? Here's the sklearn extension template: https://github.com/scikit-learn-contrib/project-template/blob/main/skltemplate/_template.py
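Following that template, a selector skeleton for such a file might look like this (the class name and the variance-based stub scoring are placeholders, not the real Relief logic):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y

class ReliefSelector(TransformerMixin, BaseEstimator):
    """Skeleton of a supervised feature selector in the sklearn API.

    The variance-based scoring below is a placeholder for the real
    Relief weight computation.
    """

    def __init__(self, n_features_to_select=2):
        self.n_features_to_select = n_features_to_select

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # placeholder scoring; a real implementation computes Relief weights
        self.feature_weights_ = X.var(axis=0)
        order = np.argsort(self.feature_weights_)[::-1]
        self.support_ = np.sort(order[: self.n_features_to_select])
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X[:, self.support_]

X = np.column_stack([np.arange(10.0), np.zeros(10), 10 * np.arange(10.0)])
y = np.arange(10) % 2
Xt = ReliefSelector(n_features_to_select=2).fit(X, y).transform(X)
print(Xt.shape)  # (10, 2)
```

Keeping the fitted state in trailing-underscore attributes (`feature_weights_`, `support_`) and validating inputs in `fit`/`transform` is what the sklearn conformance checks will look for.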

ncooder commented 3 weeks ago

@fkiraly Many thanks. I will submit a PR with the code, likely this week. I really like your approach, but I am still quite new to GitHub and not able to do anything more sophisticated at this moment. I would really appreciate it if you could message me using the email I provided if you want to pursue something more advanced. Danke!

fkiraly commented 3 weeks ago

Sure! If you're new to open source, I'd say, let's just get started with this small project and take it from there! Looking forward to your PR.