koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] Adding the MRMR (Maximum Relevance Minimum Redundancy) feature selection #553

Closed: Garve closed this issue 8 months ago

Garve commented 1 year ago

Hi!

The feature selection methods that scikit-learn offers are quite naive. MRMR seems like a more advanced and reasonable approach to select informative and non-redundant features, as described here.

Long story short:

  1. Pick a feature that is most informative in some metric (e.g. F-statistic).
  2. Pick the next feature that is very informative but doesn't correlate too much with the previously selected feature (e.g. using the average absolute Pearson correlation between the candidate feature and the feature selected in step 1).
  3. Pick the next feature that is very informative, but doesn't correlate with the previous 2 features too much.
  4. Pick the next feature that is very informative, but doesn't correlate with the previous 3 features too much.
  5. (repeat until K features selected)

Here, K, the information measure, and the correlation measure can all be specified by the user.
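To make the greedy procedure concrete, here is a minimal sketch of the loop, assuming the F-statistic as the information measure and the mean absolute Pearson correlation as the redundancy measure (the "quotient" scoring variant); `mrmr_select` and all details are illustrative, not an existing API:

```python
# Minimal sketch of the greedy MRMR loop, assuming F-statistic relevance and
# mean absolute Pearson correlation redundancy ("quotient" scoring variant).
# `mrmr_select` is an illustrative name, not an existing API.
import numpy as np
from sklearn.feature_selection import f_classif


def mrmr_select(X, y, k):
    """Greedily pick k column indices of X: maximum relevance, minimum redundancy."""
    # Constant features yield NaNs in both measures; treat them as uninformative.
    relevance = np.nan_to_num(f_classif(X, y)[0])                # score per feature
    corr = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))   # |Pearson| between features

    selected = [int(np.argmax(relevance))]   # step 1: most informative feature
    while len(selected) < k:
        remaining = [i for i in range(X.shape[1]) if i not in selected]
        # Redundancy: mean |correlation| with the features selected so far
        # (small epsilon guards against division by zero).
        redundancy = corr[np.ix_(remaining, selected)].mean(axis=1) + 1e-12
        scores = relevance[remaining] / redundancy   # high relevance, low redundancy
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```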

koaning commented 1 year ago

Doesn't the link that you provide also mention an existing Python package that already implements this behavior? If so, does it really need to appear in two packages?

Garve commented 1 year ago

Ah, the package looks more like a prototype; it's just a simple function. The main drawback is that it's not scikit-learn compatible.

I was thinking more of something like what you find in the feature_selection subpackage of scikit-learn, such as VarianceThreshold, SelectKBest, ...
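For reference, selectors in that subpackage follow the SelectorMixin API; a hypothetical skeleton of what an MRMR selector could look like (the class name is mine, not a proposal of the final API, and `mrmr_select` refers to the sketch above):

```python
# Hypothetical skeleton of a scikit-learn compatible MRMR selector; the class
# name is illustrative, and `mrmr_select` refers to the sketch above.
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin
from sklearn.utils.validation import check_X_y


class MRMRSelector(SelectorMixin, BaseEstimator):
    def __init__(self, k=10):
        self.k = k

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        self.selected_idx_ = mrmr_select(X, y, self.k)  # greedy MRMR loop
        return self

    def _get_support_mask(self):
        # SelectorMixin derives transform() and get_support() from this mask.
        mask = np.zeros(self.n_features_in_, dtype=bool)
        mask[self.selected_idx_] = True
        return mask
```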

Garve commented 1 year ago

That's the original implementation by the author of the TDS article: https://github.com/smazzanti/mrmr/blob/main/mrmr/main.py

However, I see it as more of an educational implementation.

koaning commented 1 year ago

I see. In that case, I'm certainly open to the idea, but it'd help if there's a compelling example for the docs: one that shows the merits of the approach compared to other methods.

Garve commented 1 year ago

I can ask the author if it's OK to use one of his examples. He also wrote a nice article about the MNIST dataset, showing that it's possible to use only 5% of the pixels to get 95% test accuracy.

Long story short:

Here are the top pixels (features) according to some famous algorithms:

[Image: top pixels selected per method. Image by Samuele Mazzanti @smazzanti, the author.]

The darker pixels are more important than the lighter ones. For every method except MRMR, the most important pixels are clustered together, even though a single pixel from each cluster would probably suffice, since it already captures enough information about that area. In the MRMR image, the first important pixels are more widespread and cover a wider area. While I think this is already nice, the author also created a comparison in terms of accuracy.

[Image: accuracy comparison of the selection methods. Image by Samuele Mazzanti @smazzanti, the author.]
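As a rough sketch of how such an experiment could be reproduced end to end, here is a pipeline using scikit-learn's small digits dataset as a stand-in for MNIST and the hypothetical MRMRSelector from above; the numbers will differ from the article:

```python
# Rough reproduction sketch using the small digits dataset as a stand-in for
# MNIST and the hypothetical MRMRSelector from above; results will differ.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)        # 8x8 images, 64 pixel features
pipe = make_pipeline(
    MRMRSelector(k=16),                    # keep 16 of the 64 pixels
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(pipe, X, y, cv=5).mean())
```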

koaning commented 1 year ago

If the tutorial can be passed along, I think I'm open to it.

Just to check, @MBrouns any concerns?

MBrouns commented 1 year ago

No concerns here!

fabioscantamburlo commented 10 months ago

Hello! Are you still interested in having this feature? I may give implementing it a try. 👨🏼‍💻

Garve commented 10 months ago

Heya! I definitely am 😄

FBruzzesi commented 8 months ago

Closed by #622; this will be added in the next release.