Closed Garve closed 8 months ago
Doesn't the link that you provide also mention an existing Python package that already implements this behavior? If so, does it really need to appear in two packages?
Ah, the package looks more like a prototype; it's just a simple function. The main drawback is that it's not scikit-learn compatible.
I thought more of something like what you find in the feature_selection subpackage of scikit-learn, such as VarianceThreshold, SelectKBest, ...
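For reference, here is a minimal sketch of what "scikit-learn compatible" means for a feature selector: a class with `fit`/`transform` that plugs into pipelines, typically built on `SelectorMixin`. The class name `TopKSelector` and the choice of the F-statistic as the score are illustrative only, not part of any proposal here.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin, f_classif
from sklearn.utils.validation import check_X_y, check_is_fitted


class TopKSelector(SelectorMixin, BaseEstimator):
    """Toy selector: keep the k features with the highest F-statistic.

    Illustrative sketch of the scikit-learn selector interface;
    SelectorMixin supplies transform() from _get_support_mask().
    """

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # Score each feature against the target (F-test here, arbitrarily).
        scores, _ = f_classif(X, y)
        self.scores_ = scores
        return self

    def _get_support_mask(self):
        check_is_fitted(self, "scores_")
        # Boolean mask that keeps the k highest-scoring features.
        mask = np.zeros_like(self.scores_, dtype=bool)
        mask[np.argsort(self.scores_)[-self.k:]] = True
        return mask
```

Once a selector implements this interface, it works inside `Pipeline`, `GridSearchCV`, etc., which is what separates it from a standalone function.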
That's the original implementation by the author of the TDS article: https://github.com/smazzanti/mrmr/blob/main/mrmr/main.py
However, I see it as more educational.
I see. In that case, I'm certainly open to the idea, but it'd help if there's a compelling example for the docs. One example that shows the merits of the approach compared to other methods.
I can ask the author if it's ok for him to use one of his. He also wrote a nice article about the MNIST dataset, showing that it's possible to use only 5% of the pixels to get a 95% test accuracy.
Here are the top pixels (features) according to some famous algorithms: (Image by Samuele Mazzanti @smazzanti, the author)
The darker pixels are more important than the lighter ones. Here you can see that for all methods except MRMR, the most important pixels are clustered together, even though a single pixel from each cluster would likely suffice, as it captures enough information about that area. In the MRMR image, you can see that the first important pixels are more widespread, covering a wider area. While I think that this is nice already, the author also created a comparison in terms of accuracy.
(Image by Samuele Mazzanti @smazzanti, the author)
If the tutorial can be passed along I think I'm open to it.
Just to check, @MBrouns any concerns?
No concerns here!
Hello! Are you still interested in having this feature? I may give it a try to implement it. 👨🏼‍💻
Heya! I definitely am 😄
Closed by #622, will be added in the next release
Hi!
The only feature selection methods that scikit-learn offers are quite naive. MRMR seems like a more advanced and reasonable approach for selecting informative, non-redundant features, as described here.
Long story short:
Here, K, the measure of information, and the measure of correlation can all be specified by the user.
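To make the idea concrete, here is a hedged sketch of the greedy MRMR loop: at each step, pick the feature with the best ratio of relevance (information about the target) to redundancy (correlation with already-selected features). The specific choices below (F-statistic for relevance, mean absolute Pearson correlation for redundancy) are just one common instantiation, not the only one; as noted above, both measures and K are meant to be configurable.

```python
import numpy as np
from sklearn.feature_selection import f_classif


def mrmr_select(X, y, K):
    """Greedy MRMR sketch: select K feature indices maximizing
    relevance / redundancy. Illustrative, not an official implementation."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # Relevance: F-statistic of each feature against the target.
    relevance, _ = f_classif(X, y)
    # Redundancy basis: pairwise absolute Pearson correlations.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    while len(selected) < K:
        remaining = [f for f in range(n_features) if f not in selected]
        # Redundancy: mean |correlation| with the already-selected features.
        redundancy = corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] / np.maximum(redundancy, 1e-12)
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

The quotient form (relevance divided by redundancy) is one of several MRMR variants; a difference form (relevance minus redundancy) is also common.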