koaning / scikit-lego

Extra blocks for scikit-learn pipelines.
https://koaning.github.io/scikit-lego/
MIT License

[FEATURE] Preprocessing - IQR Transformer #717

Open fabioscantamburlo opened 2 days ago

fabioscantamburlo commented 2 days ago

Hello,

In my Kaggle journey I quite often use the IQR technique to replace out-of-scale values with predefined or data-driven values.

I already have a scikit-learn compatible implementation of such a method, which I use in pipelines to easily validate my models with KFold.

I think it would be a waste of code not to include this feature in Sklego, so I'm proposing it to the community. :people_holding_hands:

Use case scenario:

import pandas as pd
import numpy as np 

data = {
    'A': np.random.randint(10, 20, size=10),
    'B': np.random.randint(100, 200, size=10),
    'C': np.random.randint(50, 80, size=10),
    'D': np.random.randint(1, 3, size=10)
}
df = pd.DataFrame(data)
df = pd.concat([df, pd.DataFrame({
    # Adding by hand some out of scale values 
    'A': [300, -100],
    'B': [1200, -200],
    'C': [360, -10],
    'D': [30, -40]
    })], axis=0)

df.to_numpy()
array([[  11,  168,   62,    1],
       [  12,  154,   64,    2],
       [  16,  156,   76,    2],
       [  10,  176,   50,    2],
       [  19,  121,   57,    2],
       [  14,  130,   73,    1],
       [  17,  107,   56,    1],
       [  12,  184,   67,    1],
       [  17,  139,   60,    1],
       [  18,  128,   54,    2],
       [ 300, 1200,  360,   30],
       [-100, -200,  -10,  -40]])

In this example I decide to fill the out-of-scale values with the column mean (computed excluding the values detected as outliers by IQR). After the transformation:

array([[ 11. , 189. ,  77. ,   1. ],
       [ 14. , 151. ,  50. ,   1. ],
       [ 10. , 177. ,  53. ,   1. ],
       [ 19. , 197. ,  63. ,   1. ],
       [ 19. , 146. ,  65. ,   2. ],
       [ 10. , 189. ,  62. ,   2. ],
       [ 10. , 197. ,  54. ,   1. ],
       [ 19. , 146. ,  56. ,   1. ],
       [ 14. , 162. ,  69. ,   1. ],
       [ 12. , 148. ,  75. ,   2. ],
       [ 13.8, 170.2,  62.4,   1.3],
       [ 13.8, 170.2,  62.4,   1.3]])
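
For concreteness, here is a minimal sketch of what such a transformer could look like. The class name IQROutlierFiller, the factor parameter, and the mean-of-inliers fill strategy are placeholders of mine for illustration, not a settled API:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class IQROutlierFiller(BaseEstimator, TransformerMixin):
    """Replace values outside the IQR fences with the column mean of the inliers.

    Placeholder name and API, just to illustrate the idea; `factor` is the usual
    1.5 multiplier on the IQR.
    """

    def __init__(self, factor=1.5):
        self.factor = factor

    def fit(self, X, y=None):
        X = check_array(X, dtype=np.float64)
        q1, q3 = np.percentile(X, [25, 75], axis=0)
        iqr = q3 - q1
        self.lower_ = q1 - self.factor * iqr
        self.upper_ = q3 + self.factor * iqr
        # Fill value per column: mean of the values inside the fences
        inlier_mask = (X >= self.lower_) & (X <= self.upper_)
        self.fill_value_ = np.array(
            [X[inlier_mask[:, j], j].mean() for j in range(X.shape[1])]
        )
        return self

    def transform(self, X):
        check_is_fitted(self, ["lower_", "upper_", "fill_value_"])
        X = check_array(X, dtype=np.float64, copy=True)
        outlier_mask = (X < self.lower_) | (X > self.upper_)
        # Broadcast the per-column fill values over the flagged cells
        X[outlier_mask] = np.broadcast_to(self.fill_value_, X.shape)[outlier_mask]
        return X

In a pipeline it would plug in like any other scikit-learn step, e.g. make_pipeline(IQROutlierFiller(), Ridge()), and cross-validate with KFold as usual.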

Do you think such a feature would add value to the lego toolkit?

koaning commented 1 day ago

I am not super sure what you mean when you refer to the IQR technique. Could you elaborate on it and also explain why it is so beneficial in ML pipelines?

fabioscantamburlo commented 20 hours ago

Hello Vincent!

Yeah happy to do that.

The IQR trick: the idea is to use the following approach (Credits). For each column, compute the first and third quartiles Q1 and Q3 and the interquartile range IQR = Q3 - Q1; any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged as an outlier.
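
On a single column the rule boils down to a couple of lines of numpy (a toy illustration of the fences, using values like the first column of the example above):

import numpy as np

x = np.array([11, 12, 16, 10, 19, 14, 17, 12, 17, 18, 300, -100])

# First and third quartile, and the interquartile range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# The usual 1.5 * IQR fences
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Mask of the values flagged as outliers (here: 300 and -100)
outliers = (x < lower) | (x > upper)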

It is a rather simple methodology, but in some cases I think it is a nice starting point for getting rid of some crazy values in the data without needing specific domain knowledge about the features.

In the proposed transformer, the idea is just to take the "IQR identified outliers" and replace them with some specific values.

The RobustScaler in scikit-learn is built around more or less the same idea: it centres on the median and scales by the IQR, so the values flagged as outliers hardly influence the scaling, although they are rescaled rather than replaced.
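
For reference, this is roughly the comparison I have in mind, reusing the df from the first comment (RobustScaler rescales but keeps the extreme values; the proposed transformer would replace them):

from sklearn.preprocessing import RobustScaler

# Centres each column on its median and scales by its IQR (25th-75th percentile range),
# so the two injected outliers barely affect the fitted statistics, but they stay in the data.
scaled = RobustScaler().fit_transform(df)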