kaz-Anova / StackNet

StackNet is a computational, scalable and analytical Meta modelling framework
MIT License

Why remove outliers? #32

Closed vinkaga closed 7 years ago

vinkaga commented 7 years ago

Is there a reason for removing outliers? Could it throw away valuable information?

kaz-Anova commented 7 years ago

I agree. Where do you see this? StackNet does not remove outliers by default; the only preprocessing it does is scaling. You are responsible for controlling over-fitting through other parameters within the algorithms. For example, in linear models you can use C (regularisation).
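To illustrate the regularisation point, here is a minimal sketch in plain Python, not StackNet code. It fits a one-feature ridge regression in closed form and shows that a stronger penalty shrinks the fitted weight, which limits how far a single extreme target can pull the model. The function name `ridge_weight`, the penalty `lam` and the toy data are all made up for illustration. How a library's C parameter maps to the penalty strength differs between libraries, so check the documentation of the one you use.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form weight for 1-D ridge regression without an intercept.

    Minimises sum((y - w * x) ** 2) + lam * w ** 2, which gives
    w = sum(x * y) / (sum(x * x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Toy data (hypothetical): the last target is an extreme value.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 30.0]

# A larger penalty gives a smaller weight, so the extreme point has less influence.
for lam in (0.0, 10.0, 100.0):
    print(lam, round(ridge_weight(xs, ys, lam), 3))
```

All rows stay in the training set here; only the penalty changes how strongly the extreme value affects the fit.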

Do you mean something else?

vinkaga commented 7 years ago

I see an "After removing outliers" message in the console output of make_stacknet_data.py, as follows:

+ python make_stacknet_data.py

Re-reading properties file ...
sys:1: DtypeWarning: Columns (21,22) have mixed types. Specify dtype option on import or set low_memory=False.

Processing data for XGBoost ...
Shape train: (90275, 119)
Shape test: (2985217, 119)
After removing outliers:
 shapes of dataset 2  (88528, 119) (88528,) (2985217, 119)
 printing dataset2_train.txt
 data lenth 3652911
 indices lenth 3652911
 indptr lenth 88529
...
kaz-Anova commented 7 years ago

Ok, I see. The Python script was just copied from one of the public kernels, which removed outliers. Feel free to create your own dataset without that step.
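For anyone building their own dataset: in the log above the training set goes from 90275 to 88528 rows, so the filter drops 1747 rows. Below is a rough sketch of how that step could be made optional. It is not the code from make_stacknet_data.py; the helper `filter_by_target`, the thresholds and the toy data are hypothetical and only show the idea of filtering training rows by their target value.

```python
def filter_by_target(rows, targets, low=None, high=None):
    """Keep rows whose target lies strictly between low and high.

    A bound set to None is not applied, so calling this with both bounds
    left as None returns every row, i.e. no outlier removal.
    """
    kept_rows, kept_targets = [], []
    for row, target in zip(rows, targets):
        if low is not None and target <= low:
            continue
        if high is not None and target >= high:
            continue
        kept_rows.append(row)
        kept_targets.append(target)
    return kept_rows, kept_targets


# Toy data (hypothetical): two of the four targets are extreme.
rows = [[1, 0], [0, 1], [1, 1], [0, 0]]
targets = [0.02, -0.9, 0.05, 1.7]

trimmed = filter_by_target(rows, targets, low=-0.4, high=0.4)  # drops the extremes
full = filter_by_target(rows, targets)                         # keeps everything

print(len(trimmed[0]), len(full[0]))
```

Only the training rows are filtered in this sketch, which matches the log: the test shape stays at (2985217, 119).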