vinkaga closed 7 years ago
I agree. Where do you see this? StackNet does not remove outliers by default; the only thing it does is scaling. You are responsible for controlling over-fitting through other parameters within the algorithms. For example, in linear models you can use C (regularisation).
Do you mean something else?
I see the "After removing outliers" message in the console output of make_stacknet_data.py, as follows:
+ python make_stacknet_data.py
Re-reading properties file ...
sys:1: DtypeWarning: Columns (21,22) have mixed types. Specify dtype option on import or set low_memory=False.
Processing data for XGBoost ...
Shape train: (90275, 119)
Shape test: (2985217, 119)
After removing outliers:
shapes of dataset 2 (88528, 119) (88528,) (2985217, 119)
printing dataset2_train.txt
data lenth 3652911
indices lenth 3652911
indptr lenth 88529
...
OK, I see. The Python script was simply copied from one of the public kernels that used this step. Feel free to create your own dataset without it.
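For anyone building their own dataset, here is a minimal sketch of making such a filter optional. It assumes the outlier step drops training rows whose target falls outside a fixed range. The function names, the `drop_outliers` flag, and the threshold values are hypothetical and are not taken from make_stacknet_data.py.

```python
def remove_target_outliers(rows, targets, lower, upper):
    """Keep only the rows whose target lies strictly between lower and upper."""
    kept_rows, kept_targets = [], []
    for row, y in zip(rows, targets):
        if lower < y < upper:
            kept_rows.append(row)
            kept_targets.append(y)
    return kept_rows, kept_targets


def build_train_set(rows, targets, drop_outliers=True, lower=-0.4, upper=0.4):
    """Return the training set, with or without the outlier filter.

    The default thresholds are placeholders (an assumption); pick values
    that suit your own target distribution.
    """
    if not drop_outliers:
        return list(rows), list(targets)
    return remove_target_outliers(rows, targets, lower, upper)


# Toy data: four training rows, two of which have extreme targets.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
targets = [0.02, -0.9, 0.1, 1.5]

filtered_rows, filtered_targets = build_train_set(rows, targets)
full_rows, full_targets = build_train_set(rows, targets, drop_outliers=False)
print(len(filtered_rows), len(full_rows))
```

In the log above, the training set shrinks from 90275 to 88528 rows while the test set stays at 2985217, which is consistent with a filter applied to the training rows only. Writing out both variants lets you check on a validation split whether the dropped rows carried useful signal.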
Is there a reason for removing outliers? Could it throw away valuable information?