scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Poor performance of OneHotEncoder for category_encoders version >=2.0.0 #362

Open DSOTM-pf opened 2 years ago

DSOTM-pf commented 2 years ago

Expected Behavior

Similar memory usage across the different category_encoders versions, or better performance in the newer versions.

Actual Behavior

According to the experiment results, memory usage is roughly three times higher for category_encoders versions 2.0.0 and above:

Memory (MB)    Version
896            2.3.0
896            2.2.2
896            2.1.0
896            2.0.0
288            1.3.0

Steps to Reproduce the Problem

Step 1: download the dataset above (train & test, 63 MB).

Step 2: install category_encoders:

pip install category_encoders==<version>

Step 3: for each category_encoders version, run the script below and record the memory usage:

import numpy as np
import pandas as pd
import category_encoders as ce
import tracemalloc

# Load the train/test splits and keep only the categorical columns cat0..cat9
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_train.drop("id", axis=1, inplace=True)
df_test.drop("id", axis=1, inplace=True)
cat_labels = [f"cat{i}" for i in range(10)]

# Trace all allocations made while fitting and applying the encoder
tracemalloc.start()
onehot_encoder = ce.one_hot.OneHotEncoder()
onehot_encoder.fit(pd.concat([df_train[cat_labels], df_test[cat_labels]], axis=0))
train_ohe = onehot_encoder.transform(df_train[cat_labels])
test_ohe = onehot_encoder.transform(df_test[cat_labels])

current3, peak3 = tracemalloc.get_traced_memory()
print(f"OneHotEncoder memory usage is {current3 / 1024 / 1024} MB; peak memory was {peak3 / 1024 / 1024} MB")

Specifications

PaulWestenthanner commented 2 years ago

Thanks for that issue report. Off the top of my head I'd guess this is because most encoders convert the input to DataFrames and create a deep copy of it. Maybe this wasn't the case yet in the old versions. I'd need some time to check whether this is really the reason. I'm also not sure whether the deep copies can safely be removed; there was probably a reason to add them in the first place.
If you want to investigate it, feel free; otherwise I'll have a look and keep you posted.
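To illustrate the suspected mechanism, here is a minimal standalone sketch (not the encoder's actual code path): a deep copy of the input DataFrame allocates a second full set of column arrays, while a shallow copy does not.

import pandas as pd
import tracemalloc

# Hypothetical illustration of the deep-copy hypothesis: the DataFrame is
# created before tracing starts, so only the copies themselves are traced.
df = pd.DataFrame({"x": range(5_000_000)})  # roughly 40 MB of int64 data

tracemalloc.start()
shallow = df.copy(deep=False)  # shares the underlying arrays
_, peak_shallow = tracemalloc.get_traced_memory()
tracemalloc.reset_peak()       # requires Python 3.9+
deep = df.copy(deep=True)      # duplicates the underlying arrays
_, peak_deep = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak after shallow copy: {peak_shallow / 1024 / 1024:.1f} MB")
print(f"peak after deep copy:    {peak_deep / 1024 / 1024:.1f} MB")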

DSOTM-pf commented 2 years ago

Hi, thanks for your quick reply! I have observed the same memory usage issue with WOEEncoder (https://github.com/scikit-learn-contrib/category_encoders/issues/364). As for the root cause of the memory increase in these two encoders, I tried to locate it in the code changes introduced in version 2.0.0, but the diff is too large for that approach to be practical. I will take your suggestion and check whether the deep copy is responsible.

PaulWestenthanner commented 2 years ago

https://github.com/scikit-learn-contrib/category_encoders/blob/5e9e803c9131b377af305d5302723ba2415001da/category_encoders/one_hot.py#L340

That should be the relevant line.
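One way to verify this without bisecting the code base (a sketch using only the standard tracemalloc API) is to take a snapshot after transform and group live allocations by source line; if the deep copy is the culprit, that category_encoders line should appear near the top:

import tracemalloc
import pandas as pd
import category_encoders as ce

df_train = pd.read_csv("train.csv")
cat_labels = [f"cat{i}" for i in range(10)]

tracemalloc.start()
encoder = ce.one_hot.OneHotEncoder()
encoder.fit(df_train[cat_labels])
train_ohe = encoder.transform(df_train[cat_labels])

# Attribute live allocations to the source lines that made them; a deep
# copy inside the library would surface here as a large allocation.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)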