scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Memory increase of WOEEncoder for newer category_encoders version #364

Open Piecer-plc opened 2 years ago

Piecer-plc commented 2 years ago

Memory increase of WOEEncoder for category_encoders version >=2.0.0

Hi, I noticed another memory issue, this time with WOEEncoder. I reported a similar bug in #335; the two reports differ in the encoder used and in the dataset. To keep the two encoders separate, I am filing this as a new bug report.

Expected Behavior

Similar memory usage across category_encoders versions

Actual Behavior

According to my experiments, with category_encoders 2.0.0 or later, the memory usage of weight_enc.fit(train[weight_encode], train['target']) increases from 58 MB to more than 200 MB.

Memory (MB)   Version
209           2.3.0
209           2.2.2
209           2.1.0
209           2.0.0
58            1.3.0

Steps to Reproduce the Problem

Step 1: Download the dataset

train.zip

Step 2: Install category_encoders

pip install category_encoders==<version>   # e.g. 2.3.0 or 1.3.0; no spaces around ==

Step 3: Run the script below under each category_encoders version and record the memory usage

import tracemalloc

import pandas as pd
import category_encoders as ce

train = pd.read_csv('train.csv')

# Column groups; all of them are passed to WOEEncoder below
object_col_label = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
one_hot_encode = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4']
target_encode = ['nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']
weight_encode = target_encode + ['ord_4', 'ord_5', 'ord_3'] + one_hot_encode + object_col_label

weight_enc = ce.woe.WOEEncoder(cols=weight_encode)

# Start tracing right before fit() so only its allocations are counted
tracemalloc.start()
weight_enc.fit(train[weight_encode], train['target'])
current3, peak3 = tracemalloc.get_traced_memory()
print(f"WOEEncoder.fit memory usage: {current3 / 1024 / 1024:.1f} MB; peak: {peak3 / 1024 / 1024:.1f} MB")
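To make the measurement easier to repeat across versions, the tracing pattern above could be wrapped in a small helper. This is only a sketch: `traced_memory_mb` and `allocate_10mb` are hypothetical names, and the workload is a stand-in buffer allocation rather than the real `weight_enc.fit` call, so that the snippet runs without the dataset.

```python
import tracemalloc


def traced_memory_mb(func, *args, **kwargs):
    """Run func and return (result, current_mb, peak_mb) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing so later measurements start from a clean state
        tracemalloc.stop()
    mb = 1024 * 1024
    return result, current / mb, peak / mb


def allocate_10mb():
    # Stand-in workload (assumption): allocate ~10 MB, return only its length
    buf = bytearray(10 * 1024 * 1024)
    return len(buf)


result, current_mb, peak_mb = traced_memory_mb(allocate_10mb)
print(f"current={current_mb:.1f} MB, peak={peak_mb:.1f} MB")
```

With the dataset loaded, the same helper would presumably be called as `traced_memory_mb(weight_enc.fit, train[weight_encode], train['target'])`. Note that tracemalloc only reports allocations made through Python's memory allocators while tracing is active.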

Specifications

Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
Platform: Ubuntu 16.04
OS: Ubuntu
CPU: Intel(R) Core(TM) i9-9900K CPU
GPU: TITAN V

glevv commented 2 years ago

This happens because WOEEncoder relies on OrdinalEncoder internally, and OrdinalEncoder copies the input data: https://github.com/scikit-learn-contrib/category_encoders/blob/6a13c14919d56fed8177a173d4b3b82c5ea2fef5/category_encoders/ordinal.py#L186
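As a rough illustration of why an input copy shows up in the tracemalloc peak, here is a standard-library analogy. It does not call category_encoders: `fit_with_copy` and `fit_without_copy` are hypothetical stand-ins, and the bytearray plays the role of the input DataFrame. It only shows the mechanism, not that the copy accounts for the whole difference reported above.

```python
import tracemalloc


def fit_with_copy(data):
    # Hypothetical stand-in for a step that defensively copies its input
    working = bytearray(data)  # full copy of the input buffer
    return len(working)


def fit_without_copy(data):
    # Same stand-in, but reading the input in place
    return len(data)


def peak_mb(func, data):
    """Peak traced memory (in MB) while func(data) runs."""
    tracemalloc.start()
    try:
        func(data)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


# ~50 MB input, allocated before tracing starts so it is not counted itself
data = bytearray(50 * 1024 * 1024)

copy_peak = peak_mb(fit_with_copy, data)
no_copy_peak = peak_mb(fit_without_copy, data)
print(f"with copy: {copy_peak:.1f} MB, without copy: {no_copy_peak:.1f} MB")
```

The copying variant peaks at roughly the size of its input, while the in-place variant allocates almost nothing, which matches the pattern of a fit() whose traced peak grows with the size of the data passed in.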

bmreiniger commented 1 year ago

(When) do we actually need to copy inputs?