scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Improve hash-encoder performance - avoid unnecessary Series construction #388

Closed bkhant1 closed 1 year ago

bkhant1 commented 1 year ago

I used the hash-encoder the other day and was surprised by how long it takes for a theoretically quite lightweight algorithm.
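To illustrate why the algorithm should be cheap, here is a minimal pure-Python sketch of the hashing trick applied to one row at a time. It is not the library's code: `hash_row`, the bucket count of 8, and the choice of `sha256` are assumptions made for illustration only.

```python
import hashlib

def hash_row(values, n_components=8, hash_method="sha256"):
    """Hashing trick for one row (hypothetical helper, not the library's API).

    Each categorical value is hashed and increments one of
    n_components integer buckets.
    """
    buckets = [0] * n_components
    for value in values:
        if value is None:
            continue  # skip missing values in this sketch
        digest = hashlib.new(hash_method, str(value).encode("utf-8")).hexdigest()
        buckets[int(digest, 16) % n_components] += 1
    return buckets

# Three toy rows with two categorical columns each.
rows = [["red", "small"], ["blue", "large"], ["red", "large"]]
encoded = [hash_row(row) for row in rows]
```

Per row, the work is one hash and one list increment per value, so any large per-row overhead has to come from somewhere else.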

I looked into it in a bit more detail, generating a random dataframe with 100,000 rows to measure performance. It turns out that in that case more than 50% of the time is spent in the pandas.Series constructor. Here's my notebook if you are curious!
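The kind of overhead described here can be reproduced with a small sketch that runs the same row function twice through `DataFrame.apply`: once returning a `pandas.Series` per row and once returning a raw list. This is an illustration under assumptions, not the encoder's actual implementation: `hash_row_list`, `hash_row_series`, the `crc32` hash, and the dataframe shape are all made up for the comparison.

```python
import time
import zlib

import numpy as np
import pandas as pd

N_COMPONENTS = 8
COLS = [f"col_{i}" for i in range(N_COMPONENTS)]

def hash_row_list(values):
    """Hash every value of a row into N_COMPONENTS buckets, returned as a list."""
    buckets = [0] * N_COMPONENTS
    for v in values:
        buckets[zlib.crc32(str(v).encode("utf-8")) % N_COMPONENTS] += 1
    return buckets

def hash_row_series(values):
    """Same computation, but wraps the result in a new Series for every row."""
    return pd.Series(hash_row_list(values), index=COLS)

# Small random categorical dataframe (shape chosen arbitrarily for the sketch).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 50, size=(2000, 3)).astype(str),
                  columns=["a", "b", "c"])

start = time.perf_counter()
via_series = df.apply(hash_row_series, axis=1)
t_series = time.perf_counter() - start

start = time.perf_counter()
# result_type="expand" turns each returned list into columns.
via_list = df.apply(hash_row_list, axis=1, result_type="expand")
via_list.columns = COLS
t_list = time.perf_counter() - start

print(f"Series per row: {t_series:.3f}s, raw list per row: {t_list:.3f}s")
```

Both paths produce the same encoded values; only the per-row return type differs. Note that `apply(axis=1)` still passes each input row as a Series in both cases, so the sketch isolates just the cost of building the output Series. Timings will vary by machine and pandas version.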

Proposed Changes

Result

Here's the performance comparison table:

| Dataframe #columns, #rows / #nb_components, #nb_process | pandas.Series | Raw list |
|---|---|---|
| 3, 30 / 10, 4 | 5.05s | 5.23s |
| 3, 30 / 10, 1 | 2.1s | 2.08s |
| 3, 30 / 100, 1 | 2.09s | 2.21s |
| 10, 10k / 10, 4 | 7.94s | 7.59s |
| 10, 10k / 10, 1 | 3.5s | 2.31s |
| 50, 1000k / 10, 4 | 1min30s | 51.4s |
| 50, 1000k / 10, 1 | 4min8s | 2min |

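For larger dataframes, it cuts the transform time roughly in half (4min8s down to 2min with a single process, 1min30s down to 51.4s with four). For the small dataframes, the difference is negligible. Here's the notebook those results are from.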

Follow-up

I thought I would start with this change as it's a 2-liner!

But there are still some things I would want to look into:

PaulWestenthanner commented 1 year ago

Hi, thanks for your effort and this very detailed pull request. The errors in the pipeline are my fault (connected to #384). I'll try to sort it out this week and have your PR merged as well. Looks like a great improvement already.

bkhant1 commented 1 year ago

Sounds good, thanks for taking the time to review!

PaulWestenthanner commented 1 year ago

@bkhant1 I've just merged the fixes in master. Can you please pull master and see if the pipeline still fails?

bkhant1 commented 1 year ago

Just done it - the workflow needs approval to run.

PaulWestenthanner commented 1 year ago

LGTM! I'll merge it. Sorry again for the broken pipeline in the first place. Looking forward to seeing more contributions.