Optimise `HashingEncoder` for both large and small dataframes

bkhant1 commented 9 months ago

I used the HashingEncoder recently and found weird that any call to fit or transform, even for a dataframe with only 10s of rows and a couple of columns took at least 2s...

I also had quite a large amount of data to encode, and that took a long time.

That got me started on improving the performance of HashingEncoder, and here's the result! There are quite a few changes in there, each individual change should be in it's own commit, and here's a summary of the performance gain on my machine (macOS Monteray, i7 2.3ghz).

	Baseline	Numpy arrays instead of apply	Shared memory instead of queue	Fork instead of spawn	Faster hashlib usage
n_rows=30 n_features=3 n_components=10 n_process=4	3.55 s ± 150 ms per loop (mean ± std. dev. of ...	3.62 s ± 140 ms per loop (mean ± std. dev. of ...	2.2 s ± 41.6 ms per loop (mean ± std. dev. of ...	56.6 ms ± 2.91 ms per loop (mean ± std. dev. o...	47.3 ms ± 516 µs per loop (mean ± std. dev. of...
n_rows=30 n_features=3 n_components=10 n_process=1	1.24 s ± 52.6 ms per loop (mean ± std. dev. of...	1.42 s ± 170 ms per loop (mean ± std. dev. of ...	1.74 ms ± 32.2 µs per loop (mean ± std. dev. o...	2.08 ms ± 91.7 µs per loop (mean ± std. dev. o...	1.86 ms ± 173 µs per loop (mean ± std. dev. of...
n_rows=30 n_features=3 n_components=100 n_process=1	1.22 s ± 51.5 ms per loop (mean ± std. dev. of...	1.33 s ± 60.7 ms per loop (mean ± std. dev. of...	1.73 ms ± 29.7 µs per loop (mean ± std. dev. o...	2.01 ms ± 148 µs per loop (mean ± std. dev. of...	2.01 ms ± 225 µs per loop (mean ± std. dev. of...
n_rows=10000 n_features=10 n_components=10 n_process=4	5.45 s ± 85.8 ms per loop (mean ± std. dev. of...	5.36 s ± 57.5 ms per loop (mean ± std. dev. of...	2.23 s ± 39.6 ms per loop (mean ± std. dev. of...	120 ms ± 3.02 ms per loop (mean ± std. dev. of...	96.4 ms ± 2.33 ms per loop (mean ± std. dev. o...
n_rows=10000 n_features=10 n_components=10 n_process=1	1.61 s ± 30.1 ms per loop (mean ± std. dev. of...	1.45 s ± 27.2 ms per loop (mean ± std. dev. of...	227 ms ± 6.03 ms per loop (mean ± std. dev. of...	236 ms ± 3.06 ms per loop (mean ± std. dev. of...	170 ms ± 1.35 ms per loop (mean ± std. dev. of...
n_rows=100000 n_features=10 n_components=10 n_process=4	5.99 s ± 215 ms per loop (mean ± std. dev. of ...	5.71 s ± 148 ms per loop (mean ± std. dev. of ...	4.8 s ± 25.4 ms per loop (mean ± std. dev. of ...	836 ms ± 42.3 ms per loop (mean ± std. dev. of...	622 ms ± 33.2 ms per loop (mean ± std. dev. of...
n_rows=100000 n_features=10 n_components=10 n_process=1	5.38 s ± 53 ms per loop (mean ± std. dev. of 7...	3.73 s ± 56.5 ms per loop (mean ± std. dev. of...	2.25 s ± 57.4 ms per loop (mean ± std. dev. of...	3.76 s ± 1.61 s per loop (mean ± std. dev. of ...	1.68 s ± 19.9 ms per loop (mean ± std. dev. of...
n_rows=1000000 n_features=50 n_components=10 n_process=4	50.8 s ± 1.17 s per loop (mean ± std. dev. of ...	56.4 s ± 2.11 s per loop (mean ± std. dev. of ...	37.1 s ± 576 ms per loop (mean ± std. dev. of ...	36.9 s ± 2.19 s per loop (mean ± std. dev. of ...	26.6 s ± 1.8 s per loop (mean ± std. dev. of 7...
n_rows=1000000 n_features=50 n_components=10 n_process=1	2min 22s ± 2.05 s per loop (mean ± std. dev. o...	2min 19s ± 3.08 s per loop (mean ± std. dev. o...	1min 47s ± 1.15 s per loop (mean ± std. dev. o...	2min 10s ± 18.4 s per loop (mean ± std. dev. o...	1min 21s ± 1.67 s per loop (mean ± std. dev. o...

The notebook that produced that table can be found here

Proposed Changes

The changes are listed by commit.

Add a simple non-regression HashEncoder test

To make sure I am not breaking it.

In HashingEncoder process the df as a numpy array instead of using apply

It has no direct impact on performance, however it allows accessing the memory layout of the dataframe directly. That allows using shared memory to communicate between processes instead of a data queue, which does improve performance.

In HashEncoder use shared memory instead of queue for multiproccessing

It is faster to write directly in memory that to have to data transit through a queue.

The multiprocessing method is similar to what it was with queues: the dataframe is split into chunks, and each process applies the hashing trick to its chunk of the dataframe. Instead of writting the result to a queue, it writes it directly in a shared memory segment, that is also the underlying memory of a numpy array that is used to build the output dataframe.

Allow forking processes instead of spwaning them and make it default

This makes the HashEncoder transform method a lot faster on small datasets.

The spawn process creation method creates a new python interpreter from scratch, and re-import all required module. In a minimal case (pandas and category_encoders.hashing only are imported) this adds a ~2s overhead to any call to transform.

Fork creates a copy of the current process, and that's it. It is unsafe to use with threads, locks, file descriptors, ... but in that case the only thing the forked process will do is process some data and write it to ITS OWN segment of a shared memory. It is a lot faster as pandas doesn't have to be re-imported (around 20ms?)

It might take up more memory as more than the necessary variables (the largest one by far being the HashEncoder instance, which include the user dataframe) will be copied. Add the option to use spawn instead of fork to potentially save some memory.

Remove python 2 check code and faster use of hashlib

Python 2 is not supported on master, the check isn't useful.

Create int indexes from hashlib bytes digest instead of hex digest as it's faster.

Call the md5 hashlib constructor directly instead of new('md5'), which is also faster.

PaulWestenthanner commented 9 months ago

one more question: you've removed all the data-lock stuff for multi-processing, right. Was this not necessary?

bkhant1 commented 9 months ago

Hi @PaulWestenthanner ! Thanks for your review!

one more question: you've removed all the data-lock stuff for multi-processing, right. Was this not necessary?

It was necessary because all processes were writting to the same queue. With shared memory they write to their own section of the shared memory, so no there is no concurrency to manage.

I think I replied to all your comments/addressed them.

The one outstanding thing is that you suggested:

Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array. Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?

It's a little more complex than that because return doesn't really work accross processes (hence the use of queues or shared memory), but I remembered that python has a nice API, ProcessPoolExecutor, (that uses queues behind the scenes) to make it look like it's possible to return something from a process.

Here's a PR in my fork showing what I tried. It could be a little slower but the code is cleaner. I think I should add this commit to this PR but let me know what you think!

	Baseline	Before review	After review	Multiproc pool instead of SHM
n_rows=30 n_features=3 n_components=10 n_process=4	3.12 s ± 20.2 ms per loop (mean ± std. dev. of...	53.9 ms ± 443 µs per loop (mean ± std. dev. of...	55 ms ± 1.21 ms per loop (mean ± std. dev. of ...	59.8 ms ± 989 µs per loop (mean ± std. dev. of...
n_rows=30 n_features=3 n_components=10 n_process=1	1.1 s ± 17 ms per loop (mean ± std. dev. of 7 ...	1.56 ms ± 45.5 µs per loop (mean ± std. dev. o...	1.5 ms ± 44.8 µs per loop (mean ± std. dev. of...	1.52 ms ± 39.8 µs per loop (mean ± std. dev. o...
n_rows=30 n_features=3 n_components=100 n_process=1	1.1 s ± 14.5 ms per loop (mean ± std. dev. of ...	1.52 ms ± 25.2 µs per loop (mean ± std. dev. o...	1.77 ms ± 317 µs per loop (mean ± std. dev. of...	1.49 ms ± 21.7 µs per loop (mean ± std. dev. o...
n_rows=10000 n_features=10 n_components=10 n_process=4	4.75 s ± 36 ms per loop (mean ± std. dev. of 7...	99.3 ms ± 649 µs per loop (mean ± std. dev. of...	122 ms ± 13 ms per loop (mean ± std. dev. of 7...	113 ms ± 992 µs per loop (mean ± std. dev. of ...
n_rows=10000 n_features=10 n_components=10 n_process=1	1.51 s ± 13.5 ms per loop (mean ± std. dev. of...	156 ms ± 2.67 ms per loop (mean ± std. dev. of...	165 ms ± 17.6 ms per loop (mean ± std. dev. of...	146 ms ± 1.79 ms per loop (mean ± std. dev. of...
n_rows=100000 n_features=10 n_components=10 n_process=4	5.53 s ± 30.6 ms per loop (mean ± std. dev. of...	584 ms ± 6.74 ms per loop (mean ± std. dev. of...	576 ms ± 25.2 ms per loop (mean ± std. dev. of...	572 ms ± 12.4 ms per loop (mean ± std. dev. of...
n_rows=100000 n_features=10 n_components=10 n_process=1	5.22 s ± 55.3 ms per loop (mean ± std. dev. of...	1.55 s ± 11 ms per loop (mean ± std. dev. of 7...	1.48 s ± 12.6 ms per loop (mean ± std. dev. of...	1.46 s ± 32.7 ms per loop (mean ± std. dev. of...
n_rows=1000000 n_features=50 n_components=10 n_process=4	53.2 s ± 2.35 s per loop (mean ± std. dev. of ...	23.5 s ± 256 ms per loop (mean ± std. dev. of ...	22.3 s ± 200 ms per loop (mean ± std. dev. of ...	25.6 s ± 1.6 s per loop (mean ± std. dev. of 7...

PaulWestenthanner commented 9 months ago

Thanks for the changes. Looks really cool now. I'd prefer the clean ProcessPoolExecutor implementation. Then we'll be able to merge!

PaulWestenthanner commented 9 months ago

could you maybe also add a high level summary to the changelog in the unreleased section (https://github.com/scikit-learn-contrib/category_encoders/blob/master/CHANGELOG.md)?

bkhant1 commented 8 months ago

I added the ProcessPoolExecutor commit to this branch and updated the changelog!

PaulWestenthanner commented 8 months ago

Looks good! Thanks for this improvement! Makes it both more readable and faster

scikit-learn-contrib / category_encoders