Closed bkhant1 closed 8 months ago
one more question: you've removed all the data-lock stuff for multi-processing, right. Was this not necessary?
Hi @PaulWestenthanner ! Thanks for your review!
one more question: you've removed all the data-lock stuff for multi-processing, right. Was this not necessary?
It was necessary because all processes were writting to the same queue. With shared memory they write to their own section of the shared memory, so no there is no concurrency to manage.
I think I replied to all your comments/addressed them.
The one outstanding thing is that you suggested:
Wouldn't it be easier if the hash_chunk function would hash a chunk and return an array. Then it wouldn't need the shm_result and shm_offset parameters (what does shm stand for btw?). Then you'd just concatenate all the chunks in the end?
It's a little more complex than that because return
doesn't really work accross processes (hence the use of queues or shared memory), but I remembered that python has a nice API, ProcessPoolExecutor, (that uses queues behind the scenes) to make it look like it's possible to return something from a process.
Here's a PR in my fork showing what I tried. It could be a little slower but the code is cleaner. I think I should add this commit to this PR but let me know what you think!
Baseline | Before review | After review | Multiproc pool instead of SHM | |
---|---|---|---|---|
n_rows=30 n_features=3 n_components=10 n_process=4 | 3.12 s ± 20.2 ms per loop (mean ± std. dev. of... | 53.9 ms ± 443 µs per loop (mean ± std. dev. of... | 55 ms ± 1.21 ms per loop (mean ± std. dev. of ... | 59.8 ms ± 989 µs per loop (mean ± std. dev. of... |
n_rows=30 n_features=3 n_components=10 n_process=1 | 1.1 s ± 17 ms per loop (mean ± std. dev. of 7 ... | 1.56 ms ± 45.5 µs per loop (mean ± std. dev. o... | 1.5 ms ± 44.8 µs per loop (mean ± std. dev. of... | 1.52 ms ± 39.8 µs per loop (mean ± std. dev. o... |
n_rows=30 n_features=3 n_components=100 n_process=1 | 1.1 s ± 14.5 ms per loop (mean ± std. dev. of ... | 1.52 ms ± 25.2 µs per loop (mean ± std. dev. o... | 1.77 ms ± 317 µs per loop (mean ± std. dev. of... | 1.49 ms ± 21.7 µs per loop (mean ± std. dev. o... |
n_rows=10000 n_features=10 n_components=10 n_process=4 | 4.75 s ± 36 ms per loop (mean ± std. dev. of 7... | 99.3 ms ± 649 µs per loop (mean ± std. dev. of... | 122 ms ± 13 ms per loop (mean ± std. dev. of 7... | 113 ms ± 992 µs per loop (mean ± std. dev. of ... |
n_rows=10000 n_features=10 n_components=10 n_process=1 | 1.51 s ± 13.5 ms per loop (mean ± std. dev. of... | 156 ms ± 2.67 ms per loop (mean ± std. dev. of... | 165 ms ± 17.6 ms per loop (mean ± std. dev. of... | 146 ms ± 1.79 ms per loop (mean ± std. dev. of... |
n_rows=100000 n_features=10 n_components=10 n_process=4 | 5.53 s ± 30.6 ms per loop (mean ± std. dev. of... | 584 ms ± 6.74 ms per loop (mean ± std. dev. of... | 576 ms ± 25.2 ms per loop (mean ± std. dev. of... | 572 ms ± 12.4 ms per loop (mean ± std. dev. of... |
n_rows=100000 n_features=10 n_components=10 n_process=1 | 5.22 s ± 55.3 ms per loop (mean ± std. dev. of... | 1.55 s ± 11 ms per loop (mean ± std. dev. of 7... | 1.48 s ± 12.6 ms per loop (mean ± std. dev. of... | 1.46 s ± 32.7 ms per loop (mean ± std. dev. of... |
n_rows=1000000 n_features=50 n_components=10 n_process=4 | 53.2 s ± 2.35 s per loop (mean ± std. dev. of ... | 23.5 s ± 256 ms per loop (mean ± std. dev. of ... | 22.3 s ± 200 ms per loop (mean ± std. dev. of ... | 25.6 s ± 1.6 s per loop (mean ± std. dev. of 7... |
Thanks for the changes. Looks really cool now.
I'd prefer the clean ProcessPoolExecutor
implementation. Then we'll be able to merge!
could you maybe also add a high level summary to the changelog in the unreleased section (https://github.com/scikit-learn-contrib/category_encoders/blob/master/CHANGELOG.md)?
I added the ProcessPoolExecutor
commit to this branch and updated the changelog!
Looks good! Thanks for this improvement! Makes it both more readable and faster
I used the HashingEncoder recently and found weird that any call to
fit
ortransform
, even for a dataframe with only 10s of rows and a couple of columns took at least 2s...I also had quite a large amount of data to encode, and that took a long time.
That got me started on improving the performance of HashingEncoder, and here's the result! There are quite a few changes in there, each individual change should be in it's own commit, and here's a summary of the performance gain on my machine (macOS Monteray, i7 2.3ghz).
The notebook that produced that table can be found here
Proposed Changes
The changes are listed by commit.
Add a simple non-regression HashEncoder test
To make sure I am not breaking it.
In HashingEncoder process the df as a numpy array instead of using apply
It has no direct impact on performance, however it allows accessing the memory layout of the dataframe directly. That allows using shared memory to communicate between processes instead of a data queue, which does improve performance.
In HashEncoder use shared memory instead of queue for multiproccessing
It is faster to write directly in memory that to have to data transit through a queue.
The multiprocessing method is similar to what it was with queues: the dataframe is split into chunks, and each process applies the hashing trick to its chunk of the dataframe. Instead of writting the result to a queue, it writes it directly in a shared memory segment, that is also the underlying memory of a numpy array that is used to build the output dataframe.
Allow forking processes instead of spwaning them and make it default
This makes the HashEncoder transform method a lot faster on small datasets.
The spawn process creation method creates a new python interpreter from scratch, and re-import all required module. In a minimal case (pandas and category_encoders.hashing only are imported) this adds a ~2s overhead to any call to transform.
Fork creates a copy of the current process, and that's it. It is unsafe to use with threads, locks, file descriptors, ... but in that case the only thing the forked process will do is process some data and write it to ITS OWN segment of a shared memory. It is a lot faster as pandas doesn't have to be re-imported (around 20ms?)
It might take up more memory as more than the necessary variables (the largest one by far being the HashEncoder instance, which include the user dataframe) will be copied. Add the option to use spawn instead of fork to potentially save some memory.
Remove python 2 check code and faster use of hashlib
Python 2 is not supported on master, the check isn't useful.
Create int indexes from hashlib bytes digest instead of hex digest as it's faster.
Call the md5 hashlib constructor directly instead of new('md5'), which is also faster.