sahandha / eif

Extended Isolation Forest for Anomaly Detection

PR for Parallelization and Reduce Memory #9

Closed: pford221 closed this issue 5 years ago

pford221 commented 5 years ago

Hello,

For high-dimensional datasets, I'm finding that multi-processing parallelization can speed things up a bit. I also find that storing the original data in each Node and each iTree consumes a lot of needless memory. Would you be open to reviewing a pull request (or two) that addresses both of these items? If so, would you accept them bundled together as one PR, or would you like them separated?
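For the memory side, this is roughly the kind of change I have in mind (an illustrative sketch with hypothetical field names, not the current eif classes): keep only the split parameters, depth, and children in each node, rather than a copy of the data subsample.

```python
# Illustrative sketch only -- field names are hypothetical, not the current eif code.
class Node:
    """A node that stores the random splitting hyperplane (normal vector n,
    intercept p), its depth e, and its children, instead of the data slice."""
    def __init__(self, n, p, e, left=None, right=None):
        self.n = n          # normal vector of the splitting hyperplane
        self.p = p          # intercept point of the hyperplane
        self.e = e          # depth of this node in the tree
        self.left = left    # subtree for points with (x - p) . n <= 0
        self.right = right  # subtree for points with (x - p) . n > 0
        # Note: no self.X -- the training subsample is not retained after splitting.
        # An external node would still need the *count* of points that ended up
        # there (for the path-length correction), but not the points themselves.
```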

Thanks

mgckind commented 5 years ago

Yes, that'd be great! Separate PRs might be better, though. I really appreciate the extra help.

pford221 commented 5 years ago

@mgckind

I'm a bit at a loss. When I looked at the source code, I saw the for loop in the __init__ of iForest that builds the trees, and I thought it looked ripe for parallelization across multiple processes. However, as I've profiled CPU utilization on both a Windows and an Ubuntu machine, I've noticed that a single Python process is able to use all the CPU available across all the cores of the machine without any explicit parallelization.

What this means is that explicit parallelization does not build the forest any faster (in fact, it's a bit slower because of the overhead of spawning the processes). This is great, but it feels like a learning moment for me. Do you know why or how this is happening? Is it a property of the recursive nature of this loop (repeatedly calling make_tree)? I've been googling like crazy and haven't found any hints that recursion would lead to this result, but it's the only explanation I can come up with.
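For reference, this is roughly the structure of what I tried; the names and signatures below are illustrative, not the actual eif internals:

```python
# Rough sketch of the explicit-parallelization attempt (illustrative names and
# signatures, not the actual eif internals): grow each tree in its own process.
import numpy as np
from functools import partial
from multiprocessing import Pool

def build_one_tree(seed, X, sample_size, limit, make_tree):
    """Worker: subsample the data and grow a single isolation tree."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), sample_size, replace=False)
    return make_tree(X[idx], 0, limit)   # recursive tree builder; signature is hypothetical

def build_forest(X, ntrees, sample_size, limit, make_tree, processes=4):
    worker = partial(build_one_tree, X=X, sample_size=sample_size,
                     limit=limit, make_tree=make_tree)
    # Each worker needs its own (pickled) copy of X; that start-up and copying
    # overhead is what made this slower than the plain serial loop for me.
    with Pool(processes=processes) as pool:
        return pool.map(worker, range(ntrees))
```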

yorickvanzweeden commented 5 years ago

@pford221 Using a simple system monitor on Ubuntu 18.04, I noticed that only one core seems to be used with this version of eif.

I then looked at the forks and found isezen's version.

With that fork, I could execute forest.compute_paths(X_in=data, n_jobs=8) and achieve full CPU usage.
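For anyone who wants a similar effect without switching forks, the sketch below gives a rough idea of how the path computation can be fanned out over processes with joblib; it is only an illustration, not isezen's actual implementation, and compute_paths_parallel is a hypothetical helper.

```python
# Illustration only -- not isezen's implementation. This just fans compute_paths
# out over row chunks with joblib so all cores get used.
import numpy as np
from joblib import Parallel, delayed

def compute_paths_parallel(forest, X, n_jobs=8):
    chunks = np.array_split(X, n_jobs)            # split the rows into n_jobs blocks
    scores = Parallel(n_jobs=n_jobs)(             # one worker per block
        delayed(forest.compute_paths)(X_in=chunk) for chunk in chunks
    )
    return np.concatenate(scores)                 # scores back in original row order
```

For example, compute_paths_parallel(forest, data, n_jobs=8) should keep all eight cores busy during scoring.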