Closed pford221 closed 5 years ago
Yes, that'd be great! A separate PR might be better, though. I really appreciate the extra help.
@mgckind
I'm a bit at a loss. When I looked at the source code, I saw the for loop in the `__init__` of `iForest` that makes the trees, and I thought it looked ripe for parallelization over multiple processes. However, as I've profiled the CPU utilization on both a Windows and an Ubuntu machine, I've noticed that a single Python process is able to use all the CPU available across all the cores on the machine without any explicit parallelization.
What this means is that explicit parallelization is not building the forest any faster (in fact, it's a bit slower because of the overhead of the processes). This is great, but it feels like a learning moment for me. Do you know why/how this is happening? Is it a property of the recursive nature of this loop (repeatedly calling `make_tree`)? I've been googling like crazy and haven't found any hints that recursion would lead to this result, but it's the only thing I can come up with.
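For anyone who wants to repeat the comparison, here is a minimal sketch of the per-tree parallelization described above. It is not eif's code: `build_toy_tree` and `build_forest` are hypothetical names, the tree uses plain axis-parallel splits as a stand-in for the real `make_tree`, and only the standard library is used. Each tree gets its own seed, so the forest is the same whether it is built serially or in a process pool.

```python
import random
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def build_toy_tree(seed, points, limit):
    """Hypothetical stand-in for one tree build (axis-parallel splits only)."""
    rng = random.Random(seed)  # per-tree RNG keeps results reproducible

    def grow(subset, depth):
        # Leaf: stop at the depth limit or when the point is isolated.
        if depth >= limit or len(subset) <= 1:
            return len(subset)
        dim = rng.randrange(len(subset[0]))
        lo = min(p[dim] for p in subset)
        hi = max(p[dim] for p in subset)
        if lo == hi:  # nothing to split on in this dimension
            return len(subset)
        cut = rng.uniform(lo, hi)
        left = [p for p in subset if p[dim] < cut]
        right = [p for p in subset if p[dim] >= cut]
        return (dim, cut, grow(left, depth + 1), grow(right, depth + 1))

    return grow(points, 0)


def build_forest(points, n_trees, limit, n_jobs=1, seed=0):
    """Build n_trees toy trees, serially or across n_jobs processes."""
    seeds = [seed + i for i in range(n_trees)]
    if n_jobs == 1:
        return [build_toy_tree(s, points, limit) for s in seeds]
    # Each worker receives a pickled copy of `points`; that copying is part
    # of the process overhead mentioned above.
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(build_toy_tree, seeds, repeat(points), repeat(limit)))
```

To compare, time `build_forest(points, 100, 8, n_jobs=1)` against `n_jobs=4` with `time.perf_counter()`, calling it from under an `if __name__ == "__main__":` guard (required on Windows, where worker processes are spawned rather than forked). Whether the pool wins depends on how much work each tree does relative to the cost of shipping the data to the workers.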
@pford221 Using a simple system monitor on Ubuntu 18.04, I have noticed that only 1 core seems to be used when using this version of eif.
I then looked at the forks and found isezen's version. With it, I could execute `forest.compute_paths(X_in=data, n_jobs=8)` and achieve full CPU usage.
Hello,
For high-dimensional datasets, I'm finding that multiprocessing parallelization can speed things up a bit. I also find that storing the original data in each `Node` and each `iTree` consumes a lot of needless memory. Would you be open to reviewing a pull request (or requests) that addressed both of these items? If so, would you accept them bundled together as one PR, or would you like them separated?
Thanks
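To make the memory point concrete, here is a rough sketch of a node that keeps only what scoring needs (the split and the number of points that reached it) and lets the data subsets go out of scope once the children are built. `LeanNode`, `make_lean_tree`, and `path_length` are hypothetical names, not eif's classes, and the splits are axis-parallel for brevity rather than the ones eif actually uses.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeanNode:
    """Tree node that stores split parameters only, never the data."""
    size: int                          # points that reached this node
    dim: Optional[int] = None          # split dimension (None for a leaf)
    threshold: Optional[float] = None  # split value (None for a leaf)
    left: Optional["LeanNode"] = None
    right: Optional["LeanNode"] = None


def make_lean_tree(points, depth, limit, rng):
    """Recursively build a tree; the subsets are local and not kept."""
    n = len(points)
    if depth >= limit or n <= 1:
        return LeanNode(size=n)
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # all points identical along this dimension
        return LeanNode(size=n)
    threshold = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < threshold]
    right = [p for p in points if p[dim] >= threshold]
    return LeanNode(
        size=n,
        dim=dim,
        threshold=threshold,
        left=make_lean_tree(left, depth + 1, limit, rng),
        right=make_lean_tree(right, depth + 1, limit, rng),
    )


def path_length(node, x):
    """Number of edges from the root to the leaf that x falls into."""
    depth = 0
    while node.dim is not None:
        node = node.left if x[node.dim] < node.threshold else node.right
        depth += 1
    return depth
```

The usual leaf-size adjustment to the path length is left out here; since each leaf still records `size`, it could be added without keeping any of the data.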