Open sampan501 opened 3 months ago
I think in terms of sequential experiments to run:
1. RandomForestClassifier in scikit-learn vs RandomForestClassifier in scikit-tree, in just n_samples vs time to fit, with n_jobs=1 vs n_jobs=-1. If this doesn't look good, it means for sure our compiler is messed up somehow, or we introduced some serious issues in the fork that we're not aware of.
2. HonestForestClassifier with DTC from sklearn vs DTC from scikit-tree, to determine if HonestForest introduces this issue somehow.
Within each of the above, we would have to investigate CPU/RAM usage in-depth using valgrind, or something...
3. CoMIGHT before changes in #242
4. CoMIGHT after changes in #242
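For what it's worth, the first experiment (fit time vs n_samples at n_jobs=1 and n_jobs=-1) could be set up roughly like the sketch below. The helper names (time_fit, benchmark, run_sklearn_experiment), the sample sizes, and the synthetic data are all assumptions made for illustration, not what the linked test script does. Only the scikit-learn side is filled in; the scikit-tree forest would be added to the same dict under whatever import path the installed version exposes.

```python
import time


def time_fit(make_estimator, X, y, n_repeats=3):
    """Return the best wall-clock fit time (seconds) over n_repeats fresh estimators."""
    best = float("inf")
    for _ in range(n_repeats):
        est = make_estimator()  # fresh, unfitted estimator for every repeat
        start = time.perf_counter()
        est.fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(factories, sample_sizes, make_data, n_repeats=3):
    """Time every named estimator factory at every sample size.

    factories: dict mapping a label to a zero-argument estimator factory.
    make_data: callable taking n_samples and returning (X, y).
    Returns {label: [fit time for each entry of sample_sizes]}.
    """
    results = {label: [] for label in factories}
    for n in sample_sizes:
        X, y = make_data(n)
        for label, factory in factories.items():
            results[label].append(time_fit(factory, X, y, n_repeats))
    return results


def run_sklearn_experiment():
    """Hypothetical driver for the scikit-learn half of experiment 1."""
    # Imported lazily so the timing helpers above work without scikit-learn.
    from functools import partial
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    factories = {
        "sklearn n_jobs=1": partial(RandomForestClassifier, n_estimators=100, n_jobs=1),
        "sklearn n_jobs=-1": partial(RandomForestClassifier, n_estimators=100, n_jobs=-1),
        # Assumption: register the scikit-tree forest here in the same way.
    }

    def make_data(n):
        return make_classification(n_samples=n, n_features=20, random_state=0)

    return benchmark(factories, [1_000, 4_000, 16_000], make_data)
```

If n_jobs=-1 is not clearly faster than n_jobs=1 as n_samples grows for one library but is for the other, that points at the parallel path rather than the tree-building code itself.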
To confirm, this is not an isolated issue with CoMIGHT, right? Or so far it is?
It is not.
We ran some tests, and after the fix Adam pushed, the fit times (in seconds) for RF and sktree-RF are:
Fit time for RandomForestClassifier: 3.522181987762451
Fit time for RandomForestClassifier: 3.4983439445495605
Fit time for RandomForestClassifier: 3.518531084060669
Fit time for RandomForestClassifier: 3.5076229572296143
Fit time for RandomForestClassifier: 3.5162460803985596
Fit time for sktreeRandomForestClassifier: 3.697654962539673
Fit time for sktreeRandomForestClassifier: 3.660207986831665
Fit time for sktreeRandomForestClassifier: 3.6615519523620605
Fit time for sktreeRandomForestClassifier: 3.6803948879241943
Fit time for sktreeRandomForestClassifier: 3.653079032897949
Note: the fit time for sktree-RF was over 7 seconds prior to this fix.
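To put a number on the remaining gap, here is a quick summary of the ten timings reported above. The values are copied from the output in this comment; the variable names are just for illustration.

```python
from statistics import mean, stdev

# Fit times in seconds, copied from the output above.
sklearn_times = [
    3.522181987762451,
    3.4983439445495605,
    3.518531084060669,
    3.5076229572296143,
    3.5162460803985596,
]
sktree_times = [
    3.697654962539673,
    3.660207986831665,
    3.6615519523620605,
    3.6803948879241943,
    3.653079032897949,
]

sklearn_mean = mean(sklearn_times)
sktree_mean = mean(sktree_times)

# Relative slowdown of sktree-RF with respect to sklearn RF.
overhead = sktree_mean / sklearn_mean - 1.0

print(f"sklearn RF: {sklearn_mean:.3f} s (sd {stdev(sklearn_times):.3f})")
print(f"sktree RF:  {sktree_mean:.3f} s (sd {stdev(sktree_times):.3f})")
print(f"sktree-RF overhead: {overhead:.1%}")
```

So after the fix sktree-RF is about 4.5% slower than sklearn RF on this test, with very little run-to-run spread in either.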
The script for this test is at: https://github.com/neurodata/might/blob/cmi/exps/new_submission/Figure6_comight_vs_nsamples_ndims/test_rf_parallel.py
The commit that we tested to get ~3.7 sec on sktree-RF was: https://github.com/neurodata/scikit-tree/pull/242/commits/7c756776fbaaf7b97390d30004322edf53f3c29d
wooot!!
Description
There is occasional low CPU usage when using scikit-tree forests in parallel. Running the same code on machines with many cores, I'm getting roughly 4-5% CPU usage with scikit-tree forests versus 60-70% with scikit-learn for the same types of problems. We should look into scikit-learn's Cython code optimizations and see how we can make improvements to our code base.
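One rough way to reproduce the CPU usage numbers without watching a system monitor is to compare CPU time against wall-clock time around the fit call. The sketch below is an assumption about how one might measure it, not how the figures above were obtained; cpu_utilization is a hypothetical helper. Note that time.process_time only counts threads of the current process, so it suits thread-based parallelism; work done in child processes would not be counted.

```python
import os
import time


def cpu_utilization(fn):
    """Run fn() and return (wall_seconds, cpu_seconds, utilization).

    utilization is CPU time divided by (wall time * number of cores),
    so 1.0 means every core was busy for the whole call.
    """
    n_cores = os.cpu_count() or 1
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over this process's threads
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    utilization = cpu / (wall * n_cores) if wall > 0 else 0.0
    return wall, cpu, utilization


# Stand-in workload: a single-threaded pure-Python loop.
wall, cpu, util = cpu_utilization(lambda: sum(i * i for i in range(200_000)))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s utilization={util:.1%}")
```

To apply it to a forest, pass something like `lambda: clf.fit(X, y)` for each of the scikit-learn and scikit-tree estimators and compare the utilization values on the same machine.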