neurodata / scikit-tree

Scikit-learn compatible decision trees beyond those offered in scikit-learn
https://docs.neurodata.io/scikit-tree/dev/index.html

Optimizations for scikit-tree to improve multi-core performance #245

Open sampan501 opened 3 months ago

sampan501 commented 3 months ago

Description

CPU usage is occasionally low when running scikit-tree forests in parallel. Running the same code on machines with many cores, I'm getting roughly 4-5% CPU usage with scikit-tree forests versus 60-70% with scikit-learn for the same types of problems. We should look into scikit-learn's Cython code optimizations and see how we can make improvements to our code base.
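
One rough way to put a number on the utilization gap described above is to compare the CPU time a process consumes against the wall-clock time of a call. The sketch below uses only the standard library; the helper name `cpu_utilization` is hypothetical and not part of scikit-tree or scikit-learn. It assumes the parallel work happens in threads of the current process, since `time.process_time` does not count CPU time spent in child processes.

```python
import os
import time


def cpu_utilization(fn, *args, **kwargs):
    """Run fn once and report how many cores it kept busy on average.

    Hypothetical helper for illustration only. CPU time is summed over all
    threads of the current process; child processes are not counted.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    # Guard against a zero interval for extremely short calls.
    wall = max(time.perf_counter() - wall_start, 1e-9)
    n_cores = os.cpu_count() or 1
    return {
        "wall_s": wall,
        "cpu_s": cpu,
        # Average number of cores busy during the call.
        "cores_busy": cpu / wall,
        # Same quantity as a fraction of the whole machine.
        "machine_fraction": cpu / wall / n_cores,
    }
```

Passing something like `lambda: forest.fit(X, y)` as `fn` would give a `machine_fraction` that can be compared between the two libraries on the same data, as a cross-check on what a system monitor shows.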

adam2392 commented 3 months ago

I think these are the experiments to run, in order:

  1. RandomForestClassifier in scikit-learn vs. RandomForestClassifier in scikit-tree, comparing n_samples vs. time to fit with n_jobs=1 vs. n_jobs=-1.

If this doesn't look good, it almost certainly means our compiler setup is messed up somehow, or we introduced some serious issues in the fork that we're not aware of.

  2. Wrap HonestForestClassifier with DTC (DecisionTreeClassifier) from sklearn vs. DTC from scikit-tree, to determine whether HonestForest introduces this issue somehow.

Within each of the above, we would have to investigate CPU/RAM usage in depth using valgrind or something similar.
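
The first experiment above could be sketched roughly as follows. This is a minimal illustration, not the script used in this thread: `time_fit` and `compare_n_jobs` are hypothetical helper names, and the data sizes and `n_estimators` are arbitrary assumptions. The forest class is passed in as an argument so the same harness can time either library's RandomForestClassifier.

```python
import time
from statistics import mean


def time_fit(make_estimator, X, y, n_repeats=5):
    """Return wall-clock fit times in seconds, one per freshly built estimator."""
    times = []
    for _ in range(n_repeats):
        est = make_estimator()
        start = time.perf_counter()
        est.fit(X, y)
        times.append(time.perf_counter() - start)
    return times


def compare_n_jobs(forest_cls, sample_sizes=(1_000, 10_000), n_features=20,
                   n_estimators=100):
    """Mean fit time for each (n_samples, n_jobs) pair, n_jobs in {1, -1}.

    Hypothetical harness; sizes and n_estimators are illustrative assumptions.
    """
    # Imported here so time_fit stays usable without scikit-learn installed.
    from sklearn.datasets import make_classification

    results = {}
    for n_samples in sample_sizes:
        X, y = make_classification(
            n_samples=n_samples, n_features=n_features, random_state=0
        )
        for n_jobs in (1, -1):
            times = time_fit(
                lambda: forest_cls(
                    n_estimators=n_estimators, n_jobs=n_jobs, random_state=0
                ),
                X,
                y,
            )
            results[(n_samples, n_jobs)] = mean(times)
    return results
```

Calling `compare_n_jobs(RandomForestClassifier)` once with the scikit-learn class and once with the scikit-tree class would give two tables to compare. If n_jobs=-1 is not clearly faster than n_jobs=1 for one of them at larger n_samples, that points at the parallel path of that library.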

sampan501 commented 3 months ago

[Two attached images: image (1), image]

sampan501 commented 3 months ago

[Image: CoMIGHT before changes in #242]

[Image: CoMIGHT after changes in #242]

adam2392 commented 3 months ago

To confirm: this is not an issue isolated to CoMIGHT, right? Or so far it is?

sampan501 commented 3 months ago

it is not

SUKI-O commented 3 months ago

We ran some tests, and after the fix Adam pushed, the fit times for RF and sktree-RF are:

Fit time for RandomForestClassifier: 3.522181987762451
Fit time for RandomForestClassifier: 3.4983439445495605
Fit time for RandomForestClassifier: 3.518531084060669
Fit time for RandomForestClassifier: 3.5076229572296143
Fit time for RandomForestClassifier: 3.5162460803985596
Fit time for sktreeRandomForestClassifier: 3.697654962539673
Fit time for sktreeRandomForestClassifier: 3.660207986831665
Fit time for sktreeRandomForestClassifier: 3.6615519523620605
Fit time for sktreeRandomForestClassifier: 3.6803948879241943
Fit time for sktreeRandomForestClassifier: 3.653079032897949

Note: the fit time for sktree-RF was 7sec+ prior to this fix.

The script for this test is at: https://github.com/neurodata/might/blob/cmi/exps/new_submission/Figure6_comight_vs_nsamples_ndims/test_rf_parallel.py

The commit that we tested to get ~3sec on sktree-RF was: https://github.com/neurodata/scikit-tree/pull/242/commits/7c756776fbaaf7b97390d30004322edf53f3c29d
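
To summarize the remaining gap, the fit times listed above can be averaged and compared. The snippet below is only arithmetic on the numbers reported in this comment; the variable names are made up for illustration.

```python
from statistics import mean

# Fit times in seconds, copied from the output above.
sklearn_times = [
    3.522181987762451,
    3.4983439445495605,
    3.518531084060669,
    3.5076229572296143,
    3.5162460803985596,
]
sktree_times = [
    3.697654962539673,
    3.660207986831665,
    3.6615519523620605,
    3.6803948879241943,
    3.653079032897949,
]

sklearn_mean = mean(sklearn_times)
sktree_mean = mean(sktree_times)
# Relative slowdown of the scikit-tree forest versus scikit-learn.
slowdown = sktree_mean / sklearn_mean - 1.0

print(f"sklearn mean: {sklearn_mean:.3f} s")
print(f"sktree mean:  {sktree_mean:.3f} s")
print(f"sktree is {slowdown:.1%} slower")
```

On these five runs each, the means come out to about 3.51 s and 3.67 s, so the scikit-tree forest is roughly 4.5% slower after the fix.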

sampan501 commented 3 months ago

wooot!!