Optimizations for scikit-tree to improve multi-core performance

sampan501 commented 3 months ago

Checklist

[ ] I have verified that the issue exists against the main branch.
[ ] I have read the relevant section in the contribution guide on reporting bugs.
[ ] I have checked the issues list for similar or identical bug reports.
[ ] I have checked the pull requests list for existing proposed fixes.
[ ] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
[ ] I have included in the "Description" section below a traceback from any exceptions related to this bug.
[ ] I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
[ ] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
[ ] I have included in the "Environment" section below the output of pip freeze.
[ ] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

There is occasional low CPU usage when using scikit-tree forests in parallel. Running the same code, in machines with many cores, I'm getting roughly 4-5% usage with scikit-tree forests and 60-70% using scikit-learn for the same types of problems. We should look into their Cython code optimizations and see how we can make improvements to our code base.

adam2392 commented 3 months ago

I think in terms of sequential experiments to run:

RandomForestClassifier in scikit-learn vs RandomForestClassifier in scikit-tree in just n_samples vs time to fit with n_jobs =1 vs n_jobs = -1

If this doesn't look good, it means forsure our compiler is messed up somehow, or we introduce some serious issues in the fork that we're not aware of.

Wrap HonestForestClassifier with DTC from sklearn vs DTC from scikit-tree. To determine if HonestForest introduces this issue somehow

Within each of the above, we would have to investigate CPU/RAM usage in-depth using valgrind, or something...

sampan501 commented 3 months ago

image (1)

sampan501 commented 3 months ago

CoMIGHT before changes in #242

CoMIGHT after changes in #242

adam2392 commented 3 months ago

To confirm this is not an isolated issue with comight right? Or so far it is?

sampan501 commented 3 months ago

it is not

SUKI-O commented 3 months ago

We ran some tests and after the fix Adam pushed the diff between RF and sktree-RF are:

Fit time for RandomForestClassifier: 3.522181987762451
Fit time for RandomForestClassifier: 3.4983439445495605
Fit time for RandomForestClassifier: 3.518531084060669
Fit time for RandomForestClassifier: 3.5076229572296143
Fit time for RandomForestClassifier: 3.5162460803985596
Fit time for sktreeRandomForestClassifier: 3.697654962539673
Fit time for sktreeRandomForestClassifier: 3.660207986831665
Fit time for sktreeRandomForestClassifier: 3.6615519523620605
Fit time for sktreeRandomForestClassifier: 3.6803948879241943
Fit time for sktreeRandomForestClassifier: 3.653079032897949

Note: the result for sktree-RF was 7sec+ prior to this fix.

The script for this test is found : https://github.com/neurodata/might/blob/cmi/exps/new_submission/Figure6_comight_vs_nsamples_ndims/test_rf_parallel.py

The commit that we tested to get ~3sec on sktree-RF was: https://github.com/neurodata/scikit-tree/pull/242/commits/7c756776fbaaf7b97390d30004322edf53f3c29d

sampan501 commented 3 months ago

wooot!!

neurodata / scikit-tree

Optimizations for scikit-tree to improve multi-core performance #245

Checklist

Description