smartcorelib / smartcore

A comprehensive library for machine learning and numerical computing. The library provides a set of tools for linear algebra, numerical computing, optimization, and enables a generic, powerful yet still efficient approach to machine learning.
https://smartcorelib.org/
Apache License 2.0

Parallelize Random Forest training and evaluation #60

Open VolodymyrOrlov opened 3 years ago

VolodymyrOrlov commented 3 years ago

Right now neither random_forest_classifier nor random_forest_regressor fits or evaluates trees in parallel. This can easily be changed, since each tree in the ensemble is independent of the others.

We also need a new parameter, n_jobs (similar to the parameter with the same name in scikit-learn), that will control the level of parallelism.
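Sketched as code, such a parameter might look like the following. This is a hypothetical illustration mirroring scikit-learn's convention; the struct and method names are not smartcore's actual API:

```rust
// Hypothetical sketch of an `n_jobs` knob on the forest parameters.
// All names here are illustrative, not smartcore's real types.
#[derive(Debug, Clone)]
pub struct RandomForestParameters {
    pub n_trees: u16,
    /// Number of threads for fitting and prediction.
    /// `None` means sequential; a builder could map 0 to "all cores".
    pub n_jobs: Option<usize>,
}

impl RandomForestParameters {
    /// Builder-style setter, matching the crate's `with_*` idiom.
    pub fn with_n_jobs(mut self, n_jobs: usize) -> Self {
        self.n_jobs = Some(n_jobs);
        self
    }
}

fn main() {
    let params = RandomForestParameters { n_trees: 100, n_jobs: None }.with_n_jobs(4);
    assert_eq!(params.n_jobs, Some(4));
    println!("n_jobs = {:?}", params.n_jobs);
}
```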

daraghk commented 2 years ago

Hey @VolodymyrOrlov, is this still required? I'd be happy to look at this

VolodymyrOrlov commented 2 years ago

@daraghk absolutely! Let me know if you run into any issues while implementing this feature.

daraghk commented 2 years ago

Awesome! Are the 'datasets' actively used (e.g. digits.rs)? It doesn't seem to me like they are at the moment. I'm just wondering because I plan to use 'digits' to benchmark the implementation of parallelization here. I've had to make small changes to 'digits' regarding feature and target names. (Edit: I just found the examples repo where they're used.)

VolodymyrOrlov commented 2 years ago

I don't think it is being actively used. Feel free to adjust it to your needs

guygastineau commented 2 years ago

I couldn't find any indication of progress on this. I have started using smartcore for random forest classification. @daraghk do you have a branch anywhere where you have started to work on this? @VolodymyrOrlov do you have a good idea of how much work would be involved? I'm willing to throw some time at it, but I am generally busy. The training feels painfully slow compared to the other stages of my processing pipeline which are all parallelized.

morenol commented 2 years ago

@guygastineau the PR was here. I think it only needs the addition of a Rust feature flag, or it could just disable that code for WASM targets. It probably also needs conflicts resolved with the latest additions to the development branch.

guygastineau commented 2 years ago

I did a little spelunking. The fit method, https://github.com/smartcorelib/smartcore/blob/35fe68e024a69ca9f3d63244caeccc740ffede44/src/ensemble/random_forest_classifier.rs#L454, looks to have a clear path forward for parallelization in the loop that samples and calls the associated function fit_weak_learner, but parallelizing that loop will likely require Send and Sync trait bounds to propagate outside of the function definition itself. I hope this won't be too far-reaching.

There are some things I would like to know. Are the authors and maintainers of this project interested in adding the rayon crate as a dependency for such work, or would you all prefer hand-rolled parallelization using the primitives from std::thread? I can see advantages and disadvantages to each approach. If using rayon, I would refactor fit in a "functional"/iterator style to fit its idioms. That might not be desired, and perhaps you all would prefer to avoid adding dependencies anyway. With std::thread I would probably keep the loops, so the function wouldn't look so different.

I should check whether DenseMatrix implements Sync and Send. That might be a requirement here, but I would need to play with the codebase to figure it out.
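To make the std::thread option concrete, here is a minimal sketch of fitting an ensemble of independent learners with scoped threads, using only the standard library. The `Tree` type and `fit_weak_learner` stub are placeholders, not smartcore's real implementation; the point is where the Send/Sync requirements surface:

```rust
use std::thread;

// Toy "weak learner": in smartcore this would be a decision tree; here it
// is a stub so the sketch is self-contained.
#[derive(Debug)]
struct Tree {
    seed: u64,
}

// Placeholder for the real fitting routine, which would bootstrap-sample
// the training data and grow one tree.
fn fit_weak_learner(seed: u64) -> Tree {
    Tree { seed }
}

// Fit `n` trees in parallel with scoped threads (std only, no rayon).
// Anything the closures capture by reference must be Sync, and the value
// each thread returns must be Send -- the same bounds that would need to
// propagate through smartcore's `fit`.
fn fit_forest(n: u64) -> Vec<Tree> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n)
            .map(|i| s.spawn(move || fit_weak_learner(i)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let forest = fit_forest(8);
    assert_eq!(forest.len(), 8);
    println!("fitted {} trees", forest.len());
}
```

With rayon the loop would instead become a `par_iter().map(...).collect()`, which keeps the code shorter but adds the dependency being debated here.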

morenol commented 2 years ago

I don't have a strong opinion about either option; maybe other contributors or active users of this crate do. @VolodymyrOrlov @Mec-iS @montanalow

On the other hand, I think we could change the internal implementation of fit as long as it remains compatible with the SupervisedEstimator trait https://github.com/smartcorelib/smartcore/blob/35fe68e024a69ca9f3d63244caeccc740ffede44/src/api.rs#L29

guygastineau commented 2 years ago

So, does WASM not support parallelization, or does rayon's default configuration just not work with WASM?

It is nice that so much of the work is already done, but I expect I'll run into various problems concerning integration and twiddling with feature flags (I primarily program in Haskell, so I am not fluent with all the Rust build system's features).

I will try to set aside some time each week at work to move this forward. I would spend some evening time, but I am directing the music for a musical over the next few weeks, and all my after-work programming time is gone until that ends.

@morenol please feel free to poke me if I seem to fall off the project for too long.

morenol commented 2 years ago

Regarding WASM and multithreading, I think it depends on the runtime engine. In the rayon repo they suggest a way to run it with wasm-bindgen: https://github.com/rayon-rs/rayon/blob/d3d46c31a49bd1ccea4582043bc69631d1fe176b/README.md#usage-with-webassembly.

For wasmtime it seems there is still work to be done to support multithreading: https://github.com/bytecodealliance/wasmtime/issues/888

For that reason, we could make rayon, or any other dependency like that, optional through a feature flag.

morenol commented 2 years ago

A quick solution could be to put #[cfg(not(target_arch = "wasm32"))] on the multithreaded fit implementation and #[cfg(target_arch = "wasm32")] on a wasm32 fit implementation, which could just be the current code.
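That cfg-gating approach can be sketched as follows. The function bodies are placeholder stubs, not smartcore's real fit code; the sketch only shows how the two target-specific implementations coexist:

```rust
// Parallel implementation, compiled for every target except wasm32.
#[cfg(not(target_arch = "wasm32"))]
fn fit_trees(n: usize) -> Vec<usize> {
    // Stand-in for parallel tree fitting (e.g. via rayon or std::thread);
    // here each "tree" is just its index.
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..n).map(|i| s.spawn(move || i)).collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

// Sequential implementation, compiled only for wasm32: this could simply
// be the current loop, unchanged.
#[cfg(target_arch = "wasm32")]
fn fit_trees(n: usize) -> Vec<usize> {
    (0..n).collect()
}

fn main() {
    // Callers see one `fit_trees`; the target decides which body is built.
    let trees = fit_trees(5);
    assert_eq!(trees.len(), 5);
}
```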

Mec-iS commented 2 years ago

OK, but then rayon should be optional? How big is it as a library? Consider that the size of the binary is already 12 MB.

guygastineau commented 2 years ago

@Mec-iS excellent point. I was thinking about that when I read the PR. I will see if it is possible to have a parallelization feature that is explicitly incompatible with the wasm target, and that pulls in the rayon dependency only when enabled. Hopefully the build system makes that possible.
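A hypothetical Cargo.toml sketch of that setup, assuming a feature named "parallel" (the name is illustrative):

```toml
# rayon as an optional dependency behind a "parallel" feature, so wasm
# builds can simply leave the feature off and pay no binary-size cost.
[dependencies]
rayon = { version = "1", optional = true }

[features]
default = []
parallel = ["dep:rayon"]
```

One caveat: Cargo features are additive and cannot be disabled per-target from within the crate, so guarding the parallel code with #[cfg(target_arch = "wasm32")] (or emitting a compile_error! when "parallel" is combined with wasm32) would still be needed to catch that combination.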

Mec-iS commented 2 years ago

OK. It would also be nice to have a codebase-wide list of methods that can be made parallel, so we can see how worthwhile it is to introduce rayon as a dependency. Adding a dependency is worthwhile if it can be reused in many parts of the codebase.

guygastineau commented 2 years ago

@Mec-iS is there a master list of algorithms in the smartcore library somewhere? I am willing to check which ones are parallelizable.

Mec-iS commented 2 years ago

There is a list of algorithm groups on the documentation's main page in the lib.rs file, or you can build the documentation with cargo doc --no-deps --open:

//! ### Supported algorithms
//! All machine learning algorithms are grouped into these broad categories:
//! * [Clustering](cluster/index.html), unsupervised clustering of unlabeled data.
//! * [Matrix Decomposition](decomposition/index.html), various methods for matrix decomposition.
//! * [Linear Models](linear/index.html), regression and classification methods where output is assumed to have linear relation to explanatory variables
//! * [Ensemble Models](ensemble/index.html), variety of regression and classification ensemble models
//! * [Tree-based Models](tree/index.html), classification and regression trees
//! * [Nearest Neighbors](neighbors/index.html), K Nearest Neighbors for classification and regression
//! * [Naive Bayes](naive_bayes/index.html), statistical classification technique based on Bayes Theorem
//! * [SVM](svm/index.html), support vector machines

The complete list more or less matches the module names; algorithm/neighbour, for example, contains several of them.

Depending on your system you can print the directory structure; on Linux this is the tree command. If you print the directory structure of the src directory with all its subdirectories, you will see the list of all the modules (I removed the ones not containing algorithms):

src/
├── algorithm
│   ├── mod.rs
│   ├── neighbour
│   │   ├── bbd_tree.rs
│   │   ├── cover_tree.rs
│   │   ├── fastpair.rs
│   │   ├── linear_search.rs
│   │   └── mod.rs
│   └── sort
│       ├── heap_select.rs
│       ├── mod.rs
│       └── quick_sort.rs
├── cluster
│   ├── dbscan.rs
│   ├── kmeans.rs
│   └── mod.rs
├── decomposition
│   ├── mod.rs
│   ├── pca.rs
│   └── svd.rs
├── ensemble
│   ├── mod.rs
│   ├── random_forest_classifier.rs
│   └── random_forest_regressor.rs
├── lib.rs
├── linalg
│   └── traits
│       ├── cholesky.rs
│       ├── evd.rs
│       ├── high_order.rs
│       ├── lu.rs
│       ├── mod.rs
│       ├── qr.rs
│       ├── stats.rs
│       └── svd.rs
├── linear
│   ├── bg_solver.rs
│   ├── elastic_net.rs
│   ├── lasso_optimizer.rs
│   ├── lasso.rs
│   ├── linear_regression.rs
│   ├── logistic_regression.rs
│   ├── mod.rs
│   └── ridge_regression.rs
├── metrics
│   ├── accuracy.rs
│   ├── auc.rs
│   ├── cluster_hcv.rs
│   ├── cluster_helpers.rs
│   ├── distance
│   │   ├── euclidian.rs
│   │   ├── hamming.rs
│   │   ├── mahalanobis.rs
│   │   ├── manhattan.rs
│   │   ├── minkowski.rs
│   │   └── mod.rs
│   ├── f1.rs
│   ├── mean_absolute_error.rs
│   ├── mean_squared_error.rs
│   ├── mod.rs
│   ├── precision.rs
│   ├── r2.rs
│   └── recall.rs
├── model_selection
│   ├── kfold.rs
│   └── mod.rs
├── naive_bayes
│   ├── bernoulli.rs
│   ├── categorical.rs
│   ├── gaussian.rs
│   ├── mod.rs
│   └── multinomial.rs
├── neighbors
│   ├── knn_classifier.rs
│   ├── knn_regressor.rs
│   └── mod.rs
├── optimization
│   ├── first_order
│   │   ├── gradient_descent.rs
│   │   ├── lbfgs.rs
│   │   └── mod.rs
│   ├── line_search.rs
│   └── mod.rs
├── preprocessing
│   ├── categorical.rs
│   ├── mod.rs
│   ├── numerical.rs
│   ├── series_encoder.rs
│   └── traits.rs
├── svm
│   ├── mod.rs
│   ├── search
│   │   ├── mod.rs
│   │   ├── svc_params.rs
│   │   └── svr_params.rs
│   ├── svc.rs
│   └── svr.rs
└── tree
    ├── decision_tree_classifier.rs
    ├── decision_tree_regressor.rs
    └── mod.rs

guygastineau commented 2 years ago

Oh, thank you. I thought tree might do the trick, but I wasn't sure if the module structure lined up so well with the algorithms. Thank you for the information. You all have organized this project very well.

BTW, I remembered to run my smartcore projects with a release build, and the speedup is insane. That significantly reduces how urgently I want/need parallelization, but I will stay on this issue, since I think I should contribute in some way, as a user, to say thanks to all of you maintaining this excellent library.

Mec-iS commented 2 years ago

The new version 0.3 is out today with support for Rust 2021; see the improvements in the docs: https://docs.rs/smartcore/latest/smartcore/

Check the list of issues and use Github Discussions if you want to propose larger improvements https://github.com/smartcorelib/smartcore/discussions

Mec-iS commented 2 years ago

@guygastineau we also published smartcore-benches if you want to run or contribute some performance tests.

guygastineau commented 2 years ago

> @guygastineau we also published smartcore-benches if you want to run or contribute some performance tests.

I will take a look. BTW, should the list of parallelizable algorithms start in this thread, or should we begin it somewhere else, or in a new issue referencing this one? I remember seeing a link somewhere in our conversation the other day (I think; maybe it was elsewhere in smartcore's repo) suggesting that improvement proposals should go somewhere for discussion, but I can't find it now. I am feeling a bit brain-dead: we have had rehearsals after work from 7 to 10/11 every night this week (for the musical I mentioned previously), and I'll have shows and rehearsals straight through next Saturday, so I will continue to be brain-dead for at least another week.

Mec-iS commented 2 years ago

Yes, you can create a new Discussion at https://github.com/smartcorelib/smartcore/discussions

with your general ideas for what you want to do. Whoever is interested can join the discussion and integrate their own ideas. Once the requirements for the implementation are clear, we can move on and create issues.

thegwan commented 1 year ago

Has there been any progress on random forest parallelization lately, or a branch that someone has started?