VolodymyrOrlov opened this issue 3 years ago
Hey @VolodymyrOrlov, is this still required? I'd be happy to look at this
@daraghk absolutely! Let me know if you run into any issues while implementing this feature.
Awesome! Are the 'datasets' actively used (e.g. digits.rs)? It doesn't seem to me like they are at the moment. I'm just wondering, as I plan to use 'digits' to benchmark the implementation of parallelization here. I've had to make small changes to 'digits' regarding feature and target names. (Edit: I just found the examples repo where they're used.)
I don't think it is being actively used. Feel free to adjust it to your needs
I couldn't find any indication of progress on this. I have started using smartcore for random forest classification. @daraghk do you have a branch anywhere where you have started to work on this? @VolodymyrOrlov do you have a good idea of how much work would be involved? I'm willing to throw some time at it, but I am generally busy. The training feels painfully slow compared to the other stages of my processing pipeline which are all parallelized.
@guygastineau the PR was here. I think that it only needs the addition of a Rust feature flag, or maybe just disabling that code for WASM targets. It also probably needs conflicts with the latest additions to the development branch resolved.
I did a little spelunking. The method `fit` (https://github.com/smartcorelib/smartcore/blob/35fe68e024a69ca9f3d63244caeccc740ffede44/src/ensemble/random_forest_classifier.rs#L454) looks to have a clear path forward for parallelization in the loop that samples and calls the associated function `fit_weak_learner`, but parallelizing that loop will likely require trait bounds for `Send` and `Sync` to propagate outside of the function definition itself.
There are some things I would like to know. Are the authors and maintainers of this project interested in the addition of the crate rayon as a dependency for such work, or would you all prefer hand-rolled parallelization using the primitives from `std::thread`? I can see advantages and disadvantages to each approach. If using rayon, I would refactor `fit` in a "functional"/iterator style to fit its idioms. That might not be desired, and perhaps you all would prefer to avoid adding dependencies anyway. With `std::thread` I would probably keep the loops, so the function wouldn't look so different.
I should also check whether `DenseMatrix` implements `Sync` and `Send`. That might be a requirement here, but I would need to play with the codebase to figure that out.
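For concreteness, here is a minimal sketch of what the rayon version of that loop could look like. Everything here is a stand-in I invented for illustration (only the name `fit_weak_learner` comes from the linked source); the point is just where the `Send`/`Sync` bounds show up:

```rust
use rayon::prelude::*;

// Illustrative stand-ins for smartcore's real types.
struct Dataset;
struct DecisionTree;

// Hypothetical per-tree training step, mirroring the body of the loop in `fit`.
fn fit_weak_learner(data: &Dataset, seed: u64) -> DecisionTree {
    let _ = (data, seed);
    DecisionTree
}

// Roughly what `fit` does today: fit each tree in sequence.
fn fit_sequential(data: &Dataset, n_trees: u64) -> Vec<DecisionTree> {
    (0..n_trees).map(|seed| fit_weak_learner(data, seed)).collect()
}

// The rayon version: `into_par_iter` spreads the independent tree fits across
// the thread pool. This only compiles if `Dataset` is `Sync` (shared by
// reference across threads) and `DecisionTree` is `Send` (moved back from the
// workers), which is exactly why those bounds would propagate outward.
fn fit_parallel(data: &Dataset, n_trees: u64) -> Vec<DecisionTree> {
    (0..n_trees)
        .into_par_iter()
        .map(|seed| fit_weak_learner(data, seed))
        .collect()
}
```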
I don't have a strong opinion about either option; maybe other contributors or active users of this crate do. @VolodymyrOrlov @Mec-iS @montanalow
On the other hand, I think that we could change the internal implementation of `fit` as long as it stays compatible with the `SupervisedEstimator` trait: https://github.com/smartcorelib/smartcore/blob/35fe68e024a69ca9f3d63244caeccc740ffede44/src/api.rs#L29
So, WASM doesn't support parallelization, or does default rayon just not work with WASM?
It is nice that so much of the work is already done, but I expect I'll run into various problems concerning integration and twiddling with feature flags (I primarily program in Haskell, so I am not fluent with all the Rust build system features).
I will try to set some time aside each week at work to move this forward. I would spend some evening time, but I am directing the music for a musical over the next few weeks, and all my after-work programming time is gone until that ends.
@morenol please feel free to poke me if I seem to fall off the project for too long.
Regarding WASM and multithreading, I think that it depends on the runtime engine. In the rayon repo they suggest a way to run it with wasm-bindgen: https://github.com/rayon-rs/rayon/blob/d3d46c31a49bd1ccea4582043bc69631d1fe176b/README.md#usage-with-webassembly. For wasmtime it seems that there is still work to be done to support multithreading: https://github.com/bytecodealliance/wasmtime/issues/888.
For that reason, we could make rayon, or any other dependency like that, optional through a feature flag.
A quick solution could be to add `#[cfg(not(target_arch = "wasm32"))]` to the multithreaded `fit` implementation and put `#[cfg(target_arch = "wasm32")]` on a wasm32 `fit` implementation that could be just the current code.
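Roughly this shape, with placeholder types standing in for the real ones:

```rust
// Placeholder types; the real `fit` lives on RandomForestClassifier.
struct Dataset;
struct Forest;

impl Forest {
    // Compiled for every target except wasm32: the multithreaded path.
    #[cfg(not(target_arch = "wasm32"))]
    pub fn fit(data: &Dataset) -> Self {
        // ... the rayon-based loop over the trees would go here ...
        let _ = data;
        Forest
    }

    // Compiled only for wasm32: just the current single-threaded code.
    #[cfg(target_arch = "wasm32")]
    pub fn fit(data: &Dataset) -> Self {
        let _ = data;
        Forest
    }
}
```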
Ok, but then rayon should be optional? How big is it as a library? Consider that the size of the binary is already 12 MB.
@Mec-iS excellent point. I was thinking about that when I read the PR. I will see if it is possible to have a parallelization feature that is specifically disabled for the wasm target and that pulls in the rayon dependency only when it is turned on. Hopefully the build system makes that possible.
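From what I can tell, Cargo supports exactly this: an optional dependency activated by a feature. A hypothetical manifest fragment, assuming a feature named `parallel` (my invention, not an existing smartcore feature):

```toml
[dependencies]
# Compiled in only when the "parallel" feature is enabled.
rayon = { version = "1", optional = true }

[features]
# Hypothetical feature name; enabling it pulls in the optional rayon dependency.
parallel = ["rayon"]
```

The parallel code path could then be gated with `#[cfg(all(feature = "parallel", not(target_arch = "wasm32")))]`, so default and wasm builds never see rayon.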
Ok. Also, it would be nice to have a codebase-wide list of methods that can be made parallel, so we can see how worthwhile it is to introduce rayon as a dependency. Adding a dependency is worth it if it can be reused in many parts of the codebase.
@Mec-iS is there a master list of algorithms in the smartcore library somewhere? I am willing to check which ones are parallelizable.
There is a list of the algorithm groups on the documentation main page in the `lib.rs` file, or you can build the documentation with `cargo doc --no-deps --open`:
//! ### Supported algorithms
//! All machine learning algorithms are grouped into these broad categories:
//! * [Clustering](cluster/index.html), unsupervised clustering of unlabeled data.
//! * [Matrix Decomposition](decomposition/index.html), various methods for matrix decomposition.
//! * [Linear Models](linear/index.html), regression and classification methods where output is assumed to have linear relation to explanatory variables
//! * [Ensemble Models](ensemble/index.html), variety of regression and classification ensemble models
//! * [Tree-based Models](tree/index.html), classification and regression trees
//! * [Nearest Neighbors](neighbors/index.html), K Nearest Neighbors for classification and regression
//! * [Naive Bayes](naive_bayes/index.html), statistical classification technique based on Bayes Theorem
//! * [SVM](svm/index.html), support vector machines
The complete list is more or less the module names. For example, in algorithm/neighbour there are:
- bbd_tree
- cover_tree
- fastpair
- linear_search
Depending on your system you can print the directory structure; on Linux this is the `tree` command. If you print the directory structure of the `src` directory with all its subdirectories, you will see the list of all the modules (I removed the ones not containing algorithms):
src/
├── algorithm
│ ├── mod.rs
│ ├── neighbour
│ │ ├── bbd_tree.rs
│ │ ├── cover_tree.rs
│ │ ├── fastpair.rs
│ │ ├── linear_search.rs
│ │ └── mod.rs
│ └── sort
│ ├── heap_select.rs
│ ├── mod.rs
│ └── quick_sort.rs
├── cluster
│ ├── dbscan.rs
│ ├── kmeans.rs
│ └── mod.rs
├── decomposition
│ ├── mod.rs
│ ├── pca.rs
│ └── svd.rs
├── ensemble
│ ├── mod.rs
│ ├── random_forest_classifier.rs
│ └── random_forest_regressor.rs
├── lib.rs
├── linalg
│ └── traits
│ ├── cholesky.rs
│ ├── evd.rs
│ ├── high_order.rs
│ ├── lu.rs
│ ├── mod.rs
│ ├── qr.rs
│ ├── stats.rs
│ └── svd.rs
├── linear
│ ├── bg_solver.rs
│ ├── elastic_net.rs
│ ├── lasso_optimizer.rs
│ ├── lasso.rs
│ ├── linear_regression.rs
│ ├── logistic_regression.rs
│ ├── mod.rs
│ └── ridge_regression.rs
├── metrics
│ ├── accuracy.rs
│ ├── auc.rs
│ ├── cluster_hcv.rs
│ ├── cluster_helpers.rs
│ ├── distance
│ │ ├── euclidian.rs
│ │ ├── hamming.rs
│ │ ├── mahalanobis.rs
│ │ ├── manhattan.rs
│ │ ├── minkowski.rs
│ │ └── mod.rs
│ ├── f1.rs
│ ├── mean_absolute_error.rs
│ ├── mean_squared_error.rs
│ ├── mod.rs
│ ├── precision.rs
│ ├── r2.rs
│ └── recall.rs
├── model_selection
│ ├── kfold.rs
│ └── mod.rs
├── naive_bayes
│ ├── bernoulli.rs
│ ├── categorical.rs
│ ├── gaussian.rs
│ ├── mod.rs
│ └── multinomial.rs
├── neighbors
│ ├── knn_classifier.rs
│ ├── knn_regressor.rs
│ └── mod.rs
├── optimization
│ ├── first_order
│ │ ├── gradient_descent.rs
│ │ ├── lbfgs.rs
│ │ └── mod.rs
│ ├── line_search.rs
│ └── mod.rs
├── preprocessing
│ ├── categorical.rs
│ ├── mod.rs
│ ├── numerical.rs
│ ├── series_encoder.rs
│ └── traits.rs
├── svm
│ ├── mod.rs
│ ├── search
│ │ ├── mod.rs
│ │ ├── svc_params.rs
│ │ └── svr_params.rs
│ ├── svc.rs
│ └── svr.rs
└── tree
├── decision_tree_classifier.rs
├── decision_tree_regressor.rs
└── mod.rs
Oh, thank you. I thought `tree` might do the trick, but I wasn't sure if the module structure lined up so well with the algorithms. Thank you for the information. You all have organized this project very well.
BTW, I remembered to run my smartcore projects with a release build, and the speedup is insane. That significantly reduces how urgently I want/need parallelization, but I will stay on this issue, since I think I should contribute in some way as a user to say thanks to all of you maintaining this excellent library.
The new version 0.3 is out today with support for Rust 2021; see the improvements in the docs: https://docs.rs/smartcore/latest/smartcore/
Check the list of issues and use GitHub Discussions if you want to propose larger improvements: https://github.com/smartcorelib/smartcore/discussions
@guygastineau we also published smartcore-benches, if you want to run or contribute some performance tests.
I will take a look. BTW, should the list of parallelizable algorithms start in this thread/issue, or should we begin it somewhere else, or in a new issue that references this one? I remember seeing a link somewhere in our conversation the other day (or maybe it was somewhere else in smartcore's repo on GitHub) saying that improvement proposals should go somewhere for discussion, but I can't find it now. I am feeling a bit brain-dead. We have had rehearsals after work from 7 to 10/11 every night this week (for the musical I mentioned previously), and I'll have shows and rehearsals straight through next Saturday, so I will continue to be brain-dead for at least another week.
Yes, you can create a new Discussion at https://github.com/smartcorelib/smartcore/discussions with your general ideas for what you want to do. Whoever is interested can join the discussion and try to integrate their own ideas. Once the needs for the implementation are clear, we can move on and create issues.
Has there been any progress on random forest parallelization lately, or a branch that someone has started?
Right now neither `random_forest_classifier` nor `random_forest_regressor` computes and evaluates trees in parallel. This can easily be changed, since each tree in the ensemble does not depend on the other trees. We also need a new parameter, `n_jobs` (similar to the parameter with the same name in scikit-learn), that will control the level of parallelization.
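To make that concrete, here is a minimal sketch of how `n_jobs` could map onto rayon's `ThreadPoolBuilder`, which caps the number of worker threads. Apart from the rayon calls, every name and signature below is a placeholder, not smartcore's actual API:

```rust
use rayon::prelude::*;

// Placeholder types and per-tree training step for illustration.
struct Dataset;
struct DecisionTree;

fn fit_weak_learner(data: &Dataset, seed: u64) -> DecisionTree {
    let _ = (data, seed);
    DecisionTree
}

// Hypothetical fit: `n_jobs = None` keeps the current single-threaded
// behavior, while `Some(n)` fits the independent trees on a dedicated
// pool of n threads, mirroring scikit-learn's n_jobs semantics.
fn fit(data: &Dataset, n_trees: u64, n_jobs: Option<usize>) -> Vec<DecisionTree> {
    match n_jobs {
        None => (0..n_trees).map(|s| fit_weak_learner(data, s)).collect(),
        Some(n) => rayon::ThreadPoolBuilder::new()
            .num_threads(n)
            .build()
            .expect("failed to build rayon thread pool")
            // `install` runs the closure inside this pool, so the parallel
            // iterator below uses exactly n worker threads.
            .install(|| {
                (0..n_trees)
                    .into_par_iter()
                    .map(|s| fit_weak_learner(data, s))
                    .collect()
            }),
    }
}
```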