Roadmap

LukeMathWalker commented 4 years ago

In terms of functionality, the mid-term end goal is to achieve an offering of ML algorithms and pre-processing routines comparable to what is currently available in Python's scikit-learn.

These algorithms can either be:

- re-implemented in Rust;
- re-exported from an existing Rust crate, if available on crates.io with a compatible interface.

In no particular order, focusing on the main gaps:

Clustering:
- [x] DBSCAN
- [x] Spectral clustering;
- [x] Hierarchical clustering;
- [x] OPTICS.
Preprocessing:
- [x] PCA
- [x] ICA
- [x] Normalisation
- [x] CountVectoriser
- [x] TFIDF
- [x] t-SNE
Supervised Learning:
- [x] Linear regression;
- [x] Ridge regression;
- [x] LASSO;
- [x] ElasticNet;
- [x] Support vector machines;
- [x] Nearest Neighbours;
- [ ] Gaussian processes; (integrating friedrich - tracking issue https://github.com/nestordemeure/friedrich/issues/1)
- [x] Decision trees;
- [ ] Random Forest
- [x] Naive Bayes
- [x] Logistic Regression
- [ ] Ensemble Learning
- [ ] Least Angle Regression
- [x] PLS

The collection is loose and non-exhaustive on purpose, and it will evolve over time. If there is an ML algorithm that you find yourself using often on a day-to-day basis, please feel free to contribute it :100:

Nimpruda commented 4 years ago

Hi, I'm eager to help. I'll take linear regression, LASSO and ridge.

LukeMathWalker commented 4 years ago

Cool! I worked a bit on linear regression a while ago - you can find a very vanilla implementation of it here: https://github.com/rust-ndarray/ndarray-examples/tree/master/linear_regression @Nimpruda

InCogNiTo124 commented 4 years ago

What does Normalization mean, is it like sklearn's StandardScaler or something else?

LukeMathWalker commented 4 years ago

Exactly @InCogNiTo124.
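
For anyone picking this up, here is a minimal sketch of what "normalisation in the StandardScaler sense" means: each column is centred on its mean and divided by its standard deviation. This is not linfa's API. The function name `standardize`, the `Vec<Vec<f64>>` layout, the use of the population standard deviation, and the rule that near-constant columns are only centred are all assumptions made for illustration.

```rust
/// Column-wise standardisation: (x - mean) / std for every column.
/// Assumptions: all rows have the same length, the population standard
/// deviation (divide by n) is used, and columns whose standard deviation
/// is (almost) zero keep a scale of 1 so they are only centred.
fn standardize(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    if rows.is_empty() {
        return Vec::new();
    }
    let n = rows.len() as f64;
    let dim = rows[0].len();

    // Per-column mean: sum first, divide once.
    let mut mean = vec![0.0_f64; dim];
    for row in rows {
        for (m, x) in mean.iter_mut().zip(row) {
            *m += *x;
        }
    }
    for m in mean.iter_mut() {
        *m /= n;
    }

    // Per-column population standard deviation.
    let mut scale = vec![0.0_f64; dim];
    for row in rows {
        for j in 0..dim {
            scale[j] += (row[j] - mean[j]).powi(2);
        }
    }
    for s in scale.iter_mut() {
        *s = (*s / n).sqrt();
        // Hypothetical threshold: treat tiny deviations as a constant column.
        if *s < 1e-12 {
            *s = 1.0;
        }
    }

    // Apply the transformation to every entry.
    rows.iter()
        .map(|row| {
            row.iter()
                .zip(mean.iter().zip(&scale))
                .map(|(x, (m, s))| (*x - *m) / *s)
                .collect()
        })
        .collect()
}
```

Calling `standardize(&[vec![1.0, 10.0], vec![2.0, 10.0], vec![3.0, 10.0]])` gives a first column with zero mean and unit variance, while the constant second column becomes all zeros. A real implementation would more likely keep the fitted mean and scale around so they can be reused on new data.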

ADMoreau commented 4 years ago

This is an interesting project and I will work on the PCA implementation

nestordemeure commented 4 years ago

I am the author of the friedrich crate which implements Gaussian Processes.

While it is still a work in progress, it is fully featured and I would be happy to help integrate it into the project if you have directions to do so.

LukeMathWalker commented 4 years ago

That would be awesome @nestordemeure - I'll have a look at the project and I'll get back to you! Should I open an issue on friedrich's repository when I am ready? Or would you prefer it to be tracked here on the linfa repository?

nestordemeure commented 4 years ago

Both are ok with me.

An issue in friedrich's repository might help avoid overcrowding linfa with issues, but do as you prefer.

mstallmo commented 4 years ago

I'd love to take the Nearest Neighbors implementation

milesgranger commented 4 years ago

I think this is really great. I just started on an sklearn-like implementation of their pipelines here, but it is more or less for experimentation, nothing serious. I'll be sure to keep my eye on issues/goals here and help out where I can. Thanks for the initiative! :clap:

ChristopherRabotin commented 4 years ago

Hi there! First off, I don't have any experience in ML, but I read a lot about it (and listen to way too many podcasts on the topic). I'm interested in jumping in. I have quite some experience developing in Rust, and specifically high fidelity simulation tools (cf nyx and hifitime).

I wrote an Ant Colony Optimizer in Rust. ACOs are great for traversing graphs which represent a solution space, a problem which is considered NP-hard if I'm not mistaken. Is that something used at all in ML? If so, would it be of interest to this library, or is there a greater interest (for now) in focusing on the problems listed in the first post?

Cheers

Nimpruda commented 4 years ago

Hi @ChristopherRabotin, I've never heard of ACOs, but since they relate to graphs you should check whether they have any uses with Markov chains.

ChristopherRabotin commented 4 years ago

So far, I haven't found how the two can be used together. The closest I found was several papers that use Markov chains to analyze ACOs.

onehr commented 4 years ago

I would like to take the Naive Bayes one.

tyfarnan commented 4 years ago

I'll take on Gaussian Processes.

bplevin36 commented 4 years ago

I'll put some work towards the text tokenization algorithms (CountVectorizer and TFIDF). I'm also extremely interested in a good SVM implementation in Rust. Whoever is working on that, let me know if you'd like some help or anything.

LukeMathWalker commented 4 years ago

Please take a look at what is already out there before diving head down into a reimplementation @tyfarnan - I haven't had the time to look at friedrich by @nestordemeure yet (taking a break after the final push to release the blog post and related code 😅) but we should definitely start from there as well as the GP sub-module in rusty-machine.

nestordemeure commented 4 years ago

@tyfarnan, don't hesitate to contact me via an issue on friedrich's repository once @LukeMathWalker has spelled out what is expected of code that is integrated into linfa and how this integration will be done.

DallasC commented 4 years ago

I did a quick round up of crates that implement the algorithms listed on the roadmap. Probably missed quite a few too but this can be a good starting point.

It was just a quick search so I don't know how relevant each crate is, but I tried to make a note if the crate was old and unmaintained. Hopefully this can be useful for helping with algorithm design or saving us from having to reimplement something that is already there.

Algo ecosystem gist

LukeMathWalker commented 4 years ago

Tracking friedrich<>linfa integration here: https://github.com/nestordemeure/friedrich/issues/1

LukeMathWalker commented 4 years ago

I have updated the Issue to make sure it's immediately clear who is working on what and what items are still looking for an owner 👍

InCogNiTo124 commented 4 years ago

hey @LukeMathWalker could you add me next to the normalization? I plan to do it by New Year's as I'm still not very experienced with Rust, but I have an idea how to implement it

LukeMathWalker commented 4 years ago

Done @InCogNiTo124 :pray:

xd009642 commented 4 years ago

Started implementing DBSCAN in #12.

Also, if suggestions are welcome, Gaussian Mixture Models would be cool.

LukeMathWalker commented 4 years ago

Implementation of DBSCAN merged to master - thanks @xd009642 :pray:

adamShimi commented 4 years ago

Hi, really cool project! I have a question concerning the scope: do you eventually want to have deep learning and reinforcement learning algorithms too? I'm curious whether adding them is the eventual plan and you just want to start with the easier stuff, or whether you think along the same lines as the scikit-learn devs themselves: here.

Either way, I'll be glad to help spread the Rust gospel. Right now I'm going through the Reinforcement Learning book, and I will implement some of the algorithms; if that's in the scope of linfa, I'll be glad to try adding them to it. If not, I plan to read through Understanding Machine Learning afterwards, and thus will eventually reach some of the algorithms in the roadmap. Then I will help by implementing them. :)

xd009642 commented 4 years ago

From previous discussions, deep learning etc. is out of scope for the same reasons as it is for scikit-learn. @LukeMathWalker might have more to say about it or reinforcement learning :smile:

adamShimi commented 4 years ago

Ok, thanks.

LukeMathWalker commented 4 years ago

I would consider both of them to be out of scope for this project - it's already incredibly broad as it is right now :sweat_smile: I'd love to see something spawn up for reinforcement learning, especially gym environments!

bytesnake commented 4 years ago

Can you also include Non-Negative Matrix Factorization (NMF) in the list of pre-processing steps? It's a standard algorithm in NLP/audio enhancement and decomposes a matrix into the product of two non-negative matrices. (https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)

One of the nice properties is that there is a simple incremental algorithm for solving the problem, with a simple modification for sparsity constraints.
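
To make the idea concrete, here is a rough sketch of one well-known simple scheme for NMF, the Lee and Seung multiplicative updates, which approximate a non-negative matrix V by a product W·H. Whether this is the exact algorithm meant above is an assumption. The names `nmf`, `matmul`, `transpose` and `frobenius_error`, the deterministic initialisation and the small `eps` guard are all made up for illustration, and the sparsity modification is not shown.

```rust
type Matrix = Vec<Vec<f64>>;

/// Dense matrix product a * b (plain triple loop, illustration only).
fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    let (n, k, m) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0_f64; m]; n];
    for i in 0..n {
        for p in 0..k {
            for j in 0..m {
                out[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    out
}

fn transpose(a: &Matrix) -> Matrix {
    let (n, m) = (a.len(), a[0].len());
    let mut out = vec![vec![0.0_f64; n]; m];
    for i in 0..n {
        for j in 0..m {
            out[j][i] = a[i][j];
        }
    }
    out
}

/// Frobenius norm of V - W * H, i.e. the reconstruction error.
fn frobenius_error(v: &Matrix, w: &Matrix, h: &Matrix) -> f64 {
    let wh = matmul(w, h);
    let mut sum = 0.0_f64;
    for (row_v, row_wh) in v.iter().zip(&wh) {
        for (a, b) in row_v.iter().zip(row_wh) {
            sum += (*a - *b).powi(2);
        }
    }
    sum.sqrt()
}

/// Multiplicative updates: H <- H .* (W^T V) ./ (W^T W H),
/// then W <- W .* (V H^T) ./ (W H H^T), repeated `iterations` times.
/// Assumes V is non-negative and non-empty.
fn nmf(v: &Matrix, rank: usize, iterations: usize) -> (Matrix, Matrix) {
    let (n, m) = (v.len(), v[0].len());
    let eps = 1e-12; // guards against division by zero
    // Deterministic positive start values (a real implementation would
    // probably use random initialisation).
    let mut w: Matrix = (0..n)
        .map(|i| (0..rank).map(|r| 0.5 + 0.1 * ((i + 2 * r) % 5) as f64).collect())
        .collect();
    let mut h: Matrix = (0..rank)
        .map(|r| (0..m).map(|j| 0.5 + 0.1 * ((3 * r + j) % 7) as f64).collect())
        .collect();

    for _ in 0..iterations {
        let wt = transpose(&w);
        let num = matmul(&wt, v);
        let den = matmul(&matmul(&wt, &w), &h);
        for r in 0..rank {
            for j in 0..m {
                h[r][j] *= num[r][j] / (den[r][j] + eps);
            }
        }

        let ht = transpose(&h);
        let num = matmul(v, &ht);
        let den = matmul(&w, &matmul(&h, &ht));
        for i in 0..n {
            for r in 0..rank {
                w[i][r] *= num[i][r] / (den[i][r] + eps);
            }
        }
    }
    (w, h)
}
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative without any explicit projection step, which is a large part of why the scheme is so simple. Calling `nmf(&v, 2, 200)` on a small non-negative matrix and comparing `frobenius_error` against a run with a single iteration is a quick way to see the reconstruction improve.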

bytesnake commented 4 years ago

For hierarchical clustering there is the wonderful kodama crate. It's based on this paper and implements a list of algorithms for hierarchical clustering (and chooses the fastest one). I think it would be a waste to re-implement them. Perhaps we can just re-export it in a module?

bytesnake commented 4 years ago

Please take a look at the PR https://github.com/rust-ndarray/ndarray-linalg/pull/184 which adds TruncatedEig and TruncatedSvd routines to the library. Both are based on the LOBPCG algorithm and allow an iterative approach to eigenvalue and singular value decomposition. This is used in PCA, manifold learning (e.g. spectral clustering) and discriminant analysis and is therefore useful here too. The algorithm also supports sparse problems, because the operator is defined in a matrix free way. (the matrix A is provided as a closure in the function call to LOBPCG)
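
To illustrate what "matrix free" means here, the sketch below passes the operator as a closure that maps a vector x to A·x, so the solver never sees the matrix itself. Note that this is plain power iteration, a much simpler method than LOBPCG, and it is not the ndarray-linalg API; the function name `power_iteration` and its signature are assumptions for illustration only.

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Power iteration for the dominant eigenpair of a symmetric operator.
/// `apply` computes A * x; the matrix A is never materialised here,
/// so sparse or implicit operators work just as well as dense ones.
fn power_iteration<F>(apply: F, dim: usize, iterations: usize) -> (f64, Vec<f64>)
where
    F: Fn(&[f64]) -> Vec<f64>,
{
    // Non-uniform start vector, normalised to unit length.
    let mut x: Vec<f64> = (0..dim).map(|i| 1.0 + i as f64).collect();
    let norm = dot(&x, &x).sqrt();
    for v in x.iter_mut() {
        *v /= norm;
    }

    let mut eigenvalue = 0.0_f64;
    for _ in 0..iterations {
        let y = apply(&x);
        // Rayleigh quotient x^T A x (x has unit length).
        eigenvalue = dot(&x, &y);
        let norm = dot(&y, &y).sqrt();
        if norm == 0.0 {
            break; // x lies in the null space, nothing more to do
        }
        x = y.iter().map(|v| v / norm).collect();
    }
    (eigenvalue, x)
}
```

For the 2x2 matrix [[2, 1], [1, 2]] the operator can be written as `|v: &[f64]| vec![2.0 * v[0] + v[1], v[0] + 2.0 * v[1]]`, and the iteration converges to the dominant eigenvalue 3. The same call shape works for a sparse operator, since only the closure needs to know how the matrix is stored.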

bytesnake commented 4 years ago

Can you also add me to the spectral clustering task? Will try to implement classical Multidimensional Scaling. The t-SNE technique is interesting too, but requires more time because it is based on a custom optimization problem. Furthermore, Gaussian embedding is another interesting technique that I used recently, but it requires at least SGD for a single-layer NN. See papers here:

mossbanay commented 4 years ago

Hi there!

I've been working on an implementation of decision trees here. It's still a WIP and needs documentation but it's a start at least. Once it's a bit more polished I can look at random forests also.

VirtualSpaceman commented 3 years ago

Hello, if possible, I would like to take on the implementation of the ICA algorithm. My implementation will take this paper as a guide.

quietlychris commented 3 years ago

@VirtualSpaceman That would be great--any pull requests are very welcome

bytesnake commented 3 years ago

ordinary linear regression was added in #20, thanks to @Nimpruda and @paulkoerbitz

bytesnake commented 3 years ago

linear decision trees were added in #18, kudos to @mossbanay

bytesnake commented 3 years ago

fast Independent Component Analysis was added in #47, kudos to @VasanthakumarV

xd009642 commented 3 years ago

I'm gonna start working on OPTICS at some point soon :eyes:

Sauro98 commented 3 years ago

Hello, I am working on an implementation of the Approximated DBSCAN algorithm here and I was wondering if that is something that could be interesting for this project. Right now it has all the basic functionalities implemented and I would happily make any changes to make it fit here

quietlychris commented 3 years ago

@Sauro98 I believe linfa currently has an implementation of the vanilla DBSCAN algorithm here, but an approximate version would be a great addition! It would be great if you could open a pull request following that same style under the linfa-clustering sub-crate (i.e. using appx_dbscan vs. dbscan), in a way that keeps the API of your algorithm close to that of the existing one (Dbscan::predict(&hyperparams, &dataset); -> AppxDbscan::predict(&hyperparams, &dataset);), which I believe usually accepts data in the form of an ndarray &Array2<f64> structure variant. I also think it would be really interesting to see benchmarks comparing performance between the two!

relf commented 3 years ago

Hello, I've started to work on a port of sklearn Gaussian Mixture Model (mentioned by @xd009642 above). I would be happy to contribute to the linfa-clustering sub-crate. Btw, thanks for the linfa initiative which is really promising.

bytesnake commented 3 years ago

Gaussian Mixture Models were added in #56, thanks to @relf

bytesnake commented 3 years ago

Gaussian naïve Bayes was added in #51, kudos to @VasanthakumarV

kno10 commented 3 years ago

Fast K-medoids clustering (PAM, FasterPAM) implementations: https://crates.io/crates/kmedoids

rrichardson commented 3 years ago

Markov Chains was mentioned above, but I'd really like to see Hidden Markov Models. I think it'd fit better under the "supervised learning" set of algorithms, even though it has unsupervised applications as well.
I made a toy project with https://github.com/paulkernfeld/hmmm a while back and it seemed solid enough. There are no (published) API docs, but the code itself is quite small, and very well documented.

bytesnake commented 3 years ago

the Partial Least Squares family was added in #95, thanks to @relf

jkabc123 commented 3 years ago

I'd like to implement Random Forest.

bytesnake commented 3 years ago

Preprocessing with normalisation, count-vectorizer and tf-idf merged in #93, kudos to @Sauro98

rust-ml / linfa

Roadmap #7