Open LukeMathWalker opened 4 years ago
Hi, I'm eager to help. I'll take linear regression, Lasso and ridge.
Cool! I worked a bit on linear regression a while ago - you can find a very vanilla implementation of it here: https://github.com/rust-ndarray/ndarray-examples/tree/master/linear_regression @Nimpruda
What does `Normalization` mean? Is it like sklearn's `StandardScaler` or something else?
Exactly @InCogNiTo124.
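For readers who have not used it, the kind of normalization being confirmed here can be sketched in a few lines. This is only an illustration under stated assumptions: `standardize` is a hypothetical helper written against plain `Vec<Vec<f64>>` rows, not a linfa or ndarray API, and it uses the population standard deviation, which is the convention `StandardScaler` follows.

```rust
/// Column-wise standardization: subtract each column's mean and divide by its
/// population standard deviation.
/// NOTE: `standardize` is a hypothetical helper for illustration, not part of linfa.
fn standardize(rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    if rows.is_empty() {
        return Vec::new();
    }
    let n = rows.len() as f64;
    let cols = rows[0].len();

    // Per-column mean: accumulate the sum first, then divide once.
    let mut mean = vec![0.0_f64; cols];
    for row in rows {
        for (m, x) in mean.iter_mut().zip(row) {
            *m += x;
        }
    }
    for m in mean.iter_mut() {
        *m /= n;
    }

    // Per-column population standard deviation.
    let mut scale = vec![0.0_f64; cols];
    for row in rows {
        for ((s, x), m) in scale.iter_mut().zip(row).zip(&mean) {
            *s += (x - m).powi(2);
        }
    }
    for s in scale.iter_mut() {
        *s = (*s / n).sqrt();
        // Constant columns have zero spread: centre them but leave them unscaled.
        if *s == 0.0 {
            *s = 1.0;
        }
    }

    rows.iter()
        .map(|row| {
            row.iter()
                .zip(mean.iter().zip(&scale))
                .map(|(x, (m, s))| (x - m) / s)
                .collect()
        })
        .collect()
}
```

After the call, every non-constant column has zero mean and unit variance; a constant column simply becomes all zeros.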
This is an interesting project and I will work on the PCA implementation
I am the author of the friedrich crate which implements Gaussian Processes.
While it is still a work in progress, it is fully featured and I would be happy to help integrate it into the project if you have directions to do so.
That would be awesome @nestordemeure - I'll have a look at the project and I'll get back to you! Should I open an issue on `friedrich`'s repository when I am ready? Or would you prefer it to be tracked here on the `linfa` repository?
Both are ok with me.
An issue in friedrich's repository might help avoid overcrowding linfa with issues, but do as you prefer.
I'd love to take the Nearest Neighbors implementation
I think this is really great. I just started on an sklearn-like implementation of their pipelines here, but it is more or less for experimentation, nothing serious. I'll be sure to keep my eye on issues/goals here and help out where I can. Thanks for the initiative! :clap:
Hi there! First off, I don't have any experience in ML, but I read a lot about it (and listen to way too many podcasts on the topic). I'm interested in jumping in. I have quite some experience developing in Rust, and specifically high fidelity simulation tools (cf nyx and hifitime).
I wrote an Ant Colony Optimizer in Rust. ACOs are great for traversing graphs which represent a solution space, a problem which is considered NP-hard if I'm not mistaken. Is that something used at all in ML? If so, would it be of interest to this library, or is there a greater interest (for now) in focusing on the problems listed in the first post?
Cheers
Hi @ChristopherRabotin, I've never heard of ACOs, but as they relate to graphs you should check whether they have any uses with Markov Chains.
So far, I haven't found how both can be used together. The closest I found was several papers which use Markov Chains to analyze ACOs.
I would like to take the Naive Bayes one.
I'll take on Gaussian Processes.
I'll put some work towards the text tokenization algorithms (CountVectorizer and TFIDF). I'm also extremely interested in a good SVM implementation in Rust. Whoever is working on that, let me know if you'd like some help or anything.
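As a rough picture of what these two tokenization steps compute, here is a minimal sketch. It assumes whitespace tokenization, lowercasing, and the unsmoothed textbook weighting `tf * ln(N / df)`; sklearn's defaults differ (it smooths the idf term and normalizes rows), and the function names `count_vectorize` and `tf_idf` are hypothetical, not an existing crate's API.

```rust
use std::collections::BTreeMap;

/// Builds a sorted vocabulary and a document-term count matrix.
/// NOTE: hypothetical helper; tokenization is plain whitespace splitting.
fn count_vectorize(docs: &[&str]) -> (Vec<String>, Vec<Vec<f64>>) {
    // A BTreeMap keeps the vocabulary in deterministic (sorted) order.
    let mut vocab: BTreeMap<String, usize> = BTreeMap::new();
    for doc in docs {
        for tok in doc.split_whitespace() {
            vocab.entry(tok.to_lowercase()).or_insert(0);
        }
    }
    // Assign column indices following the sorted order of the terms.
    for (i, idx) in vocab.values_mut().enumerate() {
        *idx = i;
    }
    let mut counts = vec![vec![0.0_f64; vocab.len()]; docs.len()];
    for (row, doc) in counts.iter_mut().zip(docs) {
        for tok in doc.split_whitespace() {
            row[vocab[&tok.to_lowercase()]] += 1.0;
        }
    }
    (vocab.into_keys().collect(), counts)
}

/// Reweights raw counts as tf * ln(N / df), the unsmoothed textbook variant.
fn tf_idf(counts: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n_docs = counts.len() as f64;
    let n_terms = counts.first().map_or(0, |r| r.len());
    // Document frequency: how many documents contain each term.
    let mut df = vec![0.0_f64; n_terms];
    for row in counts {
        for (d, c) in df.iter_mut().zip(row) {
            if *c > 0.0 {
                *d += 1.0;
            }
        }
    }
    counts
        .iter()
        .map(|row| {
            row.iter()
                .zip(&df)
                .map(|(c, d)| if *d > 0.0 { c * (n_docs / d).ln() } else { 0.0 })
                .collect()
        })
        .collect()
}
```

With this weighting, a term that appears in every document gets weight zero, which is the intuition behind tf-idf: ubiquitous words carry no discriminating information.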
Please take a look at what is already out there before diving head down into a reimplementation @tyfarnan - I haven't had the time to look at friedrich by @nestordemeure yet (taking a break after the final push to release the blog post and related code 😅) but we should definitely start from there as well as the GP sub-module in rusty-machine.
@tyfarnan, don't hesitate to contact me via an issue on friedrich's repository once @LukeMathWalker has clarified what is expected of code that is integrated into Linfa and how this integration will be done.
I did a quick round up of crates that implement the algorithms listed on the roadmap. Probably missed quite a few too but this can be a good starting point.
It was just a quick search, so I don't know how relevant each crate is, but I tried to make a note if the crate was old and unmaintained. Hopefully this can be useful for helping with algorithm design or saving us from having to reimplement something that is already there.
Tracking `friedrich` <> `linfa` integration here: https://github.com/nestordemeure/friedrich/issues/1
I have updated the Issue to make sure it's immediately clear who is working on what and what items are still looking for an owner 👍
hey @LukeMathWalker could you add me next to the normalization? I plan to do it by New Year's as I'm still not very experienced with Rust, but I have an idea how to implement it
Done @InCogNiTo124 :pray:
Started implementing DBScan in #12.
Also, if suggestions are welcome, Gaussian Mixture Models would be cool.
Implementation of `DBSCAN` merged to master - thanks @xd009642 :pray:
Hi, really cool project! I have a question concerning the scope: do you eventually want to have deep learning and reinforcement learning algorithms too? I guess I'm curious to know if adding them is the plan eventually, but you want to start with the easier stuff, or if you think along the lines of the scikit devs themselves: here.
Either way, I'll be glad to help spread the rust gospel. Right now I'm going through the Reinforcement Learning book, and I will implement some of the algorithms; if that's in the scope of linfa, I'll be glad to try adding them to it. If not, I plan to read through Understanding Machine Learning afterwards, and thus will eventually reach some of the algorithms in the roadmap. Then I will help by implementing them. :)
From previous discussions, deep learning etc. is out of scope for the same reasons as it is for scikit-learn. @LukeMathWalker might have more to say about it or reinforcement learning :smile:
Ok, thanks.
I would consider both of them to be out of scope for this project - it's already incredibly broad as it is right now :sweat_smile: I'd love to see something spawn up for reinforcement learning, especially gym environments!
Can you also include Non-Negative Matrix Factorization (NMF) in the list for pre-processing steps? It's a standard algorithm in NLP/audio enhancement and decomposes a matrix into the product of two non-negative matrices. (https://en.wikipedia.org/wiki/Non-negative_matrix_factorization)
One of the nice properties is that there is a simple incremental algorithm for solving the problem, with a simple modification for sparsity constraints.
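One well-known iterative scheme for NMF is the multiplicative update rule of Lee and Seung for the Frobenius objective; whether that is the exact algorithm meant above is an assumption. The sketch below uses plain nested `Vec`s, a deterministic (non-random) initialization chosen only to keep the example reproducible, and hypothetical function names. It omits the sparsity modification.

```rust
type Matrix = Vec<Vec<f64>>;

fn matmul(a: &Matrix, b: &Matrix) -> Matrix {
    let (n, k, m) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; m]; n];
    for i in 0..n {
        for p in 0..k {
            for j in 0..m {
                out[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    out
}

fn transpose(a: &Matrix) -> Matrix {
    let (n, m) = (a.len(), a[0].len());
    (0..m).map(|j| (0..n).map(|i| a[i][j]).collect()).collect()
}

/// Squared Frobenius norm of V - W H.
fn reconstruction_error(v: &Matrix, w: &Matrix, h: &Matrix) -> f64 {
    let wh = matmul(w, h);
    v.iter()
        .zip(&wh)
        .flat_map(|(rv, rw)| rv.iter().zip(rw))
        .map(|(a, b)| (a - b).powi(2))
        .sum()
}

/// One round of multiplicative updates. Entries stay non-negative because
/// each update multiplies by a ratio of non-negative quantities.
fn nmf_step(v: &Matrix, w: &mut Matrix, h: &mut Matrix) {
    const EPS: f64 = 1e-12; // guards against division by zero

    // H <- H * (W^T V) / (W^T W H), element-wise
    let wt = transpose(w);
    let num = matmul(&wt, v);
    let den = matmul(&matmul(&wt, w), h);
    for (i, row) in h.iter_mut().enumerate() {
        for (j, x) in row.iter_mut().enumerate() {
            *x *= num[i][j] / (den[i][j] + EPS);
        }
    }

    // W <- W * (V H^T) / (W H H^T), element-wise
    let ht = transpose(h);
    let num = matmul(v, &ht);
    let den = matmul(w, &matmul(h, &ht));
    for (i, row) in w.iter_mut().enumerate() {
        for (j, x) in row.iter_mut().enumerate() {
            *x *= num[i][j] / (den[i][j] + EPS);
        }
    }
}

/// Factorizes a non-negative `v` (n x m) into `w` (n x rank) and `h` (rank x m).
/// ASSUMPTION: the fixed positive initialization is for reproducibility only;
/// practical implementations usually initialize randomly or via SVD.
fn nmf(v: &Matrix, rank: usize, iters: usize) -> (Matrix, Matrix) {
    let (n, m) = (v.len(), v[0].len());
    let init = |r: usize, c: usize| 1.0 + ((r + 2 * c) % 3) as f64 * 0.25;
    let mut w: Matrix = (0..n).map(|i| (0..rank).map(|j| init(i, j)).collect()).collect();
    let mut h: Matrix = (0..rank).map(|i| (0..m).map(|j| init(i, j)).collect()).collect();
    for _ in 0..iters {
        nmf_step(v, &mut w, &mut h);
    }
    (w, h)
}
```

Each step only needs a handful of matrix products, which is what makes the scheme easy to run incrementally on top of an existing factorization.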
Please take a look at the PR https://github.com/rust-ndarray/ndarray-linalg/pull/184 which adds `TruncatedEig` and `TruncatedSvd` routines to the library. Both are based on the LOBPCG algorithm and allow an iterative approach to eigenvalue and singular value decomposition. This is used in PCA, manifold learning (e.g. spectral clustering) and discriminant analysis and is therefore useful here too. The algorithm also supports sparse problems, because the operator is defined in a matrix-free way (the matrix `A` is provided as a closure in the function call to LOBPCG).
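To illustrate what "matrix-free" means here, the sketch below finds a dominant eigenpair of a symmetric operator that is only ever available as a closure computing `A * x`. It deliberately uses plain power iteration, which is far simpler than LOBPCG and is not what the PR implements; the function name and signature are hypothetical and do not mirror the ndarray-linalg API.

```rust
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Scales `x` to unit Euclidean norm (no-op for the zero vector).
fn normalize(x: &mut [f64]) {
    let norm = x.iter().map(|v| v * v).sum::<f64>().sqrt();
    if norm > 0.0 {
        for v in x.iter_mut() {
            *v /= norm;
        }
    }
}

/// Power iteration on an operator given only as `apply(x) = A * x`.
/// The matrix itself is never stored, so `apply` can exploit sparsity
/// or structure however it likes.
/// NOTE: hypothetical signature; a stand-in for the idea, not for LOBPCG.
fn power_iteration<F>(apply: F, dim: usize, iters: usize) -> (f64, Vec<f64>)
where
    F: Fn(&[f64]) -> Vec<f64>,
{
    // Deterministic, non-constant start vector.
    let mut x: Vec<f64> = (0..dim).map(|i| 1.0 + i as f64).collect();
    normalize(&mut x);

    let mut eigenvalue = 0.0;
    for _ in 0..iters {
        let y = apply(x.as_slice());
        // Rayleigh quotient x^T A x; valid because x has unit norm.
        eigenvalue = dot(&x, &y);
        x = y;
        normalize(&mut x);
    }
    (eigenvalue, x)
}
```

For example, passing `|x: &[f64]| vec![2.0 * x[0] + x[1], x[0] + 2.0 * x[1]]` with `dim = 2` applies the matrix `[[2, 1], [1, 2]]` without ever building it; the iteration converges to its largest eigenvalue, 3, with an eigenvector whose two components are equal.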
Can you also add me to the spectral clustering task? I will try to implement classical Multidimensional Scaling. The t-SNE technique is interesting too, but requires more time because it is based on a custom optimization problem. Furthermore, Gaussian embedding is another interesting technique I used recently, but it requires at least SGD for a single-layer NN. See papers here:
Hi there!
I've been working on an implementation of decision trees here. It's still a WIP and needs documentation but it's a start at least. Once it's a bit more polished I can look at random forests also.
Hello, if possible, I would like to take on the implementation of the ICA algorithm. My implementation will take this paper as a guide.
@VirtualSpaceman That would be great--any pull requests are very welcome
ordinary linear regression was added in #20, thanks to @Nimpruda and @paulkoerbitz
linear decision trees were added in #18, kudos to @mossbanay
fast Independent Component Analysis was added in #47, kudos to @VasanthakumarV
I'm gonna start working on OPTICS at some point soon :eyes:
Hello, I am working on an implementation of the Approximated DBSCAN algorithm here and I was wondering if that is something that could be interesting for this project. Right now it has all the basic functionalities implemented and I would happily make any changes to make it fit here
@Sauro98 I believe linfa currently has an implementation of the vanilla DBSCAN algorithm here, but an approximate version would be a great addition! It would be great if you could open a pull request following that same style under the `linfa-clustering` sub-crate (i.e. using `appx_dbscan` vs. `dbscan`), in a way that maintains some closeness between the API for the existing algorithm and your own (`Dbscan::predict(&hyperparams, &dataset);` -> `AppxDbscan::predict(&hyperparams, &dataset);`), which I believe usually accepts data in the form of an ndarray `&Array2<f64>` structure variant. I also think it would be really interesting to see benchmarks comparing performance between the two!
Hello, I've started to work on a port of sklearn Gaussian Mixture Model (mentioned by @xd009642 above). I would be happy to contribute to the `linfa-clustering` sub-crate. Btw, thanks for the linfa initiative which is really promising.
Gaussian Mixture Models were added in #56, thanks to @relf
Gaussian naïve Bayes was added in #51, kudos to @VasanthakumarV
Fast K-medoids clustering (PAM, FasterPAM) implementations: https://crates.io/crates/kmedoids
Markov Chains was mentioned above, but I'd really like to see Hidden Markov Models. I think it'd fit better under the "supervised learning" set of algorithms, even though it has unsupervised applications as well.
I made a toy project with https://github.com/paulkernfeld/hmmm a while back and it seemed solid enough.
There are no (published) API docs, but the code itself is quite small, and very well documented.
the Partial Least Squares family was added in #95, thanks to @relf
I'd like to implement Random Forest.
Preprocessing with normalisation, count-vectorizer and tf-idf merged in #93, kudos to @Sauro98
In terms of functionality, the mid-term end goal is to achieve an offering of ML algorithms and pre-processing routines comparable to what is currently available in Python's `scikit-learn`. These algorithms can either be:
In no particular order, focusing on the main gaps:
Clustering:
Preprocessing:
Supervised Learning:
`friedrich` - tracking issue https://github.com/nestordemeure/friedrich/issues/1

The collection is on purpose loose and non-exhaustive; it will evolve over time. If there is an ML algorithm that you find yourself using often on a day-to-day basis, please feel free to contribute it :100: