Machine learning API - Githubissues

solatis commented 7 years ago

As discussed on Clojurians, I would like to use this issue to start describing some machine learning functionality that could be achieved with Onyx, and/or what such an API would look like.

Scope

First of all, I think Onyx should not try to support every kind of machine learning algorithm out there; the area is crowded, and I think there are certain types of algorithms that Onyx is better suited to than others.

On my list of what I would like to see Onyx able to better facilitate, I'm going to pick a few samples that I'm personally very familiar with -- as such it's heavily biased, but it's a good starting point for the discussion:

Naive Bayes, a statistical classification algorithm
K-Means, a clustering algorithm
Random Forests, a tree-based classification algorithm
Nearest neighbour, a high-dimensional classification algorithm

And I'm going to leave one specifically out of scope:

Neural networks, a classification algorithm; they are a category of their own, special enough that Google built a custom processing unit for them, which is available in the Google Cloud through Tensorflow. Neural networks are best trained using tools that are completely dedicated to them.

Workflow

Let's separate the ways we could use Onyx for ML into two cases:

Using an existing model for predicting the outcome of a certain input; this is fairly straightforward, and doesn't require any additional facilities from Onyx
Training a new model; this is where things get interesting.

The typical workflow for training an ML algorithm looks as follows:

Split all input data into two groups, training data and test data (say, 90%/10%)
Train the model; this can take either
- a predefined number of steps (e.g. naive Bayes and nearest neighbour algorithms)
- a virtually unlimited number of iterations, where we stop when we think things are "good enough" (e.g. k-means)
Often, these algorithms work as input/output for each other; for example, when using an enormous dataset, you often want to split it into different clusters using K-Means and then create a smaller model for each cluster
After we're done training the model on the 90% training data, we evaluate model performance on the 10% test data and calculate a score as a benchmark
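To make the steps above concrete, here is a minimal sketch in plain Python (not Clojure, and not any Onyx API). The function names, the toy dataset and the choice of a 1-nearest-neighbour classifier are all assumptions made purely for illustration:

```python
import random

def split_train_test(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of the records and hold out a fraction as test data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_nearest(train, point):
    """1-nearest-neighbour: return the label of the closest training record."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda rec: sq_dist(rec[0], point))
    return label

def accuracy(train, test):
    """Fraction of held-out records whose label is predicted correctly."""
    correct = sum(
        1 for features, label in test
        if predict_nearest(train, features) == label
    )
    return correct / len(test)

# Hypothetical toy data: two well-separated groups of labelled points.
data = [((float(i), 0.0), "low") for i in range(50)]
data += [((float(i) + 1000.0, 0.0), "high") for i in range(50)]

train, test = split_train_test(data)   # 90% / 10%
score = accuracy(train, test)          # benchmark on the held-out 10%
```

For nearest neighbour the "training" step is just keeping the training records around, which is why it falls in the "predefined number of steps" category.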

Design

I'm not 100% sure about the actual API yet, but I can already see a few patterns here:

a way to easily split out the training vs test data
a window-like function that is repeated until a certain amount of time or number of training iterations has passed
perhaps something with flow conditions and distributing model training over a number of peers based on clustering
in addition to having the trained model as output, the model's performance/score is also an important trait
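As a rough illustration of the "repeat until good enough" pattern and of returning a score next to the model, here is a sketch of 1-D k-means in plain Python. It is not Onyx code; the function name, the parameters (`max_iterations`, `tolerance`, `max_seconds`) and the shape of the result map are assumptions for the sake of discussion:

```python
import time

def kmeans_1d(points, centroids, max_iterations=100, tolerance=1e-6,
              max_seconds=None):
    """Lloyd's algorithm on 1-D data.

    Stops when the centroids move by at most `tolerance`, or when the
    iteration budget or the optional time budget runs out.
    """
    centroids = list(centroids)
    started = time.monotonic()
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        shift = max(abs(a - b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tolerance:
            break  # "good enough": centroids have stopped moving
        if max_seconds is not None and time.monotonic() - started >= max_seconds:
            break  # time budget exhausted
    # Score: within-cluster sum of squared distances (lower is better).
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return {"centroids": centroids, "iterations": iteration, "score": inertia}

result = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centroids=[0.0, 5.0])
```

Each pass of the loop body is the part I imagine as the window-like function; the stop conditions and the returned score are what the surrounding machinery would have to support.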

I'm not sure whether these things are in/out of scope for Onyx, or belong in an onyx-ml plugin library; we would have to further explore this.

enragedginger commented 7 years ago

Thanks for putting this together. I'd also like to see us build out some kind of machine learning functionality on top of Onyx. I attempted to build out a Clojure wrapper for Tensorflow that would allow you to build Tensorflow graphs using idiomatic Clojure data structures and patterns (much like Onyx). The issue is that most of the useful functionality in Tensorflow is baked into the Python client library and is not available as part of the data model / C bindings.

One of the huge advantages that Onyx has over everything else in the world is that its API and data model are equivalent. Therefore, when building out any sort of ML functionality on Onyx (and I think this goes without saying), we should adhere to that principle.

In my experience, most of my machine learning projects perform best using either random forest or (more recently) XGBoost. The other algorithms might be nice for academic purposes, but if we're looking to build something that the 30 or so people who know both Clojure and machine learning are looking to use, then we should probably just focus on those two algorithms. However, I'm open to suggestions.

Also, I think we should not rule out neural networks entirely. I think we should focus on XGBoost and random forest for now and then look at tackling neural networks to some degree after that.

solatis commented 7 years ago

I agree, random forests are one of the more useful algorithms out there and should cover a lot of ground. If we go one level of abstraction higher and look at whether onyx-core needs any additional functionality to make developing these types of algorithms workable, what are your thoughts on that?

Due to Onyx's flexible nature, I'm fairly certain that most of these things can be offloaded to a separate library; perhaps we should just bite the bullet and start working on that, and see where that takes us.

MichaelDrogalis commented 7 years ago

ML isn't exactly my sweet spot -- but do let us know if you'd like estimates of how difficult any changes to core would be to support, or suggestions about how to structure the library.

enragedginger commented 7 years ago

Does anyone have links to any particularly helpful papers on random forest or XGBoost? I'm having a hard time finding anything that gives a clear-cut explanation of the algorithms.

alanmarazzi commented 6 years ago

Hi! There's this monograph about random forest and its various variants. Then there is Chapter 15 of Elements of Statistical Learning, which is very good.

Unfortunately for XGBoost I can only suggest Chen's paper.

onyx-platform / onyx

Machine learning API #797
