onyx-platform / onyx

Distributed, masterless, high performance, fault tolerant data processing
http://www.onyxplatform.org
Eclipse Public License 1.0

Machine learning API #797

Open solatis opened 7 years ago

solatis commented 7 years ago

As discussed on Clojurians, I would like to use this issue to start describing some machine learning functionality that could be achieved with Onyx, and/or what such an API would look like.

Scope

First of all, I think Onyx should not try to support every kind of machine learning algorithm out there; the area is crowded, and I think Onyx is better suited to certain types of algorithms than to others.

On my list of what I would like to see Onyx able to better facilitate, I'm going to pick a few samples that I'm personally very familiar with -- as such it's heavily biased, but it's a good starting point for the discussion:

And I'm going to leave the following specifically out of scope:

Workflow

Let's separate the ways we could use Onyx for ML into two categories:

The typical workflow for training an ML algorithm looks as follows:

Design

I'm not 100% sure about the actual API yet, but I can already see a few patterns here:

I'm not sure whether these things are in/out of scope for Onyx, or belong in an onyx-ml plugin library; we would have to further explore this.

enragedginger commented 7 years ago

Thanks for putting this together. I'd also like to see us build out some kind of machine learning functionality on top of Onyx. I attempted to build a Clojure wrapper for TensorFlow that would let you construct TensorFlow graphs using idiomatic Clojure data structures and patterns (much like Onyx). The issue is that most of the useful functionality in TensorFlow is baked into the Python client library and is not available as part of the data model / C bindings.

One of the huge advantages that Onyx has over everything else in the world is that its API and data model are equivalent. Therefore, when building out any sort of ML functionality on Onyx (and I think this goes without saying), we should adhere to that principle.
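To make the "API and data model are equivalent" principle concrete, here is a minimal sketch in Python of what describing a training job as plain data could look like. The edge-pair workflow shape is loosely modeled on how Onyx expresses workflows, but every task name, key, and helper function below is hypothetical and is not part of Onyx or of any proposed onyx-ml API.

```python
import copy
import json

# Hypothetical job description as plain data. None of these task names or
# keys come from Onyx; they only illustrate the "data is the API" idea.
training_job = {
    # Each pair is a directed edge: output of the first task feeds the second.
    "workflow": [
        ["read-samples", "extract-features"],
        ["extract-features", "train-model"],
    ],
    "tasks": {
        "train-model": {"algorithm": "random-forest", "n-trees": 100},
    },
}


def with_task_param(job, task, key, value):
    """Return a copy of the job with one task parameter changed."""
    updated = copy.deepcopy(job)
    updated["tasks"][task][key] = value
    return updated


def downstream_of(job, task):
    """List the tasks that directly consume the output of `task`."""
    return [dst for src, dst in job["workflow"] if src == task]


# Because the job is only data, it can be serialized, stored, and diffed
# without any special client library.
serialized = json.dumps(training_job)
```

Since the description is ordinary data, changing a hyperparameter or inspecting the graph needs nothing beyond regular functions, which is the property a Python-only client library like the one mentioned above does not give you.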

In my experience, most of my machine learning projects perform best using either random forest or (more recently) XGBoost. The other algorithms might be nice for academic purposes, but if we're looking to build something that the 30 or so people who know both Clojure and machine learning are looking to use, then we should probably just focus on those two algorithms. However, I'm open to suggestions.

Also, I think we should not rule out neural networks entirely. I think we should focus on XGBoost and random forest for now and then look at tackling neural networks to some degree after that.

solatis commented 7 years ago

I agree, random forests are one of the more useful algorithms out there and should cover a lot of ground. If we go one level of abstraction higher and look at whether onyx-core needs any additional functionality to make developing these types of algorithms feasible, what are your thoughts on that?

Due to Onyx's flexible nature, I'm fairly certain that most of these things can be offloaded to a separate library; perhaps we should just bite the bullet and start working on that, and see where that takes us.

MichaelDrogalis commented 7 years ago

ML isn't exactly my sweet spot -- but do let us know if you'd like estimates of how difficult any changes to core would be to support, or suggestions about how to structure the library.

enragedginger commented 7 years ago

Does anyone have links to any particularly helpful papers on random forest or XGBoost? I'm having a hard time finding anything that gives a clear-cut explanation of the algorithms.

alanmarazzi commented 6 years ago

Hi! There's this monograph about random forest and its variants. Then there is Chapter 15 of The Elements of Statistical Learning, which is very good.

Unfortunately, for XGBoost I can only suggest Chen's paper.
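For readers who want the core idea of random forest before opening those references, a minimal sketch in Python follows. It is a simplification, not a reference implementation: each tree is a depth-1 stump, all function names are made up for this illustration, and nothing here is tied to Onyx. It shows the two ingredients that define the method: every tree is trained on a bootstrap sample using a random subset of features, and the forest predicts by majority vote.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows, labels, feature_ids):
    """Fit a one-split tree using only the given candidate features."""
    best = None  # (weighted impurity, feature, threshold, left, right)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        # No usable split (e.g. the sample had a single value): predict majority.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, f, t, left_label, right_label = best
    return {"feature": f, "threshold": t, "left": left_label, "right": right_label}


def predict_stump(stump, row):
    if "leaf" in stump:
        return stump["leaf"]
    side = "left" if row[stump["feature"]] <= stump["threshold"] else "right"
    return stump[side]


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]
```

A production random forest grows deep trees and re-samples the candidate features at every split; with stumps there is only one split, so sampling once per tree is equivalent here. Note that each fitted tree is a plain dictionary, so the trained model is itself just data.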