onyx-platform / onyx

Distributed, masterless, high performance, fault tolerant data processing
http://www.onyxplatform.org
Eclipse Public License 1.0
2.05k stars 204 forks

Integration with Tachyon? #386

Open kovasb opened 8 years ago

kovasb commented 8 years ago

I would be pretty interested in some built-in support for this. http://tachyon-project.org/

MichaelDrogalis commented 8 years ago

We'll take a look at this. Thanks for the idea! :)

kovasb commented 8 years ago

Cool. It seems to have huge momentum and many different kinds of use cases.

For instance, you can put the RocksDB data on there and have it transparently flush cold data to different storage tiers, or move it around from machine to machine.

My use case is more for batch data processing, where workers can write data to Tachyon and let it deal with moving data from machine to machine (and persisting it to permanent storage), with Onyx orchestrating work unit distribution.

MichaelDrogalis commented 8 years ago

It's going to be a while before we can look at it for that kind of use case (Feb-March), as we have other things that are higher priority at the moment. I'll keep it in the back of my mind going forward though. It would be nice to get a plugin to read from Tachyon as a generic input/output stream. That could happen sooner.

lbradstreet commented 8 years ago

I'd read about Tachyon a while back and had it on my list of things to check again later. I'm definitely interested, though as Michael says, it may take some time.

ohpauleez commented 8 years ago

Just adding some ammunition here - I built a very fast and very successful distributed computational pipeline that heavily used Tachyon. I think getting the most out of Tachyon might involve some rearchitecting of Onyx.

MichaelDrogalis commented 8 years ago

Thanks @ohpauleez. We're unlikely to make a major architectural pivot, as the streaming engine is performing well (and is a large investment), but we appreciate the data point.

lbradstreet commented 8 years ago

We could probably do something similar to Flink and provide a Tachyon input and output plugin, and/or useful lifecycle calls that would allow peers to load data from Tachyon as part of the usual task lifecycle.

With our upcoming scheduler we could probably get even greater improvements by scheduling tasks that require a given piece of data near where that data is stored in Tachyon, which would give us some nice data locality properties.

This is still not a priority for us, and we haven't seen the demand yet, but if anyone is interested enough I'd be happy to devote time to answering questions and helping where I can.
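
For anyone picking this up, the data-locality idea mentioned above can be illustrated with a rough sketch. This is not Onyx's scheduler and it does not call any Tachyon API: the function `assign_tasks`, the block-to-host map, and the peer and host names are all hypothetical, and the sketch assumes block locations have already been looked up somehow. It simply prefers the peer whose host holds the most of a task's input blocks, breaking ties by current load.

```python
from collections import defaultdict


def assign_tasks(task_blocks, block_hosts, peer_hosts):
    """Greedy locality-aware assignment (illustrative only).

    task_blocks: task id -> list of block ids the task reads
    block_hosts: block id -> set of hosts storing that block (assumed known)
    peer_hosts:  peer id -> host the peer runs on
    Returns a dict mapping task id -> chosen peer id.
    """
    load = defaultdict(int)  # number of tasks already given to each peer
    assignment = {}
    for task in sorted(task_blocks):
        blocks = task_blocks[task]

        def score(peer):
            host = peer_hosts[peer]
            # Count how many of this task's blocks live on the peer's host.
            local = sum(1 for b in blocks if host in block_hosts.get(b, ()))
            # More local blocks first, then the less loaded peer, then by name.
            return (-local, load[peer], peer)

        best = min(peer_hosts, key=score)
        assignment[task] = best
        load[best] += 1
    return assignment


# Hypothetical cluster: three peers, three blocks.
task_blocks = {"t1": ["b1", "b2"], "t2": ["b3"], "t3": ["b1"]}
block_hosts = {"b1": {"h1"}, "b2": {"h1", "h2"}, "b3": {"h2"}}
peer_hosts = {"p1": "h1", "p2": "h2", "p3": "h3"}
assignment = assign_tasks(task_blocks, block_hosts, peer_hosts)
print(assignment)
```

A real scheduler would also have to weigh peer capacity and react to data moving between storage tiers; the greedy rule here only shows where locality information would enter the decision.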