jpatanooga / Metronome

Suite of parallel iterative algorithms built on top of Iterative Reduce
Apache License 2.0

what exactly is the input data format expected by Metronome? #2

Open pchalasani opened 10 years ago

pchalasani commented 10 years ago

subject says it all

jpatanooga commented 10 years ago

It's really similar to the SVMLight format: a line-oriented, CSV-style format, which we changed slightly to accommodate multiple outputs. The best reference is the unit test:

https://github.com/jpatanooga/Metronome/blob/master/src/test/java/tv/floe/metronome/io/records/TestMetronomeVectorizatonFormat.java

but in general it comes down to a mapping of an input vector to an output vector:

[i0 i1 i2 | o0 o1 o2]

where spaces separate the vector entries and each entry is indexed to save space. We provide the vectorization class (MetronomeRecordFactory) with a schema, as shown in the unit test.
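A minimal sketch of how a line in this style could be parsed, assuming each entry is written as `index:value` (as in SVMLight) and a `|` separates the input vector from the output vector. The exact syntax Metronome accepts is defined by the unit test linked above, not by this sketch; the names `parse_sparse` and `parse_record` and the sample line are hypothetical and are not part of Metronome.

```python
from typing import Dict, Tuple

# Sparse vector: only the indexed entries that appear on the line are stored.
SparseVector = Dict[int, float]


def parse_sparse(part: str) -> SparseVector:
    """Parse space-separated `index:value` entries into {index: value}."""
    vector: SparseVector = {}
    for token in part.split():
        index, sep, value = token.partition(":")
        if not sep:
            raise ValueError(f"expected index:value, got {token!r}")
        vector[int(index)] = float(value)
    return vector


def parse_record(line: str) -> Tuple[SparseVector, SparseVector]:
    """Split one record on '|' into (input vector, output vector)."""
    inputs, sep, outputs = line.partition("|")
    if not sep:
        raise ValueError("record has no '|' separating inputs from outputs")
    return parse_sparse(inputs), parse_sparse(outputs)


# Hypothetical sample line (assumed syntax): three inputs, two outputs.
inputs, outputs = parse_record("0:1.0 3:0.5 7:2.0 | 0:1.0 1:0.0")
```

Keeping each vector as an index-to-value map means positions that are not listed stay implicit, which is where the space saving of an indexed format would come from.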

So yeah, it's a bit custom, but after looking around and thinking about it we just wanted something simple to map input to output, and this made sense.

Adam and I are working on some more robust and complete vectorization tools (https://github.com/jpatanooga/Canova, still a work in progress) that will interoperate with a number of formats, run serially or in MapReduce, and should make all of this simpler. Today Metronome should be considered alpha/beta software at best, and that's why you don't see a more robust set of input formats for every tool. If you compare it to, say, MLlib in Spark, you'll see that we're at about a similar state (some of their stuff is hardcoded to arbitrary CSV formats).

TL;DR: yes, vectorization and input formats are important, and we're thinking hard about it all holistically (Canova).

Thanks!

JP

agibsonccc commented 10 years ago

I would like to add here that this is a big problem. Rather than take an ad hoc approach, Canova will also support different modes of feature extraction for various kinds of data.

Lots of people don't think about word vectors, moving windows on images, and other harder kinds of formats.

Featurization is a huge problem we'll be tackling here in the coming weeks. As ambitious as it sounds, much of this is being incubated in the deeplearning4j project now, and a more "neutral" version of this, with support for SVMLight and other formats, will be provided by Canova.

pchalasani commented 10 years ago

Thanks for the clarifications. I was just trying to figure out how I can (say) use Metronome to deploy deep learning on Hadoop for one of our datasets. Eventually, I'll probably put a friendly Clojure wrapper around it.

jpatanooga commented 10 years ago

Glad we could help. Let me know if you need help getting it going; I can help you triage errors, etc.

JP
