nathanmarz / cascalog

Data processing on Hadoop without the hassle.

Prepared operations are confusing #269

Open ipostelnik opened 9 years ago

ipostelnik commented 9 years ago

Functions declared using (prepfn) are treated by Cascalog as vanilla Clojure functions, so they behave as regular map/filter functions. If you want to make a prepared mapcat/buffer/aggregator, you have to write:

(def prepped-mapcat-op (mapcatop (prepfn [fp call] ...)))

rather than returning a mapcatfn out of prepfn, as the examples imply. In fact, there's never a reason to use any of the special *fn forms in the body of a prepared function.

We should come up with a better syntax for these or make better docs.

sritchie commented 9 years ago

Yeah, prepfn transforms inner calls to fn into the serializable version - if you're writing fns inline, you're correct, but if you define the internal fns in a let binding outside, like

(let [op (fn [x] ....)]
  (prepfn [fp call] op))

it won't work, because the transformation never sees that fn. So, sort of tricky.

I'm open to other ideas for syntax for sure. One idea is to define prep versions of all the macros, but that seems janky (defprepmapcatfn, crazy times!)

ipostelnik commented 9 years ago

I think at a minimum we should clarify that prepfn is a peer of s/fn, so effectively it's just a special kind of vanilla function. Using it in other contexts requires lifting via mapcatop/bufferop/etc.

sritchie commented 9 years ago

@ipostelnik a lot of prepfn impls I've seen exist to get access to counters. what do you think of #270 as a nicer way to get access to this stuff?

ipostelnik commented 9 years ago

I really like the stats implementation that hides the guts of Hadoop counters.

We have 2 use cases for prepared functions - counters and (effectively) simulating hash joins. We have a lot of variants of the latter that use richer data structures and logic than what plain hash-join allows.

sritchie commented 9 years ago

Nice. Custom hash joins are actually the next thing I wanted to play with. Would love some input on what you guys are doing, if you have any examples you might share.

waldoppper commented 9 years ago

@sritchie here is a goofy (cascalog v1) example:

(deffilterop ^:stateful user-can-enter-party?
  "A map-op which reads/parses a complex object in distributed cache to create a map of party-name->participant-set in order to filter out users in line at a party"
  ([] (-> (read-cached-party-information) make-party->user-set))

  ([party->user-set party user]
    (let [user-set (get party->user-set party)]
      (contains? user-set user)))

  ([_]))

ipostelnik commented 9 years ago

After thinking about this more: the big problem is in the Java Cascading/Clojure bridge. We need to know at query planning time whether to emit a ClojureMapcat or a ClojureMap operation. Instead, we should use function metadata to decide how to translate the return value into tuples.
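
To make the metadata idea concrete, here is a rough plain-Clojure sketch. It does not use the Cascalog or Cascading APIs at all; the `::op-type` metadata key and the `result->tuples` helper are hypothetical names invented for illustration. The point is only that the map-vs-mapcat decision could be made when the return value is turned into tuples, rather than at planning time.

```clojure
;; Hypothetical sketch: choose how to turn a return value into tuples
;; by looking at metadata on the function itself.
(defn result->tuples
  "Applies f to args and converts the result into a seq of 1-tuples.
  Reads the (hypothetical) ::op-type key from f's metadata and
  defaults to :map when it is absent. Unknown types make `case` throw."
  [f args]
  (let [ret (apply f args)]
    (case (::op-type (meta f) :map)
      :map    [[ret]]             ; one output tuple per call
      :mapcat (map vector ret)))) ; one output tuple per returned element

;; A map-style function and a mapcat-style function, told apart only
;; by their metadata.
(def double-it
  (with-meta (fn [x] (* 2 x)) {::op-type :map}))

(def neighbours
  (with-meta (fn [x] [(dec x) (inc x)]) {::op-type :mapcat}))
```

With something like this, a prepared function could return either kind of function and the bridge would not need to know which one in advance.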

sritchie commented 9 years ago

Oh, that's interesting. Yeah, I guess we could access that metadata from within Java. Any interest in trying that out, @ipostelnik ?

ipostelnik commented 9 years ago

I ended up writing a couple of macros modeled after XXXop and defXXXfn. The code is here: https://gist.github.com/ipostelnik/1d5566322fa1dec97b0a

I also wrote simple wrappers that lift get (as map and mapcat) and contains? into stateful ops using state loaded into a map or set.
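
For readers who have not opened the gist, the general shape of such wrappers can be sketched in plain Clojure. This is not the gist's code and it does not use Cascalog: `lift-get`, `lift-get-cat`, `lift-contains?` and the `load-state` argument are hypothetical names, and the state is assumed to be loaded lazily, once, on first use.

```clojure
;; Hypothetical sketch: lift `get` and `contains?` over state that is
;; loaded once and then reused for every input.
(defn lift-get
  "Map-style: returns a fn that looks k up in the loaded map."
  [load-state]
  (let [state (delay (load-state))] ; loaded on first deref only
    (fn [k] (get @state k))))

(defn lift-get-cat
  "Mapcat-style: assumes the map's values are collections and returns
  their elements as a seq (nil when the key is missing)."
  [load-state]
  (let [state (delay (load-state))]
    (fn [k] (seq (get @state k)))))

(defn lift-contains?
  "Filter-style: returns a predicate over the loaded set or map."
  [load-state]
  (let [state (delay (load-state))]
    (fn [k] (contains? @state k))))

;; Example with made-up data standing in for a distributed-cache read.
(def party->guests
  (lift-get-cat (fn [] {"launch" ["ann" "bob"]})))
```

In a real job the loader would read from the distributed cache, and the returned functions would still need lifting into Cascalog operations as discussed earlier in this thread.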