nathanmarz / cascalog

Data processing on Hadoop without the hassle.

kryo stack overflow on flow serialization #279

Open waldoppper opened 9 years ago

waldoppper commented 9 years ago

Using anonymous mapfns inside threaded queries leads to stack overflow exceptions. I assume this occurs while the flow is being serialized. I'm using 3.0.0-SNAPSHOT.

(def CONFIG {:a 1 :b 2})
(def DATA [[:a] [:b] [:c]])

(defn q1
  ;;small mapside hashjoin
  [mygen conf-map]
  (<- [?x ?y]
    (mygen ?x)
    ((mapfn [x] (get conf-map x 0)) ?x :> ?y)
    (:distinct false)))

(defn q2
  ;;2x multiplier
  [g]
  (<- [?x ?y ?z]
    (g ?x ?y)
    ((mapfn [n] (* n 2)) ?y :> ?z)
    (:distinct false)))

(defn q3
  ;;q1 and q2 combined
  [mygen conf-map]
  (<- [?x ?y ?z]
    (mygen ?x)
    ((mapfn [x] (get conf-map x 0)) ?x :> ?y)
    ((mapfn [y] (* y 2)) ?y :> ?z)
    (:distinct false)))

;; Tests
(??- (-> DATA (q1 CONFIG)))    ;; (([:a 1] [:b 2] [:c 0]))
(??- (-> DATA (q1 CONFIG) q2)) ;; blows up with stack overflow

(??- (-> DATA (q3 CONFIG)))    ;; (([:a 1 2] [:b 2 4] [:c 0 0]))

ipostelnik commented 9 years ago

This happens because we try to capture the environment of each anonymous mapfn. Inside q2 we capture g as part of the environment. This includes the full definition of the parent query with all the parameters and all captured environments of the anonymous operations defined there.

I think we should heavily discourage anonymous functions in queries.
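For q2 from the report, one way to avoid the anonymous function is to lift the multiplier into a named top-level operation, which has no local scope to capture. This is a minimal sketch, assuming `defmapfn` from `cascalog.api` is available in 3.0.0-SNAPSHOT as it is in 2.x; the name `times-two` and the namespace are made up for illustration, and the sketch has not been run against the failing case in this issue.

```clojure
(ns example.named-ops
  (:require [cascalog.api :refer :all]))

;; Named, top-level map operation. Unlike an anonymous mapfn defined inside
;; q2, it is not created in a scope where `g` is a local.
(defmapfn times-two [n] (* n 2))

(defn q2
  ;; 2x multiplier, same shape as the q2 in the report
  [g]
  (<- [?x ?y ?z]
      (g ?x ?y)
      (times-two ?y :> ?z)
      (:distinct false)))
```

The operation in q1 is different: it genuinely needs `conf-map`, so it still has to close over that value. Lifting only removes the unintended capture of the generator.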

ipostelnik commented 9 years ago

@sritchie any ideas on this one?

sritchie commented 9 years ago

Yeah, not sure how we'd cache across serialization boundaries. This is a strange one. I don't have time to dive deep on this fix, but even erroring out with a proper warning if you try to double-serialize an anonymous function would be good.

sritchie commented 9 years ago

I guess the outer serialization has access to the source code on the inner serialization, and could just substitute that in instead? Brainstorming from the phone here.

ipostelnik commented 9 years ago

The problem is that anonymous functions close over all symbols in their scope. Since we don't prune the captured environment, the generator argument is part of the serialized function and, eventually, the query. Once you pass this query to another function that uses an anonymous function, the inner query is repeated inside the environment of the second function.

I still don't understand why this would cause serialization cycles. It's certainly wasteful in terms of the size of the flow and could cause memory issues if we close over a large data structure unintentionally.
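The duplication described above can be modelled in plain Clojure without Cascalog. This is a hypothetical model, not Cascalog's implementation: `make-op`, `make-query`, `node-count`, and `prune-env` are invented names, and an operation is represented as a map of its source form plus everything it closed over. The model shows the growth in what a recursive serializer has to walk; it does not reproduce a cycle.

```clojure
;; Hypothetical model: an op holds its source form and its captured locals.
(defn make-op [source env]
  {:source source :env env})

;; Hypothetical model: a query holds its generator and the ops applied to it.
(defn make-query [generator ops]
  {:generator generator :ops ops})

(defn node-count
  "Number of nodes a naive recursive walk of x would visit."
  [x]
  (count (tree-seq coll? seq x)))

(defn prune-env
  "Keeps only the captured locals whose symbols appear in the source form."
  [source env]
  (let [used (set (filter symbol? (tree-seq coll? seq source)))]
    (select-keys env used)))

(def config {:a 1 :b 2})
(def data [[:a] [:b] [:c]])

;; q1: the anonymous op captures every local in scope, needed or not.
(defn q1 [mygen conf-map]
  (make-query mygen
              [(make-op '(fn [x] (get conf-map x 0))
                        {'mygen mygen 'conf-map conf-map})]))

;; q2: the op only uses `n`, but the unpruned capture drags in `g`,
;; which is all of q1 including q1's own captured environment.
(defn q2 [g]
  (make-query g
              [(make-op '(fn [n] (* n 2))
                        {'g g})]))

(def inner (q1 data config))
(def outer (-> data (q1 config) q2))

;; The inner query appears twice in the outer one: once as the generator
;; and once inside the captured environment of the anonymous op.
(println "inner nodes:" (node-count inner))
(println "outer nodes:" (node-count outer))

;; Pruning against the source form drops `g` entirely for q2's op.
(println "pruned env for q2's op:" (prune-env '(fn [n] (* n 2)) {'g inner}))
```

In this model each additional threaded query that uses an anonymous operation more than doubles the structure to be walked, and pruning by referenced symbols removes the repeated copy.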