nathanmarz / cascalog

Data processing on Hadoop without the hassle.

Make parallel buffers able to optionally use a bufferiter for the final aggregation #240

Closed: funkenblatt closed this 10 years ago

funkenblatt commented 10 years ago

I'm running into a problem where the final reduce-side buffered aggregation for a parallel buffer operation is running out of memory. This should help me with that. I'm not sure what the API for it should look like, though.

sritchie commented 10 years ago

Does defbufferiterfn help?

https://github.com/nathanmarz/cascalog/blob/develop/cascalog-core/src/clj/cascalog/api.clj#L294

funkenblatt commented 10 years ago

defbufferiterfn would work, but for my particular case it'd be nice to get some of the work done on the map side.

funkenblatt commented 10 years ago

Meh. I think I might just use the low-level Cascading DSL for this instead of attempting to integrate it into the parallel buffer stuff.