twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

Spark improvements #1868

Closed johnynek closed 5 years ago

johnynek commented 5 years ago

This gets all of the Spark planner working, which is to say, we can convert a TypedPipe into a spark_backend.Op.

A follow-up PR has the rest of it: the Spark Writer, which manages the Execution rendering.

This leverages #1867, which was very useful for implementing the join and grouping operations without materializing into memory (at the cost of sorting instead).

This could later be a Config option that controls whether join or mapGroup uses the sort-based or the memory-based approach, but I chose to be conservative here and use the more scalable, lower-memory version (sorting).
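
To make the trade-off concrete, here is a rough sketch of the two grouping strategies. It uses plain Scala collections rather than Spark, and every name in it (`GroupingSketch`, `memoryGrouped`, `sortGrouped`) is hypothetical; it is not the code in this PR or in #1867. The assumption is that, in the real backend, the sort would be done by the shuffle (which can spill to disk), so only the per-key streaming step is what the sketch is meant to show.

```scala
// Illustration only: a Seq stands in for one partition of key/value rows.
// None of these names come from scalding's spark_backend.
object GroupingSketch {

  // Memory-based: groupBy builds a Map holding every value of every key
  // at the same time before fn sees any of them.
  def memoryGrouped[K, V, R](rows: Seq[(K, V)])(fn: (K, Iterator[V]) => R): Map[K, R] =
    rows.groupBy(_._1).map { case (k, kvs) => (k, fn(k, kvs.iterator.map(_._2))) }

  // Sort-based: once rows are sorted by key, equal keys are adjacent, so
  // each group can be handed to fn as a streaming Iterator.
  // (Here the sort itself is in memory; in Spark the shuffle would do it.)
  def sortGrouped[K, V, R](rows: Seq[(K, V)])(fn: (K, Iterator[V]) => R)(
      implicit ord: Ordering[K]): Iterator[(K, R)] = {
    val sorted = rows.sortBy(_._1).iterator.buffered
    new Iterator[(K, R)] {
      def hasNext: Boolean = sorted.hasNext
      def next(): (K, R) = {
        val key = sorted.head._1
        // View over the run of rows that share the current key.
        val values = new Iterator[V] {
          def hasNext: Boolean = sorted.hasNext && ord.equiv(sorted.head._1, key)
          def next(): V = sorted.next()._2
        }
        val result = fn(key, values)
        // Skip anything fn left unread so the next group starts cleanly.
        while (values.hasNext) values.next()
        (key, result)
      }
    }
  }
}
```

With `rows = Seq(("b", 2), ("a", 1), ("b", 3), ("a", 4))` and a summing `fn`, both functions produce the sums `a -> 5` and `b -> 5`; the sort-based one emits them in key order and never holds a whole group in a collection.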

cc @ianoc @non

johnynek commented 5 years ago

@ianoc take a look, I've addressed these two concerns.

johnynek commented 5 years ago

PS: please don't click merge on this. I'll probably merge the second one into this then squash both when they are good.

ianoc commented 5 years ago

lgtm