twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars, 706 forks

Add more structure to the Spark backend #1844

johnynek closed 6 years ago

johnynek commented 6 years ago

This follows the MemoryBackend pattern of introducing an Op type that we plan onto. In the Spark backend, an Op is essentially a function that takes a SparkContext and an ExecutionContext and produces a Future of an RDD.

This has the nice property that we don't take the SparkContext when we are planning, only when running.
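
To illustrate the idea (not the code in this PR), here is a minimal sketch of the pattern. It assumes stand-in types so it runs without Spark on the classpath: `Ctx` plays the role of SparkContext and `Data[A]` (a plain Vector) plays the role of an RDD. The names `OpSketch`, `Ctx`, `Data`, `Source`, and `Mapped` are hypothetical.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object OpSketch {
  // Hypothetical stand-ins: Ctx for SparkContext, Data[A] for RDD[A].
  final class Ctx(val name: String)
  type Data[A] = Vector[A]

  // An Op only describes work. Building one never touches a Ctx;
  // the context is supplied when `run` is called.
  sealed trait Op[A] {
    def run(ctx: Ctx)(implicit ec: ExecutionContext): Future[Data[A]]
    def map[B](fn: A => B): Op[B] = Op.Mapped(this, fn)
  }

  object Op {
    // Leaf node: loads data once a context is available.
    final case class Source[A](load: Ctx => Data[A]) extends Op[A] {
      def run(ctx: Ctx)(implicit ec: ExecutionContext): Future[Data[A]] =
        Future(load(ctx))
    }

    // Transformation node: runs its input, then applies fn to each element.
    final case class Mapped[A, B](input: Op[A], fn: A => B) extends Op[B] {
      def run(ctx: Ctx)(implicit ec: ExecutionContext): Future[Data[B]] =
        input.run(ctx).map(_.map(fn))
    }
  }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global

    // "Planning": no Ctx exists yet, we only build the Op tree.
    val plan: Op[Int] = Op.Source[Int](_ => Vector(1, 2, 3)).map(_ * 2)

    // "Running": the context is created and passed in only here.
    val result = Await.result(plan.run(new Ctx("local")), 5.seconds)
    println(result) // prints Vector(2, 4, 6)
  }
}
```

Because the plan is just a value, it can be built, inspected, or optimized before any context is created.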

Second, I filled in the other missing pieces: the SparkWriter, which manages writes as we evaluate Executions, and the mapping of sources and sinks.

I think we can finish the rest in 2-3 follow-up PRs:

  1. implement the writer
  2. finish the planner

Note that the writer and planner work can proceed in parallel, so I can just fork myself and finish faster.

johnynek commented 6 years ago

@fwbrasil @ianoc can you take a look?

ianoc commented 6 years ago

Not sure what start and finish are for in the writer trait, but LGTM.