twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

scalding-beam improvements on resulting DAG #1957

Closed tlazaro closed 2 years ago

tlazaro commented 2 years ago

We haven't tackled toIterableExecution and forceToDiskExecution yet. We are still trying to get a successful run of one specific, fairly large job.

tlazaro commented 2 years ago

Improved CoGroup so that TupleToKv no longer shows up in the top-level UI and instead lives inside CoGroup's outer PTransform.

Added caching to BeamOp, following @johnynek's advice. New implementations should be built on the cache by default, which prevents mistakes.
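To illustrate the idea behind the caching, here is a minimal sketch in plain Scala. It is not the scalding-beam code: `Op`, `Source`, `Merge`, and `Builder` are hypothetical stand-ins, and the "build" step just renders a string. The point it shows is that a node reachable through several paths is built once and reused.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for operation nodes; these are NOT the scalding-beam types.
sealed trait Op { def name: String }
final case class Source(name: String) extends Op
final case class Merge(name: String, left: Op, right: Op) extends Op

object OpCacheSketch {
  final class Builder {
    // Results keyed by node (case-class equality), so a shared node is built once.
    private val cache = mutable.Map.empty[Op, String]

    // Number of nodes that were actually built rather than served from the cache.
    var builds: Int = 0

    def run(op: Op): String = cache.get(op) match {
      case Some(built) => built
      case None =>
        builds += 1
        val built = op match {
          case Source(n)      => n
          case Merge(n, l, r) => s"$n(${run(l)}, ${run(r)})"
        }
        cache.update(op, built)
        built
    }
  }
}
```

If every implementation goes through `run`, a new node type cannot accidentally skip the cache, which is the kind of mistake the default is meant to prevent.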

Added very shallow tests for the caching. Our goal should be to test the structure of the Pipeline in code, instead of inspecting it visually in the Dataflow UI. The Pipeline DAG is a bit tedious to work with; the way forward would be to use the visitor pattern it exposes and perform mutations to build a better DAG instance, on which we could perform better analysis.
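As a rough sketch of what a structural check could start from, the snippet below walks a Beam `Pipeline` with `traverseTopologically` and collects the full names of its transforms. `PipelineStructure` and `transformNames` are hypothetical names, not part of this PR, and the sketch assumes the Beam Java SDK core is on the classpath.

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.Pipeline.PipelineVisitor
import org.apache.beam.sdk.runners.TransformHierarchy
import scala.collection.mutable.ListBuffer

// Hypothetical helper, not part of scalding-beam.
object PipelineStructure {
  // Returns the full names of all composite and primitive transforms,
  // in the order the visitor reaches them (the unnamed root node is skipped).
  def transformNames(pipeline: Pipeline): List[String] = {
    val names = ListBuffer.empty[String]
    pipeline.traverseTopologically(new PipelineVisitor.Defaults {
      override def enterCompositeTransform(
          node: TransformHierarchy.Node): PipelineVisitor.CompositeBehavior = {
        if (!node.isRootNode) names += node.getFullName
        // Keep descending so nested transforms are visited too.
        PipelineVisitor.CompositeBehavior.ENTER_TRANSFORM
      }

      override def visitPrimitiveTransform(node: TransformHierarchy.Node): Unit =
        names += node.getFullName
    })
    names.toList
  }
}
```

A test could then assert on these names, for example that an inner step appears only under its enclosing composite and never at the top level.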

tlazaro commented 2 years ago

We can't afford to do proper structural testing now, though I'm sure it will come back as a problem later.