twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0

move the typed API into a scalding-typed package with few dependencies #1737

Open johnynek opened 6 years ago

johnynek commented 6 years ago

The goal here is to extract the Typed API of scalding as a generic dataflow AST.

With this, we have a clean API to make scalding backends without depending on cascading (so directly on spark, apache beam, in-memory on AWS, etc...)

The challenges here are several:

1) How to remove the cascading-dependent APIs (those that take Flows or Pipes) without breaking existing code. I think a solution here is implicit enrichment: we could enrich TypedPipe in Job with an implicit that adds the cascading functions back in (a rough sketch follows this list).

2) How to allow scalding Jobs to run on different platforms. I think what we want to do is look through a FlowDef along with the FlowStateMap and see if we can convert the FlowDef back into a set of TypedPipes plus writes of those into sinks. If we can definitely safely do the conversion, it succeeds, else we report failure. After that, we convert this batch of writes to an Execution. After running the execution, we call Job.next and repeat the process.

3) Address the most common input-output formats in a spark backend. So, users should be able to use TSV, CSV, SequenceFiles, and Parquet and run with spark. If we can't do this, multi-backend may be too cumbersome to actually use.
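To make (1) concrete, here is a minimal sketch of the enrichment idea. The object name and the exact `toPipe` signature are illustrative only; the real ones would be settled during the extraction.

```scala
import cascading.flow.FlowDef
import cascading.pipe.Pipe
import cascading.tuple.Fields
import com.twitter.scalding.Mode
import com.twitter.scalding.typed.TypedPipe

// scalding-typed defines TypedPipe with no cascading methods at all;
// the cascading backend adds them back via enrichment, so existing Jobs
// keep compiling once this implicit is in scope (e.g. imported by Job).
object CascadingExtensions {
  implicit class CascadingTypedPipeOps[T](val tp: TypedPipe[T]) extends AnyVal {
    // Plan this TypedPipe into a cascading Pipe; only the cascading backend
    // module would provide a real implementation of this.
    def toPipe(fields: Fields)(implicit fd: FlowDef, mode: Mode): Pipe =
      ??? // delegate to the cascading planner in scalding-core
  }
}
```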

daniel-sudz commented 2 years ago

@johnynek I've looked at this for a bit and I have a general question:

it seems like you want to keep cascading support within the multi-backend model. Personally I would prefer to drop cascading support if we provide first-class support for a spark backend.

With a backend like spark, we get free compatibility guarantees with the hadoop version. With cascading, however, there is an additional maintenance burden: changes have to be pushed upstream to cascading with each hadoop release to keep it compatible.

It also seems to me that there isn't much trust in upgrading cascading versions, due to fear of things breaking or introducing new bugs. That's valid, but it means the cascading backend will keep dragging the hadoop version behind, and people won't be able to use scalding in newer hadoop environments like AWS EMR 6.

I understand this would be a major breaking change so I'm curious on your thoughts on this as well.

johnynek commented 2 years ago

Well, this issue is rather old.

Since it was created, there have been 4 backends implemented: cascading 2, pure scala with Futures, spark, and beam. So the current architecture can work without a binary-incompatible break.

Next: I don't personally use scalding anymore although I do review code and offer some guidance on design questions.

Last, I left Twitter 6 years ago, but I understand there are still many scalding jobs. They are in the process of moving some or maybe all to beam (either directly or on top of scalding-beam). It's not clear to me.

I think making scalding-typed independent of cascading would be interesting, but I think a difficult migration would likely mean few would adopt it.

If such a redesign could be done such that almost all jobs didn't need any source code change, I think it is worth it.

daniel-sudz commented 2 years ago

thanks for the info, a small followup on these backends:

for writing jobs against a backend like scalding-beam or scalding-spark, would one just import directly against these backends? So in a sense, they are separate projects that bypass the cascading implementation in scalding-core?

also, is there anyone left as a point of contact from Twitter? It's hard to tell by github usernames which company someone is affiliated with.

johnynek commented 2 years ago

We should write documentation I guess but you can look at the tests in those repos for examples.

You will still have cascading on the class path but it won't be exercised.
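Roughly, the shape is: the job is written purely against the typed API as an Execution, and only the final run step picks a backend. A minimal sketch below, assuming a local spark setup; the `SparkMode` entry point named in the comment is an assumption, so check the scalding-spark tests for the real call.

```scala
import com.twitter.scalding.{ Config, Execution }
import com.twitter.scalding.typed.TypedPipe

object WordCountJob {
  // The pipeline only touches the typed API, so nothing here is
  // backend-specific (and nothing cascading gets exercised at runtime).
  val wordCount: Execution[Iterable[(String, Long)]] =
    TypedPipe.from(List("a b", "b c"))
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1L))
      .sumByKey
      .toTypedPipe
      .toIterableExecution

  def main(args: Array[String]): Unit = {
    // The backend is chosen only here, by the Mode handed to the run, e.g.:
    //   val sc = new org.apache.spark.SparkContext("local[*]", "wordcount")
    //   wordCount.waitFor(Config.empty, SparkMode.default(sc))  // hypothetical entry point
    ()
  }
}
```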

daniel-sudz commented 2 years ago

yeah, I've seen the tests; they are pretty good as a reference for now

I guess the problem I have is that if cascading is on the classpath, then hadoop2 is on the classpath too. I really would like to have some way of getting support for hadoop3.

I don't think it's possible to achieve this while keeping binary/source compatibility. For one thing, spark3 and hadoop3 are compatible with scala 2.12 or 2.13 only. And over time I'm sure that 2.12 will eventually be dropped as well; probably not this year, but it will happen:

https://spark.apache.org/docs/latest/

Spark runs on Java 8/11, Scala 2.12/2.13, Python 3.6+ and R 3.5+. Python 3.6 support is deprecated as of Spark 3.2.0. Java 8 prior to version 8u201 support is deprecated as of Spark 3.2.0. For the Scala API, Spark 3.2.1 uses Scala 2.12. You will need to use a compatible Scala version (2.12.x).

And then secondly, there would need to be some dependency management magic to exclude all of cascading's transitive dependencies from scalding-spark, due to the above issue with hadoop (something like the sbt sketch below). Then we would have a fork-island within scalding: basically separate maintenance for different projects (the "old" scalding/cascading and the "new" scalding/spark and scalding/beam).
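To make the "dependency management magic" concrete, I mean something along these lines in sbt; purely illustrative (the scalding version is arbitrary, and whether the resulting classpath actually works against hadoop3 is exactly the open question):

```scala
// build.sbt sketch: depend on scalding-core but strip cascading and the
// hadoop2 jars it drags in; not a tested configuration.
libraryDependencies += ("com.twitter" %% "scalding-core" % "0.17.4")
  .excludeAll(
    ExclusionRule(organization = "cascading"),
    ExclusionRule(organization = "org.apache.hadoop")
  )
```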

If these are separate projects with separate dependencies, it makes sense to split them into different repos, because code re-use would slowly disappear with dependency conflicts over time. Trying to support both hadoop2 and hadoop3 globally is a losing battle (cascading had a similar problem before).

I think my options are:

  1. get agreement to release a major version of scalding with potentially many breaking changes by removing cascading. Then we can proceed to upgrade to hadoop3. (seems like low appetite? but long-term the best option for maintenance IMO)
  2. fork scalding and deal with the mess somewhere else
  3. commit to ironing out the upgrade towards cascading 4.5-WIP, potentially fixing upstream bugs (seems like low appetite as well, and I don't think this is the right move).
  4. write yet another "backend" for hadoop3 and then port my scalding jobs to import against this backend. Do some careful dependency management to remove any cascading/hadoop dependencies from the rest of scalding. Upgrading the existing spark backend to spark3 is probably easier and would accomplish the same thing.