twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Throw an exception when improperly using Executions inside of Job subclasses #1616

Open isnotinvain opened 7 years ago

isnotinvain commented 7 years ago

I'm not entirely sure if this is possible, but this confuses our users a lot. It'd be great if we could detect that an execution has been used from inside a job (maybe somewhere in the execution code, or maybe somewhere in the Job.buildFlow() / validate() code?) and make that an error.

This way users will at least get quick feedback about this common pitfall.

oscar-stripe commented 7 years ago

I assume you mean someone does something, but ignores that executions are lazy and does not wire it into ExecutionJob?

isnotinvain commented 7 years ago

Yeah like:

class MyJob(args: Args) extends Job(args) {
  val x = TypedPipe.from(SomeSource()).writeExecution(someSink)
}

A common one we see is users calling .toIterableExecution.waitFor(...) which actually sort of works? But IIRC there are cases where the wrong flow / confnig gets passed to the .waitFor() right? I don't remember the specifics but I think there were cases where you get unexpected results when using your Job's flow / mode / config in an Execution (does that sound like something you know of?)

johnynek commented 7 years ago

You can't use your jobs flowdef. Execution is safe in that regard.

I don't know of any issue now (but there may be some). The main confusing issue is if people don't make sure all the executions are run (and really ideally only one final execution is run). On Mon, Oct 31, 2016 at 17:23 Alex Levenson notifications@github.com wrote:

Yeah like:

class MyJob(args: Args) extends Job(args) { val x = TypedPipe.from(SomeSource()).writeExecution(someSink) }

A common one we see is users calling .toIterableExecution.waitFor(...) which actually sort of works? But IIRC there are cases where the wrong flow / confnig gets passed to the .waitFor() right? I don't remember the specifics but I think there were cases where you get unexpected results when using your Job's flow / mode / config in an Execution (does that sound like something you know of?)

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/twitter/scalding/issues/1616#issuecomment-257480258, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJdufhYyh5Fjg5ihpDJU9eWyTXoKRCks5q5rCrgaJpZM4Kltzs .

isnotinvain commented 7 years ago

Maybe it's the running of > 1 execution that's the problem?

Something like:

class MyJob extends Job {
  val x = TypedPipe.from(Foo()).toIterableExecution.waitFor(...)
  val y = TypedPipe.from(Bar()).toIterableExecution.waitFor(...)
  TypedPipe.from(Baz()).map.filter.write(sink)
}

I can't remember, but we definitely had issues a while back relating to .waitFor() inside of Job picking up some of the Job's config or mode or flowdef or something and that leading to issues.

Are there issues when you run many executions that don't all tie back to a single root execution?

isnotinvain commented 7 years ago

If this isn't a problem -- users seem to really want to be able to mix some executions + waitFor into their Job classes. Maybe that's actually OK to do?

oscar-stripe commented 7 years ago

why do they want to not use Execution the whole way?

Anyway, like I said, I don't know what the issue may have been in the past. Reproductions always welcome.

isnotinvain commented 7 years ago

why do they want to not use Execution the whole way?

They don't even know there are 2 distinct APIs. They have a Job class. They find a method called .toIteratorExecution or .writeExecution and it seems like what they need so they use it. That's what's unfortunate about it. It compiles, and it quietly isn't really what they needed / wanted.

johnynek commented 7 years ago

Well, the same if true of futures. They need to understand Execution if they call methods that return Execution. Calling methods you don't understand is dangerous. On Tue, Nov 1, 2016 at 10:36 Alex Levenson notifications@github.com wrote:

why do they want to not use Execution the whole way?

They don't even know there are 2 distinct APIs. They have Job class. They find a method called .toIteratorExecution or .writeExecution and it seems like what they need so they use it. That's what's unfortunate about it. It compiles, and it quietly isn't really what they needed / wanted.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/twitter/scalding/issues/1616#issuecomment-257687682, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJdg5bYJOO9hD_jZB1vZTxWQdBXUGfks5q56LggaJpZM4Kltzs .

isnotinvain commented 7 years ago

Well, Futures actually "fire" without you .run()-ing them. so they do at least execute. Not that that'd be better here, they'd be executing in strange order / timing.

isnotinvain commented 7 years ago

Here's a crazy idea, not sure if it's really doable, it'd be a big breaking change but I think it might make things more convenient for users in the log run:

1) remove or deprecate .write() and .writeExecution() from TypedPipe (and all other methods that come in both a Job / Execution style) 2) Add .write() to a TypedPipe enrichment that is in scope in Job 3) Add .writeExecution() to a TypedPipe enrichment that is in scope in ExecutionApp

Then you'd only have access to the right "set" of methods for the context you're working in. Probably would make it harder to discover the .write / .writeExecution methods though.

benpence commented 7 years ago

@isnotinvain Perhaps we should start keeping a tally of how much time it has cost us to solve these problems that users have and if it gets above some threshold, try this change.

I like this idea, although maybe the same people will accidentally find the wrong method anyway?

oscar-stripe commented 7 years ago

if we did this, it would kill the ability to do intermediate writes in method. I would be pretty negative on this as it would break a lot of our jobs (we use Execution style pretty heavily in a lot of new code).

I like making code such that it is harder to use wrong and still have it compile, but this proposal currently seems to break a lot of common use cases just to protect users who probably will have other errors if they misunderstand this much about the code (i.e. are using Execution without reading the docs).

One thing you have to keep in mind: you don't hear much from people using the code everyday in a way that works for them. So, it is hard to guess the impact of hurting those users to help relieve a support burden.

I'm really confused here. How often did someone use .writeExecution instead of .write and how much time did it cost you to help them? This page: http://twitter.github.io/scalding/#com.twitter.scalding.typed.TypedPipe says:

This is the functionally pure approach to building jobs. Note, that you have to call run on the result or flatMap/zip it into an Execution that is run for anything to happen here.

You can only help so much if the person won't read the docs or code, in my view. I guess they were just doing intellij autocomplete and it looked good enough?

isnotinvain commented 7 years ago

It comes up a few times a week I'd say. Keep in mind, a large number of scalding users are doing analytics not as a core part of their job but as one of many steps on the path to doing something else (like feature development). Making a casual users's life easier is generally a good idea I think.

I would not want to do anything that's a huge breaking change. My idea was to do something source (but not binary) compatible.

What does "it would kill the ability to do intermediate writes in method" mean? I can see how this would mess up helper methods that use executions. That's probably a non starter. Maybe the reverse approach would make more sense, find a way (limited only to the Job codepath) that detects use of executions and warns / makes that fatal.

oscar-stripe commented 7 years ago

We don't usually write our code inside an ExecutionApp. We use it like you would any library. So, we want, inside some method somewhere to call all the methods on the ExecutionApp.

What if you guys just had some linter that grepped the code or parsed the class files? It is not that hard to write a program that runs with ASM and just looks for any methods like "writeExecution" in a constructor. That might cover most of your cases.

isnotinvain commented 7 years ago

That may work.

I guess you could potentially have the reverse problem then? A user may mistakenly call .write() inside a helper function when they should have called .writeExecution()?

johnynek commented 7 years ago

you can't call .write without an implicit FlowDef and Mode so that limits the cases where you can accidentally do it.

Look, I mean, this is not Idris, we can't get it to where if it compiles it is correct.