snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com
19 stars 8 forks source link

0.3.0 #25

Closed BenFradet closed 7 years ago

BenFradet commented 7 years ago
BenFradet commented 7 years ago

@alexanderdean Ready for a review

Otherwise, it's ready to go given:

BenFradet commented 7 years ago

@alexanderdean I think I've taken care of everything

alexanderdean commented 7 years ago

LGTM! Can we get an -rc1 of this into Bintray please?

BenFradet commented 7 years ago

rc1 is on bintray

alexanderdean commented 7 years ago

Please:

BenFradet commented 7 years ago

rc2 published nvm last commit introduced a bug

BenFradet commented 7 years ago

rc3 should land in bintray soon :tm:

alexanderdean commented 7 years ago

Hey @BenFradet - apologies for the delay in thinking of this requirement, but we need to come up with a dedicated return code (e.g. 3 or 4) for the scenario that the job cannot run because there is an existing lock in place. It might also make sense to make any informational message in this scenario a WARN rather than an ERROR.

Why? This is so that a job runner tool like Factotum can distinguish between a hard failure (e.g. EMR job failed) and a no-op situation (job could not start because existing job is running). That distinction is super-important because we cannot afford to raise a pager alert for every single no-op - to put it another way, if my EMR job takes 15 minutes and I schedule the job every 5 minutes, then I expect 8 no-ops an hour.

Sorry for the delayed requirement!

BenFradet commented 7 years ago

Should this be a new issue since the exit code is always 0 atm?

alexanderdean commented 7 years ago

Hmm - I think this is covered by the lock ticket - as it's just a property of the lock handling?

BenFradet commented 7 years ago

don't you need exit code = 1 for a true failure?

alexanderdean commented 7 years ago

Oh wow, sorry I just saw what you meant. So Dataflow Runner can never return non-zero?! :godmode:

BenFradet commented 7 years ago

not according to https://github.com/urfave/cli#exit-code

alexanderdean commented 7 years ago

Ouch - new ticket then please, this is a bit of a showstopper!

BenFradet commented 7 years ago

my bad the logging library (!!) applies exit code https://github.com/sirupsen/logrus#level-logging

alexanderdean commented 7 years ago

Hmm interesting, can we still do the exit codes in the other ticket though? I don't think we necessarily need an independent ticket though.

BenFradet commented 7 years ago

As your request has shown this isn't viable in the long term (delegating exit code to the logging library). As a result, I think this ticket is warranted because I'm currently moving away from the above.

alexanderdean commented 7 years ago

Cool thanks

BenFradet commented 7 years ago

@alexanderdean published rc4 :+1:

alexanderdean commented 7 years ago

More edgecases - timeWithFormat should be able to take epoch as a string. This is because all of the incoming variables provided by --vars are stringly typed. Here's my error:

template: playbook.avro:37:53: executing "playbook.avro" at <.epochTime>: wrong type for value; expected int64; got string

There could still be a usecase for supporting an int64 as well, but really we could live without it (because we can always put "" around a literal value).

alexanderdean commented 7 years ago

Me again! I think I am seeing an exit code of 17 if there is an error creating the lock file (e.g. because the directory of the lock does not exist). A quick look at the code suggests that Dataflow Runner exits 17 if there is any problem with creating the lock, not just if the lock file is already there?

BenFradet commented 7 years ago

Here's what I currently have with rc5:

df

alexanderdean commented 7 years ago

Thanks @BenFradet, I will give this one a spin. What does "Lock already hel" mean - why is this truncated?

BenFradet commented 7 years ago

It was a typo during the tests which is fixed in the published artifact, it should just say Lock already held.

alexanderdean commented 7 years ago

Cool thanks

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.02%) to 93.519% when pulling 0aa70ecc1c35351818dac012c17ea2f81aa83cb5 on feature/0.3.0 into 5d07bcef7bf51cf7b444265ca8e9d4bd769c1e61 on master.