snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Fix misleading status of failed job #21

Open chuwy opened 7 years ago

chuwy commented 7 years ago

When a step fails on a transient run, the EMR console shows that the cluster was terminated by user request, which is misleading.

jbeemster commented 7 years ago

Not sure this can be changed. To get a different message you would need to allow a step to terminate the cluster, which it is currently not allowed to do.

chuwy commented 7 years ago

Not sure I understand: the step does in fact terminate the cluster (so I assume it is allowed to do this), just with the wrong (successful) message.

jbeemster commented 7 years ago

Dataflow Runner explicitly disallows steps from terminating the cluster. The only allowed step actions are listed here: https://github.com/snowplow/dataflow-runner/blob/master/src/job_flow_steps.go#L122

So the steps will stop running if you have CANCEL_AND_WAIT and a failing step, after which you move on to cluster termination. That termination is handled manually through the API, hence the User Terminated message.

Agreed that it is the wrong message!

Two options for getting the correct message would be to allow different step failure actions, or to submit a final step that has to fail in order to terminate the cluster.
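
To make the mechanism above concrete, here is a minimal Go sketch. It is not the actual Dataflow Runner code: the `Step` type, `validateStep`, and `runTransient` are hypothetical names, and the allowed set of `CANCEL_AND_WAIT` and `CONTINUE` is an assumption based on the statement that steps may not terminate the cluster. `USER_REQUEST` is the EMR state change reason code that corresponds to a termination requested through the API.

```go
package main

import (
	"errors"
	"fmt"
)

// allowedActions is an ASSUMED whitelist: the thread only states that
// actions which terminate the cluster are rejected.
var allowedActions = map[string]bool{
	"CANCEL_AND_WAIT": true,
	"CONTINUE":        true,
}

// Step is a hypothetical, simplified playbook step.
type Step struct {
	Name            string
	ActionOnFailure string
	Fails           bool // simulated outcome, for illustration only
}

// validateStep rejects any failure action outside the whitelist,
// for example TERMINATE_CLUSTER.
func validateStep(s Step) error {
	if !allowedActions[s.ActionOnFailure] {
		return fmt.Errorf("step %q: action on failure %q is not allowed",
			s.Name, s.ActionOnFailure)
	}
	return nil
}

// runTransient simulates a transient run. It returns the termination
// reason EMR would record plus an error if a step failed.
func runTransient(steps []Step) (string, error) {
	// Validate everything up front; no cluster is started on bad input.
	for _, s := range steps {
		if err := validateStep(s); err != nil {
			return "", err
		}
	}

	var runErr error
	for _, s := range steps {
		if !s.Fails {
			continue
		}
		if s.ActionOnFailure == "CANCEL_AND_WAIT" {
			// Remaining steps are cancelled; the cluster itself stays up.
			runErr = errors.New("step " + s.Name + " failed")
			break
		}
		// CONTINUE: the failure is ignored and later steps still run.
	}

	// The runner, not a step, terminates the cluster through the API,
	// so the recorded reason is the same on success and on failure.
	return "USER_REQUEST", runErr
}

func main() {
	reason, err := runTransient([]Step{
		{Name: "enrich", ActionOnFailure: "CANCEL_AND_WAIT", Fails: true},
		{Name: "load", ActionOnFailure: "CANCEL_AND_WAIT"},
	})
	fmt.Println("termination reason:", reason, "error:", err)
}
```

In this sketch the recorded reason cannot distinguish a failed run from a successful one, which is the misleading status this issue describes. Permitting a terminating failure action in `allowedActions` would let EMR itself end the cluster and record a step failure instead.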

chuwy commented 7 years ago

Thanks! It makes sense now.

alexanderdean commented 7 years ago

It's a bit confusing in the EMR UI, but the core design is working well for now. See also: https://github.com/snowplow/dataflow-runner/issues/3