Explore options around specifying ways to react to step failures

snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR

http://snowplowanalytics.com


Explore options around specifying ways to react to step failures #15

Open BenFradet opened 7 years ago

BenFradet commented 7 years ago

See discussion in #11

alexanderdean commented 7 years ago

The idea is that some step "failures" may indicate a no-op rather than a failure per se. For example, if your jobflow step simply checks whether there is new data in HBase or DynamoDB to process, then you probably do want the jobflow to terminate, but you don't want to bubble up an overall jobflow failure to PagerDuty.

It would be particularly interesting if we could somehow get different return codes from jobflow steps, so that we can distinguish between dynamodb_has_new_data.jar genuinely failing and it reporting that there is no new data to process.

BenFradet commented 7 years ago

From my tests, it seems the return code from a step isn't reported back up the EMR chain.

Cf. StepStateChangeReason's Code, which is always None.

alexanderdean commented 7 years ago

Is there any way of capturing a message?

BenFradet commented 7 years ago

I haven't dug into the message yet. It might be capturing stderr; I'll have to try that out.

What we do in EER is inspect which step resulted in a failure and, if it's a no-op-detecting step, respond appropriately. Unfortunately, that's not really generic.

alexanderdean commented 7 years ago

The generic version of what we do in EER is to add a Factotum-like behavior property to each jobflow step definition:

{
  "jarfile": "dynamodb_has_new_data.jar",
  "action_on_failure": "TERMINATE_WITH_FAILURE" <<default>> | "TERMINATE_WITH_SUCCESS"
}
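
To make the proposal concrete, here is a minimal sketch of how a runner might turn a per-step action_on_failure into an overall jobflow outcome. It is an illustration only, not dataflow-runner's implementation: StepDef, jobflow_outcome and the "SUCCESS"/"FAILURE" strings are hypothetical names, and process.jar is a made-up second step.

```python
from dataclasses import dataclass

# Values taken from the proposal above; they are not an existing
# dataflow-runner or EMR option.
TERMINATE_WITH_FAILURE = "TERMINATE_WITH_FAILURE"
TERMINATE_WITH_SUCCESS = "TERMINATE_WITH_SUCCESS"


@dataclass
class StepDef:
    """Hypothetical jobflow step definition."""
    jarfile: str
    # TERMINATE_WITH_FAILURE is the default, as in the proposal.
    action_on_failure: str = TERMINATE_WITH_FAILURE


def jobflow_outcome(steps, failed_jarfile=None):
    """Return "SUCCESS" or "FAILURE" for the whole jobflow.

    failed_jarfile is the jarfile of the step that failed, or None if
    every step completed. Either way the jobflow terminates; only the
    outcome reported to monitoring differs.
    """
    if failed_jarfile is None:
        return "SUCCESS"
    for step in steps:
        if step.jarfile == failed_jarfile:
            if step.action_on_failure == TERMINATE_WITH_SUCCESS:
                # The failed step was a no-op detector: do not alert.
                return "SUCCESS"
            return "FAILURE"
    raise ValueError("unknown step: " + failed_jarfile)


steps = [
    StepDef("dynamodb_has_new_data.jar", TERMINATE_WITH_SUCCESS),
    StepDef("process.jar"),  # hypothetical step using the default
]
print(jobflow_outcome(steps, "dynamodb_has_new_data.jar"))
print(jobflow_outcome(steps, "process.jar"))
```

With this shape, the no-op knowledge lives in the step definition rather than in runner code that special-cases particular steps.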

BenFradet commented 7 years ago

That combined with a way to provide feedback (maybe through StepStateChangeReason's Message) would solve our issue, indeed.

alexanderdean commented 7 years ago

Yes - fingers crossed for StepStateChangeReason's Message being usable!

BenFradet commented 7 years ago

Unfortunately, EMR doesn't pick up anything from a script step.

{  
   "Step":{  
      "ActionOnFailure":"CANCEL_AND_WAIT",
      "Config":{  
         "Args":[  
            "s3://snowplow-hosted-assets-eu-central-1/common/emr/snowplow-check-dir-empty.sh",
            "s3://ben-test-output/processing/raw/"
         ],
         "Jar":"s3://eu-central-1.elasticmapreduce/libs/script-runner/script-runner.jar",
         "MainClass":null,
         "Properties":{  

         }
      },
      "Id":"s-14DXZAQ9JXDYD",
      "Name":"Checking that s3://ben-test-output/processing/raw/ is empty",
      "Status":{  
         "FailureDetails":{  
            "LogFile":"s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
            "Message":null,
            "Reason":"Unknown Error."
         },
         "State":"FAILED",
         "StateChangeReason":{  
            "Code":null,
            "Message":null
         },
         "Timeline":{  
            "CreationDateTime":"2017-03-23T18:24:26Z",
            "EndDateTime":"2017-03-23T18:24:48Z",
            "StartDateTime":"2017-03-23T18:24:44Z"
         }
      }
   }
}

As a result, we could make do with TERMINATE_WITH_SUCCESS/TERMINATE_WITH_FAILURE, but we wouldn't have any feedback to give.
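
For reference, a small sketch of what a runner could extract from a step description shaped like the one above. The SAMPLE string is trimmed from that response; summarize_failure and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
import json

# Trimmed copy of the step description shown above.
SAMPLE = """
{
  "Step": {
    "Id": "s-14DXZAQ9JXDYD",
    "Status": {
      "FailureDetails": {
        "LogFile": "s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
        "Message": null,
        "Reason": "Unknown Error."
      },
      "State": "FAILED",
      "StateChangeReason": {"Code": null, "Message": null}
    }
  }
}
"""


def summarize_failure(describe_step_json):
    """Collect whatever feedback the step description carries."""
    status = json.loads(describe_step_json)["Step"]["Status"]
    reason = status.get("StateChangeReason") or {}
    details = status.get("FailureDetails") or {}
    return {
        "state": status["State"],
        "code": reason.get("Code"),
        # Prefer the state-change message, fall back to failure details.
        "message": reason.get("Message") or details.get("Message"),
        "reason": details.get("Reason"),
        "log_file": details.get("LogFile"),
    }


summary = summarize_failure(SAMPLE)
print(summary)
```

For this response, both the code and the message come back empty, so the only usable pointers are the generic Reason and the LogFile location.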

alexanderdean commented 7 years ago

Shame!

alexanderdean commented 7 years ago

Pushing back as I actually think no-ops are a bit of a red herring and we are better off with #17...

BenFradet commented 6 years ago

Pushing back.