Open BenFradet opened 7 years ago
The idea is that some step "failures" may indicate a no-op rather than a failure per se. For example, if your jobflow step simply checks whether there is new data in HBase or DynamoDB to process, then you probably do want the jobflow to terminate, but you don't want an overall jobflow failure to bubble up to PagerDuty.
It would be particularly interesting if we could somehow get different return codes from jobflow steps, so that we can distinguish between dynamodb_has_new_data.jar failing and reporting no new data to process.
From my tests, it seems the return code from a step isn't reported back up the EMR chain.
Cf. StepStateChangeReason's Code, which is always None.
Is there any way of capturing a message?
I haven't dug into the Message yet; it might be capturing stderr. I'll have to try that out.
What we do in EER is inspect which step failed and, if it's a no-op-detecting step, respond appropriately. Unfortunately, that's not really generic.
The generic version of what we do in EER is to add a Factotum-like behavior property to each jobflow step definition:
{
  "jarfile": "dynamodb_has_new_data.jar",
  "action_on_failure": "TERMINATE_WITH_FAILURE" <<default>> | "TERMINATE_WITH_SUCCESS"
}
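To make the proposal concrete, here is a minimal sketch of how a runner might turn per-step results into an overall jobflow outcome. It is only an illustration: `StepResult`, `jobflow_outcome` and the two constants are hypothetical names mirroring the property above, not an existing EMR or runner API, and it assumes the step state has already been fetched from EMR.

```python
from dataclasses import dataclass

# Hypothetical values mirroring the proposed property; not an existing EMR setting.
TERMINATE_WITH_FAILURE = "TERMINATE_WITH_FAILURE"
TERMINATE_WITH_SUCCESS = "TERMINATE_WITH_SUCCESS"


@dataclass
class StepResult:
    jarfile: str
    state: str  # EMR step state, e.g. "COMPLETED" or "FAILED"
    action_on_failure: str = TERMINATE_WITH_FAILURE  # default, as in the proposal


def jobflow_outcome(steps):
    """Return (exit_code, reason), decided by the first failed step, if any."""
    for step in steps:
        if step.state != "FAILED":
            continue
        if step.action_on_failure == TERMINATE_WITH_SUCCESS:
            # The step is a no-op detector: stop the jobflow, but report success.
            return 0, f"{step.jarfile} failed; treated as a no-op"
        return 1, f"{step.jarfile} failed"
    return 0, "all steps completed"
```

With this, a failed `dynamodb_has_new_data.jar` step marked `TERMINATE_WITH_SUCCESS` yields exit code 0, while the same failure with the default behavior yields 1.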
That combined with a way to provide feedback (maybe through StepStateChangeReason's Message) would solve our issue, indeed.
Yes - fingers crossed for StepStateChangeReason's Message being usable!
Unfortunately, EMR doesn't pick up anything from a script step:
{
  "Step": {
    "ActionOnFailure": "CANCEL_AND_WAIT",
    "Config": {
      "Args": [
        "s3://snowplow-hosted-assets-eu-central-1/common/emr/snowplow-check-dir-empty.sh",
        "s3://ben-test-output/processing/raw/"
      ],
      "Jar": "s3://eu-central-1.elasticmapreduce/libs/script-runner/script-runner.jar",
      "MainClass": null,
      "Properties": {}
    },
    "Id": "s-14DXZAQ9JXDYD",
    "Name": "Checking that s3://ben-test-output/processing/raw/ is empty",
    "Status": {
      "FailureDetails": {
        "LogFile": "s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
        "Message": null,
        "Reason": "Unknown Error."
      },
      "State": "FAILED",
      "StateChangeReason": {
        "Code": null,
        "Message": null
      },
      "Timeline": {
        "CreationDateTime": "2017-03-23T18:24:26Z",
        "EndDateTime": "2017-03-23T18:24:48Z",
        "StartDateTime": "2017-03-23T18:24:44Z"
      }
    }
  }
}
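For reference, a small sketch of how one could check which feedback fields are actually populated in a step description like the one above. The helper name `step_feedback` is made up for illustration, and the sample is a trimmed copy of the response above; the sketch only assumes the JSON shape shown there.

```python
import json


def step_feedback(describe_step_output):
    """Collect the non-null feedback fields from a step description."""
    status = describe_step_output["Step"]["Status"]
    reason = status.get("StateChangeReason") or {}
    failure = status.get("FailureDetails") or {}
    candidates = {
        "code": reason.get("Code"),
        "message": reason.get("Message"),
        "failure_message": failure.get("Message"),
        "failure_reason": failure.get("Reason"),
        "log_file": failure.get("LogFile"),
    }
    # Keep only the fields EMR actually filled in.
    return {key: value for key, value in candidates.items() if value is not None}


# Trimmed copy of the response above.
sample = json.loads("""
{
  "Step": {
    "Id": "s-14DXZAQ9JXDYD",
    "Status": {
      "FailureDetails": {
        "LogFile": "s3://ben-test-output/logs/j-3TOU8BN6L2QUX/steps/s-14DXZAQ9JXDYD/",
        "Message": null,
        "Reason": "Unknown Error."
      },
      "State": "FAILED",
      "StateChangeReason": {"Code": null, "Message": null}
    }
  }
}
""")

feedback = step_feedback(sample)
print(feedback)
```

For this response only the failure reason and the log file location survive, so nothing in it distinguishes "no new data" from a genuine failure.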
As a result, we could make do with TERMINATE_WITH_SUCCESS/TERMINATE_WITH_FAILURE, but we wouldn't have any feedback to give.
Shame!
Pushing back as I actually think no-ops are a bit of a red herring and we are better off with #17...
pushing back
See discussion in #11