@AndrewShakinovsky-SAS Can you please review this :-)
@AndrewShakinovsky-SAS I have thought about adding a limited number of retries if we continuously get HTTP errors. The limit would then have to be set quite large (a couple of hours), due to the self-healing nature of K8s. The most important thing is for Airflow to learn the state of the job it has launched as quickly as possible. If, for instance, the state call fails and we have enabled retries for the task, Airflow will launch yet another job. Then the same job is running twice, resulting in a lot of conflicts (locked tables, catalogs, etc.). In the code change I have suggested, a note is written to the DAG log every time a request fails, so the person monitoring DAGs in Airflow becomes aware of it. Any comments?
@torbenjuul I agree that someone might want it to continue retrying indefinitely. But I also believe that someone else might want to abort the run if the system is getting these kinds of errors. Could you add it as a parameter on the operator? In other words, if it is supplied then it will use the retry count, but if not (which could be the default), then it would continue indefinitely?
@AndrewShakinovsky-SAS What if I add this parameter:
:param unknown_state_timeout: (optional) number of seconds to continue polling for the state of a running job when the state is temporarily unobtainable. When unknown_state_timeout is reached without the state being retrievable, the operator throws an AirflowFailException and the task is marked as failed. The default value is 0, meaning the task fails immediately if the state cannot be retrieved.
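A minimal sketch of the intended semantics (the `fetch_state` callable and the helper name are illustrative, not the operator's actual code):

```python
import time

from airflow.exceptions import AirflowFailException


def get_state_with_timeout(fetch_state, poll_interval=10, unknown_state_timeout=0):
    """Return the job state, tolerating failed requests for up to
    unknown_state_timeout seconds before failing the task."""
    deadline = time.monotonic() + unknown_state_timeout
    while True:
        try:
            # e.g. an HTTP call to the job's state endpoint
            return fetch_state()
        except Exception as err:
            if time.monotonic() >= deadline:
                # Timeout reached without retrieving the state:
                # fail the task without further Airflow retries.
                raise AirflowFailException(
                    f"Job state unobtainable for {unknown_state_timeout}s: {err}"
                )
            print(f"Could not retrieve job state, retrying in {poll_interval}s: {err}")
            time.sleep(poll_interval)
```

With the default of 0 the deadline is already in the past on the first failure, so the task fails immediately, matching the docstring above.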
@torbenjuul That sounds perfect.
@AndrewShakinovsky-SAS The parameter 'unknown_state_timeout' has now been committed.
…of a job, we set state = "unknown", continue to poll, and print the error to the log. Changed poll_interval to 10s, thereby reducing the load on the environment when many tasks are running in parallel.
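For reference, a rough sketch of that polling behaviour (the `get_job_state` and `is_terminal` helpers are hypothetical, not the provider's real API, and the unknown_state_timeout handling sketched earlier is omitted for brevity):

```python
import logging
import time


def monitor_job(get_job_state, is_terminal, poll_interval=10):
    """Poll the job until it reaches a terminal state, treating failed
    requests as state "unknown" instead of failing the task."""
    state = "unknown"
    while not is_terminal(state):
        try:
            state = get_job_state()  # HTTP request for the current job state
        except Exception as err:
            # Request failed: fall back to "unknown", log the error and keep polling.
            state = "unknown"
            logging.error("Could not retrieve job state: %s", err)
        # 10s between polls keeps the load down when many tasks run in parallel.
        time.sleep(poll_interval)
    return state
```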