@AndrewShakinovsky-SAS Can you please review this :-)
@AndrewShakinovsky-SAS I have thought about adding a limited number of retries if we continuously get HTTP errors. The limit would then have to be set quite large (a couple of hours), due to the self-healing nature of K8s. The most important thing is for Airflow to learn the state of the job it has launched as quickly as possible. If, for instance, the state call fails and we have enabled retries for the task, Airflow will launch yet another job. Then the same job is running twice, resulting in a lot of conflicts (locked tables, catalogs, etc.). In the code change I have suggested, a note is written to the DAG log every time a request fails, so the person monitoring DAGs in Airflow becomes aware of it. Any comments?
@torbenjuul I agree that someone might want it to continue retrying indefinitely. But I also believe that someone else might want to abort the run if the system is getting these kinds of errors. Could you add it as a parameter on the operator? In other words, if it is supplied then it will use the retry count, but if not (which could be the default), then it would continue indefinitely?
@AndrewShakinovsky-SAS What if I add this parameter:
:param unknown_state_timeout: (optional) number of seconds to continue polling for the state of a running job when the state is temporarily unobtainable. When unknown_state_timeout is reached without the state being retrievable, the operator throws an AirflowFailException and the task is marked as failed. The default value is 0, meaning the task fails immediately if the state cannot be retrieved.
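A minimal sketch of the intended semantics (the `fetch_state` callable and the helper name are illustrative, not the operator's actual code):

```python
import time

from airflow.exceptions import AirflowFailException


def get_state_with_timeout(fetch_state, poll_interval=10, unknown_state_timeout=0):
    """Return the job state, tolerating failed requests for up to
    unknown_state_timeout seconds before failing the task."""
    deadline = time.monotonic() + unknown_state_timeout
    while True:
        try:
            # e.g. an HTTP call to the job's state endpoint
            return fetch_state()
        except Exception as err:
            if time.monotonic() >= deadline:
                # Timeout reached without retrieving the state:
                # fail the task without further Airflow retries.
                raise AirflowFailException(
                    f"Job state unobtainable for {unknown_state_timeout}s: {err}"
                )
            print(f"Could not retrieve job state, retrying in {poll_interval}s: {err}")
            time.sleep(poll_interval)
```

With the default of 0 the deadline is already in the past on the first failure, so the task fails immediately, matching the docstring above.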
@torbenjuul That sounds perfect.
@AndrewShakinovsky-SAS The parameter 'unknown_state_timeout' has now been committed.
…of a job, we set state = "unknown", continue to poll, and print the error to the log. Changed poll_interval to 10s, thereby reducing the load on the environment when many tasks are running in parallel.
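For reference, a rough sketch of that polling behaviour (the `get_job_state` and `is_terminal` helpers are hypothetical, not the provider's real API, and the unknown_state_timeout handling sketched earlier is omitted for brevity):

```python
import logging
import time


def monitor_job(get_job_state, is_terminal, poll_interval=10):
    """Poll the job until it reaches a terminal state, treating failed
    requests as state "unknown" instead of failing the task."""
    state = "unknown"
    while not is_terminal(state):
        try:
            state = get_job_state()  # HTTP request for the current job state
        except Exception as err:
            # Request failed: fall back to "unknown", log the error and keep polling.
            state = "unknown"
            logging.error("Could not retrieve job state: %s", err)
        # 10s between polls keeps the load down when many tasks run in parallel.
        time.sleep(poll_interval)
    return state
```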