treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

Save job_id of bq operator in commandStatus #1808

Open hnarimiya opened 1 year ago

hnarimiya commented 1 year ago

issue

When the bq operator retries, it reuses the same job_id, so the retry does not run a new BigQuery job. As a result, retrying cannot recover even when the failure is a temporary BigQuery problem, such as the following error:

{
  "message" : "Error encountered during execution. Retrying may solve the problem.",
  "reason" : "backendError"
}

https://cloud.google.com/bigquery/docs/error-messages

resolution

To fix this, I now save job_id under commandStatus. BaseOperator then removes it on each run, so every retry submits a job with a new job_id.
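
The behaviour described above can be illustrated with a small Python simulation. This is only a sketch of the idea, not the actual Java change: `BqOperatorSketch`, `run_with_retries`, the `digdag_` prefix, and the dictionary layout of the state are hypothetical names chosen for illustration, and the step that drops `commandStatus` on each run is assumed from the description above.

```python
import uuid


class BqOperatorSketch:
    """Toy stand-in for the bq operator; not Digdag's actual code."""

    def __init__(self, state):
        # State that is carried over between retries of the same task.
        self.state = state

    def job_id(self, under_command_status):
        # Where job_id is stored decides whether it survives a retry.
        if under_command_status:
            holder = self.state.setdefault("commandStatus", {})
        else:
            holder = self.state
        if "job_id" not in holder:
            holder["job_id"] = f"digdag_{uuid.uuid4()}"
        return holder["job_id"]


def run_with_retries(under_command_status, attempts=3):
    state = {}
    submitted = []
    for _ in range(attempts):
        # Assumption: commandStatus is removed at the start of each run.
        state.pop("commandStatus", None)
        operator = BqOperatorSketch(state)
        submitted.append(operator.job_id(under_command_status))
    return submitted


before_fix = run_with_retries(under_command_status=False)
after_fix = run_with_retries(under_command_status=True)
print(len(set(before_fix)), "distinct job_id(s) before the fix")
print(len(set(after_fix)), "distinct job_id(s) after the fix")
```

With job_id stored outside `commandStatus`, every attempt in the simulation reuses the first job_id, which matches the "before fix" log. With job_id stored under `commandStatus`, each attempt generates a fresh one, which matches the "after fix" log.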

log

before fix

2023-05-17 19:48:54 +0900 [INFO] (0082@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:54 +0900 [INFO] (0082@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:55 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:55 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:48:55 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): Task failed, retrying
io.digdag.spi.TaskExecutionException: io.digdag.spi.TaskExecutionException: BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
...
2023-05-17 19:48:55 +0900 [INFO] (0082@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:55 +0900 [INFO] (0082@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:55 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:55 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:48:55 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): Task failed, retrying
io.digdag.spi.TaskExecutionException: io.digdag.spi.TaskExecutionException: BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
...
2023-05-17 19:48:55 +0900 [INFO] (0082@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:55 +0900 [INFO] (0082@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:56 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:56 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:48:56 +0900 [ERROR] (0082@[0:new-project:1:1]+new+retry): Task +new+retry failed.
BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_a54c0454-d88f-4f83-bea4-fb01049f7302
2023-05-17 19:48:56 +0900 [INFO] (0082@[0:new-project:1:1]+new^failure-alert): type: notify

after fix

2023-05-17 19:58:46 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_e3bf09df-e9c8-4d96-b6d7-ea2c47501806
2023-05-17 19:58:46 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_e3bf09df-e9c8-4d96-b6d7-ea2c47501806
2023-05-17 19:58:46 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_e3bf09df-e9c8-4d96-b6d7-ea2c47501806
2023-05-17 19:58:46 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:58:46 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): Task failed, retrying
io.digdag.spi.TaskExecutionException: io.digdag.spi.TaskExecutionException: BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_e3bf09df-e9c8-4d96-b6d7-ea2c47501806
...
2023-05-17 19:58:47 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_757ec40d-986e-4a7e-8d39-006d4d432f7e
2023-05-17 19:58:47 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_757ec40d-986e-4a7e-8d39-006d4d432f7e
2023-05-17 19:58:47 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_757ec40d-986e-4a7e-8d39-006d4d432f7e
2023-05-17 19:58:47 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:58:47 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): Task failed, retrying
io.digdag.spi.TaskExecutionException: io.digdag.spi.TaskExecutionException: BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_757ec40d-986e-4a7e-8d39-006d4d432f7e
...
2023-05-17 19:58:47 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_2c601416-663d-4df8-a42b-bcc98ddaa9a7
2023-05-17 19:58:48 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_2c601416-663d-4df8-a42b-bcc98ddaa9a7
2023-05-17 19:58:48 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_2c601416-663d-4df8-a42b-bcc98ddaa9a7
2023-05-17 19:58:48 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:58:48 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): Task failed, retrying
io.digdag.spi.TaskExecutionException: io.digdag.spi.TaskExecutionException: BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_2c601416-663d-4df8-a42b-bcc98ddaa9a7
...
2023-05-17 19:58:48 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Submitting BigQuery job: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_ad9cadbf-8f69-4638-97de-bf751f350a3b
2023-05-17 19:58:48 +0900 [INFO] (0079@[0:new-project:1:1]+new+retry): Checking BigQuery job status: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_ad9cadbf-8f69-4638-97de-bf751f350a3b
2023-05-17 19:58:49 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_ad9cadbf-8f69-4638-97de-bf751f350a3b
2023-05-17 19:58:49 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): {
  "location" : "query",
  "message" : "Unrecognized name: HOGE at [1:8]",
  "reason" : "invalidQuery"
}
2023-05-17 19:58:49 +0900 [ERROR] (0079@[0:new-project:1:1]+new+retry): Task +new+retry failed.
BigQuery job failed: xxx:digdag_s0_p_new-project_w_new_t_2_a_1_ad9cadbf-8f69-4638-97de-bf751f350a3b

hnarimiya commented 1 year ago

@yoyama @szyn Could you please review this?