hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

google_bigquery_job re-run after 6 months #9768

Open onuruluag opened 3 years ago

onuruluag commented 3 years ago

Terraform Version

Terraform v1.0.1 on linux_amd64

Your version of Terraform is out of date! The latest version is 1.0.4. You can update by downloading from https://www.terraform.io/downloads.html

Affected Resource(s)

google_bigquery_job

Terraform Configuration Files

resource "google_bigquery_job" "util_tables_load_job" {
  for_each = local.load_job
  job_id   = format("iter2_util_table_load_%s_%s", trimsuffix(each.value, ".json"), base64encode(google_storage_bucket_object.util_files[each.value].md5hash))
  location = "EU"

  load {
    source_uris = [
      format("gs://%s/%s", google_storage_bucket.deployment_staging_bucket.name, google_storage_bucket_object.util_files[each.value].output_name)
    ]

    destination_table {
      project_id = terraform.workspace
      dataset_id = google_bigquery_dataset.util.dataset_id
      table_id   = trimsuffix(each.value, ".json")
    }

    source_format     = "NEWLINE_DELIMITED_JSON"
    write_disposition = "WRITE_TRUNCATE"

    autodetect = true
  }
}

Expected Behavior

Since the job has already run and there is no change in the JSON file, the BQ load job should not be re-created.

Actual Behavior

After 6 months, GCP deletes the BQ job history, and load jobs are re-created even though there is no change in JSON files.

Steps to Reproduce

  1. terraform apply
  2. confirm load job is created
  3. terraform apply
  4. observe job is not in the deployment plan
  5. wait for 6 months
  6. terraform apply
  7. deployment plan includes BQ load job

The issue stems from how Terraform state management for google_bigquery_job load jobs interacts with GCP's job history lifetime (6 months).

b/371632037

rileykarson commented 3 years ago

From an (admittedly naive) provider's perspective, this is working as intended. Since the job has disappeared from the API, it should be recreated, and we can't distinguish between a nonexistent job and one that used to exist.

The only potential fix we could consider is to not check whether the job is present in the API, but that would actually break import for the resource: we don't know in the Read function how it was called, so if a nonexistent job was just imported, Terraform would report success. Admittedly, this isn't a resource there is much reason to import.
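
For context, a minimal sketch of the current Read behavior being described, in terraform-plugin-sdk style; the function body and the helpers fetchJob and isNotFound are placeholders, not the provider's actual code. On a 404 the resource ID is cleared from state, so the next plan proposes re-creating the job, and import relies on this same Read call to verify that an imported job exists.

package google

import (
    "log"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch of the current 404 handling in Read (illustrative only).
func resourceBigQueryJobRead(d *schema.ResourceData, meta interface{}) error {
    job, err := fetchJob(d.Id(), meta) // placeholder for the provider's Get call
    if err != nil {
        if isNotFound(err) { // placeholder for the provider's 404 check
            // The job is gone (purged history, or it never existed): clearing
            // the ID removes the resource from state, so the next plan
            // proposes creating it again.
            log.Printf("[WARN] BigQuery job %q not found, removing from state", d.Id())
            d.SetId("")
            return nil
        }
        return err
    }
    _ = job // ... flatten the API response into d as usual ...
    return nil
}

// Stand-in helpers so the sketch is self-contained.
func fetchJob(id string, meta interface{}) (interface{}, error) { return nil, nil }
func isNotFound(err error) bool                                 { return false }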

onuruluag commented 3 years ago

I think storing a hash of the source file in state would be appropriate, since the load job should only re-run when the source file changes. We tried to achieve this behavior by including google_storage_bucket_object's md5hash value in the job_id, but that failed once the job history disappeared. As a workaround we could use null_resource with triggers, but that means not using google_bigquery_job at all.

SunilPaiEmpire commented 1 year ago

Running terraform apply for the first time 183 days after a BigQuery job was created caused the same issue for me, but instead of re-creating the job, I get a 404 Not Found and the apply exits with a failure.

Referencing the response by @rileykarson

The only potential fix we could consider is to not check if the job is present in the API

I think this is appropriate given how the provider works natively. A BigQuery Job is created once and cannot be modified. If there is a change to a BigQuery Job resource, this will result in a new Job being created in BigQuery.

but that would actually break import for the resource- we don't know in the Read function how it has been called, so if the resource was just imported Terraform would report success

I think this is appropriate too. Importing a job, whether it still exists in BigQuery or has been purged after the 180-day period, should indicate to Terraform not to create the resource because it already exists/existed. Isn't this in line with other uses of import?

wmcg commented 1 year ago

I'm inclined to agree with @SunilPaiEmpire points above.

It is frustrating because google_bigquery_job could be a really nice way of managing DML changes, but I don't see how it is currently usable on its own.

We are considering a script we can run as part of our workflow that will check for and remove jobs over a certain age, but it complicates what would otherwise be a very tidy process.

wj-chen commented 1 month ago

Revisiting this as the same issue was recently brought up again by another user.

@rileykarson Could you forward it to our internal issue tracker too?

Since the jobs are not persistent past 6 months on the server side, we'll need to think of something at the provider level. Could you elaborate on what "not check if the job is present in the API" would look like? Is it that we don't read the live state at all, since we would never attempt to re-create any BigQuery job in Terraform?

rileykarson commented 1 month ago

Yeah - instead of clearing the resource id on a 404 error (this code), we'd preserve the old state of the resource by immediately exiting Read w/o an error.

We'd have to figure out what updates to a purged job definition mean as part of that, I think. We could track purged jobs with a boolean clientside, and then use a CustomizeDiff to upgrade update plans to deletions, or add a guard to Update so that updates are no-ops.

I think for the import case I was thinking about, we could only allow importing valid jobs by tracking a flag clientside for that: we'd start it false and set it to true after a successful read. Since Import is ultimately a Read call, we'd return an error on a 404 where that flag is still false. Or just take the "import of nonexisting jobs always succeeds" approach from https://github.com/hashicorp/terraform-provider-google/issues/9768#issuecomment-1377615592.
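
A rough sketch of that flag idea, assuming a hypothetical Computed bool schema field (called job_observed here) plus the same placeholder helpers; none of these names exist in the provider today.

package google

import (
    "fmt"
    "log"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch: preserve state for purged jobs, but still fail import of jobs that
// were never successfully read. "job_observed" is a hypothetical Computed bool
// assumed to be declared in the resource schema.
func resourceBigQueryJobRead(d *schema.ResourceData, meta interface{}) error {
    job, err := fetchJob(d.Id(), meta) // placeholder for the provider's Get call
    if err != nil {
        if isNotFound(err) { // placeholder for the provider's 404 check
            if !d.Get("job_observed").(bool) {
                // Never read successfully before: this is the import /
                // first-refresh case, so surface the 404 as an error.
                return fmt.Errorf("BigQuery job %q does not exist", d.Id())
            }
            // Previously observed, now purged by the ~6-month job history
            // retention: exit Read without touching state, so no re-creation
            // is planned.
            log.Printf("[WARN] BigQuery job %q purged from job history; preserving state", d.Id())
            return nil
        }
        return err
    }
    // Record that the job has been observed at least once.
    if err := d.Set("job_observed", true); err != nil {
        return err
    }
    _ = job // ... flatten the API response into d as usual ...
    return nil
}

// Stand-in helpers so the sketch is self-contained.
func fetchJob(id string, meta interface{}) (interface{}, error) { return nil, nil }
func isNotFound(err error) bool                                 { return false }

The CustomizeDiff or Update guard mentioned above could then key off the same flag to decide how changes to a purged job should be planned.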

wj-chen commented 1 month ago

Yeah - instead of clearing the resource id on a 404 error (this code), we'd preserve the old state of the resource by immediately exiting Read w/o an error.

We'd have to figure out what updates to a purged job definition mean as part of that, I think. We could track purged jobs with a boolean clientside, and then use a CustomizeDiff to upgrade update plans to deletions, or add a guard to Update so that updates are no-ops.

I think for the import case I was thinking about, we could only allow importing valid jobs by tracking a flag clientside for that: we'd start it false and set it to true after a successful read. Since Import is ultimately a Read call, we'd return an error on a 404 where that flag is still false. Or just take the "import of nonexisting jobs always succeeds" approach from #9768 (comment).

Thanks again for the details. I'm trying to see if we can simplify the handling by just preserving any existing state, without having to rely on an additional client-side field. Can you confirm that if we return nil on a 404 from the Get API response inside JobRead(), the job resource in the state file will remain unchanged, and that it will be used for comparison with the user's config file, i.e. diff detection?

@rileykarson Could you correct any misunderstanding above? What do you think of the overall plan?
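
For reference, the simplified handling asked about above would amount to just the 404 branch below (a fragment of the Read sketches earlier in the thread, with isNotFound again a placeholder); whether the preserved attributes then diff as expected against the config is exactly the open question.

// Inside JobRead's error handling (sketch only):
if isNotFound(err) {
    // Returning nil without calling d.SetId("") leaves the resource in state
    // with its last-known attributes, and Terraform diffs the config against
    // those preserved values on the next plan.
    log.Printf("[WARN] BigQuery job %q purged from job history; keeping prior state", d.Id())
    return nil
}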