hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

google_bigquery_job re-run after 6 months #9768

Open onuruluag opened 3 years ago

onuruluag commented 3 years ago

Terraform Version

Terraform v1.0.1 on linux_amd64

Your version of Terraform is out of date! The latest version is 1.0.4. You can update by downloading from https://www.terraform.io/downloads.html

Affected Resource(s)

google_bigquery_job

Terraform Configuration Files

resource "google_bigquery_job" "util_tables_load_job" {
  for_each = local.load_job
  job_id   = format("iter2_util_table_load_%s_%s", trimsuffix(each.value, ".json"), base64encode(google_storage_bucket_object.util_files[each.value].md5hash))
  location = "EU"

  load {
    source_uris = [
      format("gs://%s/%s", google_storage_bucket.deployment_staging_bucket.name, google_storage_bucket_object.util_files[each.value].output_name)
    ]

    destination_table {
      project_id = terraform.workspace
      dataset_id = google_bigquery_dataset.util.dataset_id
      table_id   = trimsuffix(each.value, ".json")
    }

    source_format     = "NEWLINE_DELIMITED_JSON"
    write_disposition = "WRITE_TRUNCATE"

    autodetect = true
  }
}

Expected Behavior

Since the job has already run and there is no change in the JSON file, the BQ load job should not be re-created.

Actual Behavior

After 6 months, GCP deletes the BQ job history, and load jobs are re-created even though there is no change in JSON files.

Steps to Reproduce

  1. terraform apply
  2. confirm load job is created
  3. terraform apply
  4. observe job is not in the deployment plan
  5. wait for 6 months
  6. terraform apply
  7. deployment plan includes BQ load job

The issue stems from how Terraform state management for google_bigquery_job load jobs interacts with GCP's job history lifetime (6 months).

b/371632037

rileykarson commented 3 years ago

From an (admittedly naive) provider's perspective, this is working as intended. Since the job has disappeared from the API, it should be recreated, and we can't distinguish between a nonexistent job and one that used to exist.

The only potential fix we could consider is to not check whether the job is present in the API, but that would actually break import for the resource: we don't know in the Read function how it was called, so if a nonexistent job was just imported, Terraform would report success. Admittedly, this isn't a resource there is much reason to import.
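
For context, a minimal sketch of the current Read behavior being described, in terraform-plugin-sdk style; the function body and the helpers fetchJob and isNotFound are placeholders, not the provider's actual code. On a 404 the resource ID is cleared from state, so the next plan proposes re-creating the job, and import relies on this same Read call to verify that an imported job exists.

package google

import (
    "log"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch of the current 404 handling in Read (illustrative only).
func resourceBigQueryJobRead(d *schema.ResourceData, meta interface{}) error {
    job, err := fetchJob(d.Id(), meta) // placeholder for the provider's Get call
    if err != nil {
        if isNotFound(err) { // placeholder for the provider's 404 check
            // The job is gone (purged history, or it never existed): clearing
            // the ID removes the resource from state, so the next plan
            // proposes creating it again.
            log.Printf("[WARN] BigQuery job %q not found, removing from state", d.Id())
            d.SetId("")
            return nil
        }
        return err
    }
    _ = job // ... flatten the API response into d as usual ...
    return nil
}

// Stand-in helpers so the sketch is self-contained.
func fetchJob(id string, meta interface{}) (interface{}, error) { return nil, nil }
func isNotFound(err error) bool                                 { return false }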

onuruluag commented 3 years ago

I think storing a hash of the source file in state would be appropriate, since the load job should only re-run when the source file changes. We tried to achieve this behavior by including google_storage_bucket_object's md5hash value in the job_id, but that failed once the job history disappeared. As a workaround we could use null_resource with triggers, but that means not using google_bigquery_job at all.

SunilPaiEmpire commented 1 year ago

Running terraform apply for the first time 183 days after a BigQuery job was created caused the same issue for me, but instead of re-creating the job, I get a 404 Not Found and the apply exits with a failure.

Referencing the response by @rileykarson

The only potential fix we could consider is to not check if the job is present in the API

I think this is appropriate given how the provider works natively. A BigQuery Job is created once and cannot be modified. If there is a change to a BigQuery Job resource, this will result in a new Job being created in BigQuery.

but that would actually break import for the resource- we don't know in the Read function how it has been called, so if the resource was just imported Terraform would report success

I think this is appropriate too. Importing a job, whether it still exists in BigQuery or has been purged after the 180-day period, should indicate to Terraform not to create the resource because it already exists/existed. Isn't this in line with other uses of import?

wmcg commented 1 year ago

I'm inclined to agree with @SunilPaiEmpire points above.

It is frustrating because google_bigquery_job could be a really nice way of managing DML changes, but I don't see how it is currently usable on its own.

We are considering a script we can run as part of our workflow that will check for and remove jobs over a certain age, but it complicates what would otherwise be a very tidy process.

wj-chen commented 1 month ago

Revisiting this as the same issue was recently brought up again by another user.

@rileykarson Could you forward it to our internal issue tracker too?

Since the jobs are not persistent past 6 months on the server side, we'll need to think of something at the provider level. Could you elaborate on what "not check if the job is present in the API" would look like? Is it that we don't read the live state at all, since we would never attempt to re-create any BigQuery job in Terraform?

rileykarson commented 1 month ago

Yeah - instead of clearing the resource id on a 404 error (this code), we'd preserve the old state of the resource by immediately exiting Read w/o an error.

We'd have to figure out what updates to a purged job definition mean as part of that, I think. We could track purged jobs with a boolean clientside, and then use a CustomizeDiff to upgrade update plans to deletions, or add a guard to Update so that updates are no-ops.

I think for the import case I was thinking about, we could only allow importing valid jobs by tracking a flag clientside for that: we'd start it false and set it to true after a successful read. Since Import is ultimately a Read call, we'd return an error on a 404 where that flag is still false. Or just take the "import of nonexisting jobs always succeeds" approach from https://github.com/hashicorp/terraform-provider-google/issues/9768#issuecomment-1377615592.
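
A rough sketch of that flag idea, assuming a hypothetical Computed bool schema field (called job_observed here) plus the same placeholder helpers; none of these names exist in the provider today.

package google

import (
    "fmt"
    "log"

    "github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch: preserve state for purged jobs, but still fail import of jobs that
// were never successfully read. "job_observed" is a hypothetical Computed bool
// assumed to be declared in the resource schema.
func resourceBigQueryJobRead(d *schema.ResourceData, meta interface{}) error {
    job, err := fetchJob(d.Id(), meta) // placeholder for the provider's Get call
    if err != nil {
        if isNotFound(err) { // placeholder for the provider's 404 check
            if !d.Get("job_observed").(bool) {
                // Never read successfully before: this is the import /
                // first-refresh case, so surface the 404 as an error.
                return fmt.Errorf("BigQuery job %q does not exist", d.Id())
            }
            // Previously observed, now purged by the ~6-month job history
            // retention: exit Read without touching state, so no re-creation
            // is planned.
            log.Printf("[WARN] BigQuery job %q purged from job history; preserving state", d.Id())
            return nil
        }
        return err
    }
    // Record that the job has been observed at least once.
    if err := d.Set("job_observed", true); err != nil {
        return err
    }
    _ = job // ... flatten the API response into d as usual ...
    return nil
}

// Stand-in helpers so the sketch is self-contained.
func fetchJob(id string, meta interface{}) (interface{}, error) { return nil, nil }
func isNotFound(err error) bool                                 { return false }

The CustomizeDiff or Update guard mentioned above could then key off the same flag to decide how changes to a purged job should be planned.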

wj-chen commented 1 month ago

Yeah - instead of clearing the resource id on a 404 error (this code), we'd preserve the old state of the resource by immediately exiting Read w/o an error.

We'd have to figure out what updates to a purged job definition mean as part of that, I think. We could track purged jobs with a boolean clientside, and then use a CustomizeDiff to upgrade update plans to deletions, or add a guard to Update so that updates are no-ops.

I think for the import case I was thinking about, we could only allow importing valid jobs by tracking a flag clientside for that: we'd start it false and set it to true after a successful read. Since Import is ultimately a Read call, we'd return an error on a 404 where that flag is still false. Or just take the "import of nonexisting jobs always succeeds" approach from #9768 (comment).

Thanks again for the details. I'm trying to see if we can simplify the handling by just preserving any existing state, without having to rely on an additional client-side field. Can you confirm that if we return nil on a 404 from the Get API response inside JobRead(), the job resource in the state file will remain unchanged, and that it will be used for comparison with the user's config file, i.e. diff detection?

@rileykarson Could you correct any misunderstanding above? What do you think of the overall plan?
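
For reference, the simplified handling asked about above would amount to just the 404 branch below (a fragment of the Read sketches earlier in the thread, with isNotFound again a placeholder); whether the preserved attributes then diff as expected against the config is exactly the open question.

// Inside JobRead's error handling (sketch only):
if isNotFound(err) {
    // Returning nil without calling d.SetId("") leaves the resource in state
    // with its last-known attributes, and Terraform diffs the config against
    // those preserved values on the next plan.
    log.Printf("[WARN] BigQuery job %q purged from job history; keeping prior state", d.Id())
    return nil
}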