hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

Failing test(s): TestAccTPUNode_tpuNodeFullExample #12901

Open melinath opened 2 years ago

melinath commented 2 years ago

Affected Resource(s)

Failure rate: 100% since 2022-10-08
Failure rate: 32% in Mar 2023

Impacted tests:

Nightly builds:

Message:

Error: Error waiting to create Node: Error waiting for Creating Node: Error code 3, message: Cloud TPU was unable to complete the operation. Please try again, or contact support if the problem persists. [EID: 0xd8dde9434b8d0ab5]

Note: this is separate from https://github.com/hashicorp/terraform-provider-google/issues/10222, which is flaky due to capacity issues.

AlfatahB commented 1 year ago

TestAccTPUNode_tpuNodeFullExample
Google Cloud - 35.2% failure
Google Cloud Beta - 36% failure

AlfatahB commented 1 year ago

b/261834151

SarahFrench commented 1 year ago

There are also failures like this that don't happen consistently:

provider_test.go:320: Step 1/2 error: After applying this test step, the plan was not empty.
stdout:
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  ~ update in-place
Terraform will perform the following actions:
  # google_tpu_node.tpu will be updated in-place
  ~ resource "google_tpu_node" "tpu" {
        id                     = "projects/[PROJECT]/locations/us-central1-b/nodes/tf-test-test-tpubdrsxsbxw1"
        name                   = "tf-test-test-tpubdrsxsbxw1"
      ~ tensorflow_version     = "1.15.3" -> "1.15.4"
        # (10 unchanged attributes hidden)
        # (1 unchanged block hidden)
    }
Plan: 0 to add, 1 to change, 0 to destroy.

In the config, tensorflow_version is set using data.google_tpu_tensorflow_versions.available.versions[0]. I wonder if the diff appears because the google_tpu_tensorflow_versions data source returns different values between the first plan+apply and the second plan step.

Maybe by provisioning something in a given zone we affect the "zonal availability" of TPU resources in that zone, and that affects the values returned by projects.locations.tensorflowVersions/list?
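For readers following along, the pattern described above looks roughly like the sketch below. It is a trimmed-down illustration, not the actual test config: the node name, accelerator_type, and cidr_block values are assumptions chosen for the example, and only the zone and the tensorflow_version expression come from this thread.

```hcl
# Data source that lists the TensorFlow versions currently offered for TPU nodes.
# The list can change between reads, which is the suspected cause of the diff.
data "google_tpu_tensorflow_versions" "available" {
}

resource "google_tpu_node" "tpu" {
  name = "tf-test-tpu-example" # hypothetical name
  zone = "us-central1-b"       # zone shown in the plan output

  accelerator_type = "v3-8"        # assumed value for illustration
  cidr_block       = "10.2.0.0/29" # assumed value for illustration

  # Re-evaluated on every plan. If versions[0] is "1.15.3" during apply and
  # "1.15.4" on the follow-up plan, Terraform reports an in-place update.
  tensorflow_version = data.google_tpu_tensorflow_versions.available.versions[0]
}
```

If the first element of the list changes between the apply and the follow-up plan, the config value no longer matches state, which would produce exactly the non-empty plan shown in the failure.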

roaks3 commented 1 year ago

This is now only failing with the error in @SarahFrench's comment. data.google_tpu_tensorflow_versions.available.versions[0] unfortunately selects the oldest available version of TensorFlow, which I don't think we want for these tests. That may be related to the inconsistency (the idea of the test impacting zonal availability seems plausible too). Note that data.google_tpu_tensorflow_versions.available.versions also includes versions that are not stable releases, so we can't just use the last version in the list. IMO we will most likely need to change these tests back to using a hard-coded TensorFlow version.
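A hard-coded version, as suggested above, might look like the following sketch. The version string is a placeholder copied from the plan output earlier in this thread purely for illustration; which version the tests should pin is not settled here, and the node name, accelerator_type, and cidr_block values are assumptions.

```hcl
resource "google_tpu_node" "tpu" {
  name = "tf-test-tpu-example" # hypothetical name
  zone = "us-central1-b"

  accelerator_type = "v3-8"        # assumed value for illustration
  cidr_block       = "10.2.0.0/29" # assumed value for illustration

  # Pinned literal instead of data.google_tpu_tensorflow_versions...versions[0],
  # so the value cannot drift between the apply and the follow-up plan.
  # "1.15.4" is a placeholder, not a recommendation.
  tensorflow_version = "1.15.4"
}
```

The trade-off is that a pinned version has to be updated by hand when the API stops offering it, whereas the data source approach adapts automatically but can produce the non-empty plan seen above.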

This test failed at 32% in Mar 2023, and it does come up for some of the other TPU tests as well.