Open mromascanu123 opened 2 months ago
Reference: last full 3-networks-hub-and-spoke apply - up to 5-app-infra - TF 1.3.10 to avoid the issue running cloudbuild with 1.3 - if we use the default 1.7.5 (since downgraded in 1.5.7 in cloud shell) Env: cloud shell and CB/CSR
I will also retest 3-nhas as soon as I finish the TEF upstream sync for 20240511 main in https://github.com/GoogleCloudPlatform/pbmm-on-gcp-onboarding/issues/387. There are 2 symlinks in nonproduction that need to be reverted in #1107, but they work with a double symlink for now.
This occurs when there are too many simultaneous operations on peering. It is not occurring in our integration tests. Are the environments being deployed in parallel? One workaround is to limit Terraform to one concurrent operation (the -parallelism=1 option on plan/apply), but the build will take much longer because resources are no longer created in parallel.
@sleighton2022: there is no parallelism here and the deployment is done manually. It does not happen every time, not even often. I hit one of these on 05/30 and another one today. However, the problem is deeper and nastier. In both cases the error was associated with tfstate corruption. On 05/30 it occurred during the "apply" for 3-nhas production, and today during the "apply" for 3-nhas development. Superficially, a retry (tf plan, then apply) seemed to fix the issue both times, and 3-nhas appeared to deploy without error. In reality, the tfstate for the stage where the error occurred (prod on 05/30, dev today) was corrupted and was missing variables that should have been generated by outputs.tf. As a result, when deploying 4-projects these variables are not found and the deployment fails for good.
Example: after today's failed deployment I compared the tfstate files under the key "networks". Prod and nprod contained the same output variables (with different values), but quite a few were missing for dev.
More precisely, the outputs below were missing, and possibly others:
"base_network_self_link": { value = module.base_env.base_network_self_link description = "The URI of the VPC being created" }
"base_subnets_secondary_ranges": { value = module.base_env.base_subnets_secondary_ranges description = "The secondary ranges associated with these subnets" }
"base_subnets_self_links": { value = module.base_env.base_subnets_self_links description = "The self-links of subnets being created" }
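The comparison described above can be automated. The following is a minimal sketch, assuming each environment's state has been exported to a local JSON file (for example with `terraform state pull > development.tfstate`) and that the state uses the usual top-level `outputs` map. The function names and file names are hypothetical and are not part of TEF.

```python
import json


def load_state(path: str) -> dict:
    """Load a Terraform state file (JSON) from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def output_names(state: dict) -> set[str]:
    """Return the names of the root-module outputs recorded in a state document."""
    # Assumption: outputs live under the top-level "outputs" key of the state JSON.
    return set(state.get("outputs", {}))


def missing_outputs(states: dict[str, dict]) -> dict[str, set[str]]:
    """For each environment, report outputs that exist in at least one other
    environment's state but are absent from this one."""
    all_names = set().union(*(output_names(s) for s in states.values()))
    return {env: all_names - output_names(s) for env, s in states.items()}
```

Called with something like `missing_outputs({"development": load_state("development.tfstate"), "production": load_state("production.tfstate")})`, any non-empty set in the result flags an environment whose state lacks outputs the others have, which could be checked before starting 4-projects.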
Interestingly, the same issue has been reported with the project-factory module, apparently not directly related to TEF. The people reporting it think the error points to a race condition:
Cloud DNS and Peering - Terraform Providers / Google - HashiCorp Discuss
Peering Fails with "There is a peering operation in progress" · Issue #3026 · hashicorp/terraform-provider-google (github.com)
GCP Peering does not work · Issue #3034 · hashicorp/terraform-provider-google (github.com)
TL;DR
This happens almost every time when deploying dev, nprod or prod. I have to plan and apply again, and then everything is fine. But this kind of error will ruin any pipeline that deploys the spokes automatically.
. . .
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter.regular_service_perimeter: Creating...
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter.regular_service_perimeter: Creation complete after 3s [id=accessPolicies/6329355927/servicePerimeters/sp_n_shared_restricted_default_perimeter_e480]
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter_resource.service_perimeter_resource["115822756025"]: Creating...
module.base_env.module.restricted_shared_vpc[0].module.regular_service_perimeter.google_access_context_manager_service_perimeter_resource.service_perimeter_resource["115822756025"]: Creation complete after 2s [id=accessPolicies/6329355927/servicePerimeters/sp_n_shared_restricted_default_perimeter_e480/projects/115822756025]
module.base_env.module.restricted_shared_vpc[0].google_access_context_manager_service_perimeter.bridge_to_network_hub_perimeter[0]: Creating...
module.base_env.module.restricted_shared_vpc[0].google_access_context_manager_service_perimeter.bridge_to_network_hub_perimeter[0]: Creation complete after 0s [id=accessPolicies/6329355927/servicePerimeters/spb_c_to_n_shared_restricted_bridge_e480]

Error: Error adding network peering: googleapi: Error 400: There is a route operation in progress on the local or peer network. Try again later., badRequest

  with module.base_env.module.base_shared_vpc[0].module.peering[0].google_compute_network_peering.peer_network_peering,
  on .terraform/modules/base_env.base_shared_vpc.peering/modules/network-peering/main.tf line 50, in resource "google_compute_network_peering" "peer_network_peering":
  50: resource "google_compute_network_peering" "peer_network_peering" {
I did not investigate in detail what is going on; it might be a race condition or an unaccounted-for dependency.
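Until the root cause is known, an automated pipeline could at least retry on this specific transient error instead of failing outright. The sketch below is one possible approach, not a confirmed fix: the function names, the number of attempts, and the delay are all assumptions, and the matched strings are simply the error texts quoted in this thread. Because a retry has been seen here to leave the state missing outputs, a successful retry should still be followed by a check of the stage outputs.

```python
import subprocess
import time
from typing import Callable, Sequence

# Error texts quoted in this thread that indicate a transient peering/route conflict.
TRANSIENT_MARKERS = (
    "There is a route operation in progress on the local or peer network",
    "There is a peering operation in progress",
)


def is_transient(output: str) -> bool:
    """Return True if the command output contains one of the known transient errors."""
    return any(marker in output for marker in TRANSIENT_MARKERS)


def run_command(cmd: Sequence[str]) -> tuple[int, str]:
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def apply_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,            # hypothetical default
    delay_seconds: float = 60.0,  # hypothetical default
    run: Callable[[Sequence[str]], tuple[int, str]] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Run cmd, retrying only when it fails with a known transient error.

    The wait grows linearly with the attempt number. Any other failure is
    returned immediately so real errors are not masked.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        code, output = run(cmd)
        if code == 0 or not is_transient(output) or attempt == attempts:
            return code, output
        sleep(delay_seconds * attempt)
    raise AssertionError("unreachable")
```

A pipeline step might call `apply_with_retry(["terraform", "apply", "-input=false", "-auto-approve"])` from the environment directory and fail the build if the returned exit code is non-zero.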
Expected behavior
It should deploy smoothly on the first attempt. Why does the second attempt succeed?
Observed behavior
See the TL;DR. Error: Error adding network peering: googleapi: Error 400: There is a route operation in progress on the local or peer network. Try again later., badRequest
Terraform Configuration
Terraform Version
Additional information
No response