Open tlines2016 opened 1 month ago
Not possible to reproduce since it involves thousands of projects
We're looking into this
I've found a way to make ordering not matter for fields without making them sets. I will need to perform some benchmarking to make sure this will improve performance, though.
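(Not necessarily the approach referenced above, but as a rough sketch of the general idea: order-insensitive comparison of a list field can be done without converting the field to a set, e.g. by comparing sorted copies of the old and new values. All names below are hypothetical.)

```go
// Hypothetical helper: order-insensitive equality for a list of strings,
// without converting the field to a schema.Set. Not the provider's actual code.
package main

import (
	"fmt"
	"sort"
)

func equalIgnoringOrder(old, new []string) bool {
	if len(old) != len(new) {
		return false
	}
	a := append([]string(nil), old...)
	b := append([]string(nil), new...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equalIgnoringOrder(
		[]string{"projects/1", "projects/2"},
		[]string{"projects/2", "projects/1"},
	)) // true
}
```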
Terraform Version & Provider Version(s)
Terraform v1.2.2 & v1.8.3 (both versions yield the same result) on
Affected Resource(s)
google_access_context_manager_service_perimeter
Terraform Configuration
Config Before the Change
Config After the Change
Terraform Resource Config
Terraform Module
Just to provide a sense of scale: the existing VPC-SC Service Perimeter currently being managed by the above resource has both the Enforced Perimeter and the Dry Run Perimeter created. This perimeter is rather large, with 15,000+ projects within Enforced, and those same 15,000+ projects are also in Dry Run. The management of projects within the perimeter is handled by a separate API, not by the above resource.
We also have 40+ Ingress Policies within both enforced and dry run, as well as 40+ Egress Policies within both. Each Ingress/Egress Policy contains a varying number of Identities and Resources, coming out to a total of 2,500+ Ingress/Egress Attributes.
With the scale in mind, below is just a small portion of the Terraform Module.
Debug Output
Below is a test with provider v5.40.0; the same results occur with v5.44.0 as well. The dag/walk step repeats over and over for 20+ minutes until it finally completes.
Expected Behavior
With Provider v4.X, the runtime of Terraform Plan and Terraform Apply is around 10-20 seconds. This includes the "Refreshing State" step for the Service Perimeter resource, which appears to be the step whose runtime has increased substantially. When bumping the provider version up to v5.X, we expected the runtime to remain somewhat consistent and stay in that 10-20 second range, or at least below 1 minute. However, when the provider attempts to refresh its state, it has instead gone from 10-20 seconds all the way up to 10 to 20 minutes.
Actual Behavior
When using Provider v5.X for the Access Context Manager Service Perimeter resource, the runtimes of our Terraform Plan and Apply have absolutely skyrocketed, particularly during the "Refreshing State" step for the Service Perimeter resource, where it went from that 10-20 seconds to 10 to 20 minutes just to refresh the state. We investigated whether any sort of quota limits are being hit; that doesn't appear to be the case, and the GET request to fetch the service perimeter still only takes place once, just like it did with provider v4.X.
Potential Reason for Performance Issues
With the bump to v5.x of terraform-provider-google, the google_access_context_manager_service_perimeter resource has switched from using TypeList to using TypeSet for the majority of the lists defined in the resource. Reference: the 5.0.0 Upgrade Guide. You can also find the code updates within this commit.
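For illustration only, the shape of that change looks roughly like the following simplified sketch (not the provider's actual schema; it uses terraform-plugin-sdk v2 and schema.HashString as an assumed set function):

```go
// Simplified sketch of the schema change (not the provider's actual code).
package sketch

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// v4.x-style declaration: an ordered list of strings.
var resourcesV4 = &schema.Schema{
	Type:     schema.TypeList,
	Optional: true,
	Elem:     &schema.Schema{Type: schema.TypeString},
}

// v5.x-style declaration: a set of strings; every element is hashed
// (here via schema.HashString) whenever the set is built or compared.
var resourcesV5 = &schema.Schema{
	Type:     schema.TypeSet,
	Optional: true,
	Elem:     &schema.Schema{Type: schema.TypeString},
	Set:      schema.HashString,
}
```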
So on every refresh, the flatten and expand functions run for every single one of the lists within our Ingress Policies and Egress Policies, in addition to the resources within the perimeter.
Provider v4.X Expand & Flatten Functions
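(Simplified illustration of the v4.X pattern, not the provider's exact functions: flatten passes the API values through as a plain ordered list and expand reads them back, with no hashing involved.)

```go
// Simplified sketch of the v4.x-era pattern (not the provider's exact code).
package sketch

// flatten: return the API values as a plain ordered list.
func flattenResourcesV4(apiResources []string) []interface{} {
	out := make([]interface{}, 0, len(apiResources))
	for _, r := range apiResources {
		out = append(out, r)
	}
	return out
}

// expand: read the configured list back into a string slice.
func expandResourcesV4(configured []interface{}) []string {
	out := make([]string, 0, len(configured))
	for _, v := range configured {
		out = append(out, v.(string))
	}
	return out
}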
Provider v5.X Expand and Flatten Functions
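(Again simplified and not the provider's exact code: in the v5.X pattern the same values are wrapped in a schema.Set, so every element goes through the set's hash function on every flatten, and expand has to pull the list back out of the set.)

```go
// Simplified sketch of the v5.x-era pattern (not the provider's exact code).
package sketch

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// flatten: build a schema.Set, hashing every element via schema.HashString.
func flattenResourcesV5(apiResources []string) *schema.Set {
	items := make([]interface{}, 0, len(apiResources))
	for _, r := range apiResources {
		items = append(items, r)
	}
	// NewSet hashes each item (15,000+ in our case) to build the set.
	return schema.NewSet(schema.HashString, items)
}

// expand: convert the *schema.Set back into a string slice.
func expandResourcesV5(configured *schema.Set) []string {
	list := configured.List()
	out := make([]string, 0, len(list))
	for _, v := range list {
		out = append(out, v.(string))
	}
	return out
}
```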
So now take into account that our Service Perimeter resource has 15,000+ resources within SpecResources, and another 15,000+ resources within StatusResources. Each time these flatten and expand functions are called, they are either converting a list to a Set with hash strings, or taking the Set and returning a list, whereas before they only ever dealt with plain lists and no hashing took place. Add to that the fact that this change to Sets and hash strings was made to each of the resources below, and I believe this is what is causing the substantial time increase, an increase that has made provider v5.X unusable for us. This is making it so we cannot take advantage of new feature releases for the VPC-SC service, such as defining Source -> Access Levels within an Egress Policy.
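(As a rough, hypothetical illustration of the per-refresh work at this scale, not a real benchmark and not the provider's code: building one set for spec resources and one for status resources means hashing every element each time, and this repeats for every field that became a TypeSet.)

```go
// Rough, hypothetical illustration of the per-refresh set construction
// at the scale described above (not a real benchmark).
package main

import (
	"fmt"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

func main() {
	projects := make([]interface{}, 0, 15000)
	for i := 0; i < 15000; i++ {
		projects = append(projects, fmt.Sprintf("projects/%d", i))
	}

	// One set for spec.resources and one for status.resources; each call
	// hashes all 15,000 elements, and similar work repeats for every
	// ingress/egress field that switched from TypeList to TypeSet.
	spec := schema.NewSet(schema.HashString, projects)
	status := schema.NewSet(schema.HashString, projects)

	fmt.Println(spec.Len(), status.Len())
}
```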
Steps to reproduce
tf init -input=false
tf plan -input=false
tf apply -input=false --auto-approve
The part that makes this difficult for others to reproduce is the size of the service perimeter we're dealing with, such as the 15,000+ projects.
Also, the performance decrease is taking place during the "Refreshing State" step for the Service Perimeter resource. So an existing perimeter needs to already be created.
Important Factoids
The reason I believe it's the size of the Perimeter resource is that I upgraded the provider from v4.x to v5.x in a separate environment with no issues. However, that perimeter only contains 45 projects and just a few ingress/egress policies, whereas the one facing performance issues contains 15,000+ projects and quite a few more ingress/egress policies.
Keep in mind, it's the refreshing-state step during the dag/walk which is taking the substantial amount of time. Simply running Terraform Plan with inputs won't replicate this; it only occurs when there is a resource which already exists and contains a large number of attributes, where the provider has to update the existing state.
For the Perimeter facing the performance issues, it's still able to run a terraform plan and apply successfully. It's just that even after the plan and apply have completed, every subsequent run faces the same performance issues, so making one minor change to an Ingress Policy, for example, takes upwards of an hour via our pipeline.
If possible we'd like to avoid having to migrate everything to the new resources which have been created, and continue to utilize the service_perimeter resource for managing the majority of our perimeter's configuration. This is because VPC-SC is a service that can affect an entire environment, and the risk of outages within an environment is rather high when making major changes to the resource.
VPC-SC Quota Limits: If it is the case that the use of TypeSet and hashes is causing this major difference in performance, then it constrains the resource_access_context_manager_service_perimeter greatly in terms of what it is capable of compared to what Google allows for the service. For example, the default quota limit allows for up to 40k projects defined in a service perimeter. If a user goes with the single unified perimeter approach which is recommended by Google, then this Terraform resource becomes almost unusable just based on its performance issues, where it can take upwards of 45 minutes to an hour for changes to be made to the perimeter. With 4.X, that change would only take a couple of minutes to run through all of the pipeline steps.
References
b/368651673