Open johanblumenberg opened 3 months ago
Hi @johanblumenberg!
In order to replicate this scenario, provide the following information:
cloud-run-noauth
terraform apply
and terraform apply -replace google_cloud_run_v2_service.my-service
commands in the order mentioned without making any changes?Remember that sharing accurate and complete information is crucial to finding the bug
Hi @ggtisc
- The code or specifications of the
cloud-run-noauth
It's there in the terraform configuration that I provided:
data "google_iam_policy" "cloud-run-noauth" {
binding {
role = "roles/run.invoker"
members = [
"allUsers",
]
}
}
- Do you make any changes to the terraform code between steps 1, 2 and 3
No, no changes. I can reproduce the bugs following the steps exactly as I wrote in the "Steps to reproduce".
Thanks @johanblumenberg!
This is normal behavior,here's why:
replace flag: The -replace
flag in Terraform instructs it to recreate the specified resource, even if there are no configuration changes detected. In this case, google_cloud_run_v2_service.my-service
is targeted for replacement.
No authentication configuration in the new definition: Your provided Terraform code snippet doesn't include any configuration related to authentication within the google_cloud_run_v2_service.my-service resource
definition.
Data source for no-auth policy: You have a data source named google_iam_policy.cloud-run-noauth
that defines a policy allowing unauthenticated invocations (assigned to "allUsers"). However, this data source is not referenced anywhere within the google_cloud_run_v2_service.my-service resource
.
Default behavior: When Terraform recreates the service (my-service
) without any explicit authentication settings, Cloud Run will most likely apply its default behavior, which is to require authentication
.
In summary:
Running terraform apply -replace google_cloud_run_v2_service.my-service
will trigger a recreation of your Cloud Run service
.
Since there's no authentication configuration within the service definition itself, Cloud Run will likely enforce authentication (not allowing unauthenticated invocations).
Here's what you can do to ensure the desired outcome:
Remove the data source: If you don't intend to allow unauthenticated access, you can remove the data google_iam_policy
cloud-run-noauth
block from your Terraform code.
Manage authentication separately: If you want to manage authentication policies separately, you can define an IAM policy resource that explicitly allows specific users or groups to invoke your Cloud Run service. Then, reference this policy resource within the google_cloud_run_service_iam_policy
block.
Important Note: Allowing unauthenticated access to your Cloud Run service can be a security risk. Make sure you understand the implications before deploying a service with this configuration. You can learn more in terraform registry and each Google Cloud documentation.
Hi @ggtisc
My intent is to allow unauthenticated access.
Since there's no authentication configuration within the service definition itself, Cloud Run will likely enforce authentication (not allowing unauthenticated invocations).
No, there is a configuration in the provided terraform configuration that should allow access to allUsers
. See the google_iam_policy.cloud-run-noauth
data, which is referenced from the google_cloud_run_service_iam_policy.my-service-noauth
resource. It is straight from the example here: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_service_iam#google_cloud_run_service_iam_policy
I don't think that this is normal behaviour.
When I first apply the configuration, in step 1, it is applied exactly as I expect. I can access the service without providing any authentication.
After I replace the cloud run service in step 2, I can no longer access the service. I am no longer allowed unauthenticated access. I would expect everything to work exactly the same after step 2 as after step 1, since I have not changed anything in the configuration in between.
Doing another apply in step 3 is correcting the problem.
To allow unauthenticated access to your Cloud Run service, you'll need to reintroduce the IAM policy that grants the roles/run.invoker
role to allUsers. You just need to run a terraform apply
, not a terraform apply -replace google_cloud_run_v2_service.my-service
The reason why terraform apply -replace google_cloud_run_v2_service.my-service
might change the authentication behavior lies in the -replace
flag itself.
Here's a breakdown of the expected behavior:
terraform apply
: This is the standard Terraform apply command. It analyzes the current state of your resources (managed by Terraform) and compares it with the desired state defined in your Terraform configuration files. If there are any differences, it applies the necessary changes to achieve the desired state.
In your case, since your code already includes the data
source defining the IAM policy for unauthenticated access, a regular terraform apply
will not change the existing configuration. It will see that the desired state (allowing unauthenticated invocations) already matches the current state and avoids any modifications.
terraform apply -replace google_cloud_run_v2_service.my-service
: The -replace
flag instructs Terraform to force a recreation
of the specified resource, even if the configuration seems identical. In your scenario, even though the desired state remains the same (unauthenticated access), Terraform might not recognize the existing IAM policy association due to the forced recreation. This could lead to Cloud Run resetting the authentication settings to its default (which is typically requiring authentication).
Therefore, to maintain the current configuration (allowing unauthenticated access), you should simply use the regular terraform apply
command. This will ensure Terraform only applies changes when necessary and avoids unintended modifications to the authentication settings.
@ggtisc I think that you are missing the point here.
If I just do terraform apply
everything works as expected. No problem there. I know how to set up a cloud run service using terraform, and I know how to configure access for allUsers
.
I'm not trying to change anything. I'm not trying to apply a new configuration. Everything works correctly, and if I do another terraform apply
everything still works as expected.
The problem that I am trying to report is that when doing terraform apply
and also provide -replace
to replace the terraform service, it breaks the service. I am expecting that after doing terraform apply
everything should be correct according to the provided configuration, but in this case doing terraform apply
leaves the service in a broken state, that does not match the configuration.
The reason for wanting to perform a replace of the service doesn't matter. Remember that this is a minimal reproducible example to demonstrate the bug. Let's say I want to do a replace in order to make sure all instances of the service are restarted.
Please try for yourself to apply the configuration that I have provided, to see the problem.
You need to have a GCP project where you can deploy a cloud run service. When the cloud run service is deployed you need to get the service URL, which you can get from the google cloud console. Run these commands:
terraform apply -var project=$GOOGLE_PROJECT_ID
curl -v $CLOUD_RUN_SERVICE_URL
terraform apply -var project=$GOOGLE_PROJECT_ID -replace google_cloud_run_v2_service.my-service
curl -v $CLOUD_RUN_SERVICE_URL
You will see that the first curl
command after the first apply
command works every time. You will get a 200 response from the service, even if you provide no credentials.
The second curl
command will fail about 50% of the time, returning a 403 response. This means that the service is broken, and it got broken when running the second apply
command.
The problem is that using -replace
when running apply
breaks the service.
After some tries the result was successful without errors making attempts only with terraform apply
and then with terraform apply -replace
expecting the resources to be broken and not just replaced. As the user is reporting this is intermittent and there is no way to know when it is going to happen so due to the time that it has opened and the insistence I'm forwarding it for a deep review at code level.
The IAM policy is treated like a child resource of the service so it's cleaned up when the parent service is deleted. I don't know if there's any way to force Terraform to recreate the IAM policy any time it recreates the service.
For this and other reasons, deleting and recreating a Cloud Run Service can be a bit problematic. It would also cause you to lose previous revisions, for example, which will break any traffic configuration besides "100% to latest". We're adding deletion protection to Cloud Run services for the v6 provider release because of this.
Rather than deleting and recreating the Service, you could add a revision label with a random value to force Terraform to see a change. Unlike with deleting the Service, this would transition everything over to the new instances with 0 downtime.
Hi @bskaplan
If the google_cloud_run_v2_service
resource does not support replacing, then I think that it should be very clearly documented.
I think that most users will expect replacing a resource to just work, leaving the resource in a consistent state, without first consulting the documentation, which makes it very likely that most users will not even read this documentation. The best thing then would perhaps be to fail the apply
command with an error message, saying something like "replacing a cloud run service will leave the service in an inconsistent state, and is not a supported operation". This would prevent services from breaking when the user did not expect it.
It is weird though, because the 4.83.0 version of the provider has no problem with replacing a cloud run service.
It would also cause you to lose previous revisions, for example, which will break any traffic configuration besides "100% to latest".
I don't see this as a problem. If I replace a service I would expect all revisions to be removed, and all traffic to go to latest.
I don't know if there's any way to force Terraform to recreate the IAM policy any time it recreates the service.
A workaround for me would be to replace the IAM policy also, in my apply
command.
But this only works when I already know that the google_cloud_run_v2_service
has a problem with replace. If I don't know, that means I will first break my system once, then spend some time to figure out what the problem is, possibly by reading this thread, and then I can apply the workaround for next time.
It would also cause you to lose previous revisions, for example, which will break any traffic configuration besides "100% to latest".
I don't see this as a problem. If I replace a service I would expect all revisions to be removed, and all traffic to go to latest.
Sure, but my point is that if you had a config of
resource "google_cloud_run_v2_service" "my-service" {
project = var.project
name = "my-service"
location = "us-central1"
template {
containers {
image = "us-docker.pkg.dev/cloudrun/container/hello"
}
}
traffic {
type = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"
revision = "my-service-00012-abc"
percent = 100
}
}
then deleting and re-creating the service would fail the traffic section lists a revision that no longer exists.
The best thing then would perhaps be to fail the apply command with an error message, saying something like "replacing a cloud run service will leave the service in an inconsistent state, and is not a supported operation". This would prevent services from breaking when the user did not expect it.
This is what the deletion_protection field we're adding to v6 does. But this is considered a breaking change so we cannot add it until the major version update.
The best thing then would perhaps be to fail the apply command with an error message, saying something like "replacing a cloud run service will leave the service in an inconsistent state, and is not a supported operation". This would prevent services from breaking when the user did not expect it.
This is what the deletion_protection field we're adding to v6 does. But this is considered a breaking change so we cannot add it until the major version update.
No, it's not the same thing, is it? It would still mean that I have to know that -replace
doesn't work with this resource. If I don't know about that limitation, then I won't add deletion_protection
, and I might try to do a replace operation thinking it will work. But then I will actually break my service.
If -replace
is breaking my service then I would much rather have the apply
cancelled and an error message informing me that using -replace
breaks my service. I would not like the apply
command to complete successfully, but silently put my service in an inconsistent state, not matching the configuration.
The bug here is that doing -replace
on this resource leaves the service in an inconsistent state. The state the service is in is not matching the configuration, although terraform
exits normally, claiming that it is.
What makes it even worse is that the outcome of a replace
operation is random. Sometimes it keeps the IAM policy, but sometimes it does not.
So the options here, as I see it, are either to fix the replace
operation to do what people expect it to do, e.g. not putting the service in a random and inconsistent state, or to accept the fact that replace
is not supported. I don't think it is an option to just leave the service in an inconsistent state.
I think it is strange if it is impossible to fix this bug, because it works in v4 of google_cloud_run_v2_service
and to my knowledge there are no other resources that have this limitation. There are quite a lot of google resources that have an IAM policy attached to them that should work in a similar manner.
traffic { type = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION" revision = "my-service-00012-abc" percent = 100 }
If I was using this type of configuration I would not expect -replace
to work on this service. Because I would know that the configuration refers to a previous apply result, and that -replace
would destroy that previous apply result.
I don't think that this example is relevant in this thread? It just shows an example where using -replace
would break your system because your configuration is set up in a way that doesn't allow for it to be replaced. It's not a reason for why -replace
should always break the service, even if it is configured in a way that should allow for replacing it.
Community Note
Terraform Version & Provider Version(s)
Terraform v1.9.2 on Mac M1, Linux x86_64
Affected Resource(s)
google_cloud_run_v2_service, google_cloud_run_service
Terraform Configuration
Debug Output
No response
Expected Behavior
Step 1: Initial apply, nothing strange here. Step 2: This should replace the resource
google_cloud_run_v2_service.my-service
Step 3: This should report that everything is up to date.Actual Behavior
Step 2: The cloud run service is not accessible. Looking in GCP Cloud Console, it is changed from "Allow unauthenticated invocations" to "Require authentication" under the Security tab. Step 3: Reports that the
google_cloud_run_service_iam_policy
resource must be created. Once it is created, the cloud run service is accessible again.When repeating step 2 and 3 a few times, the access problem appears maybe 50% of the times. But the problem that step 3 doesn't report that everything is up to date happens 100% of the time.
This bug happens both in
google_cloud_run_service
andgoogle_cloud_run_v2_service
.It was working using the google provider version 4.83.0.
Steps to reproduce
terraform apply
terraform apply -replace google_cloud_run_v2_service.my-service
terraform apply
Important Factoids
No response
References
No response
b/359335569