Replace of google_cloud_run_v2_service breaks authorization setting

hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform

https://registry.terraform.io/providers/hashicorp/google/latest/docs

Mozilla Public License 2.0

2.33k stars 1.73k forks source link

Replace of google_cloud_run_v2_service breaks authorization setting #18831

Open johanblumenberg opened 3 months ago

johanblumenberg commented 3 months ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.
If an issue is assigned to a user, that user is claiming responsibility for the issue.
Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.9.2 on Mac M1, Linux x86_64

provider registry.terraform.io/hashicorp/google v5.38.0

Affected Resource(s)

google_cloud_run_v2_service, google_cloud_run_service

Terraform Configuration

variable "project" {
  type = string
}

data "google_iam_policy" "cloud-run-noauth" {
  binding {
    role = "roles/run.invoker"
    members = [
      "allUsers",
    ]
  }
}

resource "google_cloud_run_v2_service" "my-service" {
  project  = var.project
  name     = "my-service"
  location = "us-central1"

  template {
    containers {
      image = "us-docker.pkg.dev/cloudrun/container/hello"
    }
  }
}

resource "google_cloud_run_service_iam_policy" "my-service-noauth" {
  location    = google_cloud_run_v2_service.my-service.location
  project     = google_cloud_run_v2_service.my-service.project
  service     = google_cloud_run_v2_service.my-service.name
  policy_data = data.google_iam_policy.cloud-run-noauth.policy_data
}

Debug Output

No response

Expected Behavior

Step 1: Initial apply, nothing strange here. Step 2: This should replace the resource google_cloud_run_v2_service.my-service Step 3: This should report that everything is up to date.

Actual Behavior

Step 2: The cloud run service is not accessible. Looking in GCP Cloud Console, it is changed from "Allow unauthenticated invocations" to "Require authentication" under the Security tab. Step 3: Reports that the google_cloud_run_service_iam_policy resource must be created. Once it is created, the cloud run service is accessible again.

When repeating step 2 and 3 a few times, the access problem appears maybe 50% of the times. But the problem that step 3 doesn't report that everything is up to date happens 100% of the time.

This bug happens both in google_cloud_run_service and google_cloud_run_v2_service.

It was working using the google provider version 4.83.0.

Steps to reproduce

terraform apply
terraform apply -replace google_cloud_run_v2_service.my-service
terraform apply

Important Factoids

No response

References

No response

b/359335569

ggtisc commented 2 months ago

Hi @johanblumenberg!

In order to replicate this scenario, provide the following information:

The code or specifications of the cloud-run-noauth
Do you make any changes to the terraform code between steps 1, 2 and 3 or do you just run the terraform apply and terraform apply -replace google_cloud_run_v2_service.my-service commands in the order mentioned without making any changes?

Remember that sharing accurate and complete information is crucial to finding the bug

johanblumenberg commented 2 months ago

Hi @ggtisc

The code or specifications of the cloud-run-noauth

It's there in the terraform configuration that I provided:

data "google_iam_policy" "cloud-run-noauth" {
  binding {
    role = "roles/run.invoker"
    members = [
      "allUsers",
    ]
  }
}

Do you make any changes to the terraform code between steps 1, 2 and 3

No, no changes. I can reproduce the bugs following the steps exactly as I wrote in the "Steps to reproduce".

ggtisc commented 2 months ago

Thanks @johanblumenberg!

This is normal behavior,here's why:

replace flag: The -replace flag in Terraform instructs it to recreate the specified resource, even if there are no configuration changes detected. In this case, google_cloud_run_v2_service.my-service is targeted for replacement.
No authentication configuration in the new definition: Your provided Terraform code snippet doesn't include any configuration related to authentication within the google_cloud_run_v2_service.my-service resource definition.
Data source for no-auth policy: You have a data source named google_iam_policy.cloud-run-noauth that defines a policy allowing unauthenticated invocations (assigned to "allUsers"). However, this data source is not referenced anywhere within the google_cloud_run_v2_service.my-service resource.
Default behavior: When Terraform recreates the service (my-service) without any explicit authentication settings, Cloud Run will most likely apply its default behavior, which is to require authentication.

In summary:

Running terraform apply -replace google_cloud_run_v2_service.my-service will trigger a recreation of your Cloud Run service.
Since there's no authentication configuration within the service definition itself, Cloud Run will likely enforce authentication (not allowing unauthenticated invocations).

Here's what you can do to ensure the desired outcome:

Remove the data source: If you don't intend to allow unauthenticated access, you can remove the data google_iam_policy cloud-run-noauth block from your Terraform code.
Manage authentication separately: If you want to manage authentication policies separately, you can define an IAM policy resource that explicitly allows specific users or groups to invoke your Cloud Run service. Then, reference this policy resource within the google_cloud_run_service_iam_policy block.

Important Note: Allowing unauthenticated access to your Cloud Run service can be a security risk. Make sure you understand the implications before deploying a service with this configuration. You can learn more in terraform registry and each Google Cloud documentation.

johanblumenberg commented 2 months ago

Hi @ggtisc

My intent is to allow unauthenticated access.

Since there's no authentication configuration within the service definition itself, Cloud Run will likely enforce authentication (not allowing unauthenticated invocations).

No, there is a configuration in the provided terraform configuration that should allow access to allUsers. See the google_iam_policy.cloud-run-noauth data, which is referenced from the google_cloud_run_service_iam_policy.my-service-noauth resource. It is straight from the example here: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/cloud_run_service_iam#google_cloud_run_service_iam_policy

johanblumenberg commented 2 months ago

I don't think that this is normal behaviour.

When I first apply the configuration, in step 1, it is applied exactly as I expect. I can access the service without providing any authentication.

After I replace the cloud run service in step 2, I can no longer access the service. I am no longer allowed unauthenticated access. I would expect everything to work exactly the same after step 2 as after step 1, since I have not changed anything in the configuration in between.

Doing another apply in step 3 is correcting the problem.

ggtisc commented 2 months ago

To allow unauthenticated access to your Cloud Run service, you'll need to reintroduce the IAM policy that grants the roles/run.invoker role to allUsers. You just need to run a terraform apply, not a terraform apply -replace google_cloud_run_v2_service.my-service

The reason why terraform apply -replace google_cloud_run_v2_service.my-service might change the authentication behavior lies in the -replace flag itself.

Here's a breakdown of the expected behavior:

terraform apply: This is the standard Terraform apply command. It analyzes the current state of your resources (managed by Terraform) and compares it with the desired state defined in your Terraform configuration files. If there are any differences, it applies the necessary changes to achieve the desired state.

In your case, since your code already includes the data source defining the IAM policy for unauthenticated access, a regular terraform apply will not change the existing configuration. It will see that the desired state (allowing unauthenticated invocations) already matches the current state and avoids any modifications.

terraform apply -replace google_cloud_run_v2_service.my-service: The -replace flag instructs Terraform to force a recreation of the specified resource, even if the configuration seems identical. In your scenario, even though the desired state remains the same (unauthenticated access), Terraform might not recognize the existing IAM policy association due to the forced recreation. This could lead to Cloud Run resetting the authentication settings to its default (which is typically requiring authentication).

Therefore, to maintain the current configuration (allowing unauthenticated access), you should simply use the regular terraform apply command. This will ensure Terraform only applies changes when necessary and avoids unintended modifications to the authentication settings.

johanblumenberg commented 2 months ago

@ggtisc I think that you are missing the point here.

If I just do terraform apply everything works as expected. No problem there. I know how to set up a cloud run service using terraform, and I know how to configure access for allUsers.

I'm not trying to change anything. I'm not trying to apply a new configuration. Everything works correctly, and if I do another terraform apply everything still works as expected.

The problem that I am trying to report is that when doing terraform apply and also provide -replace to replace the terraform service, it breaks the service. I am expecting that after doing terraform apply everything should be correct according to the provided configuration, but in this case doing terraform apply leaves the service in a broken state, that does not match the configuration.

The reason for wanting to perform a replace of the service doesn't matter. Remember that this is a minimal reproducible example to demonstrate the bug. Let's say I want to do a replace in order to make sure all instances of the service are restarted.

Please try for yourself to apply the configuration that I have provided, to see the problem.

You need to have a GCP project where you can deploy a cloud run service. When the cloud run service is deployed you need to get the service URL, which you can get from the google cloud console. Run these commands:

terraform apply -var project=$GOOGLE_PROJECT_ID
curl -v $CLOUD_RUN_SERVICE_URL
terraform apply -var project=$GOOGLE_PROJECT_ID -replace google_cloud_run_v2_service.my-service
curl -v $CLOUD_RUN_SERVICE_URL

You will see that the first curl command after the first apply command works every time. You will get a 200 response from the service, even if you provide no credentials. The second curl command will fail about 50% of the time, returning a 403 response. This means that the service is broken, and it got broken when running the second apply command.

johanblumenberg commented 2 months ago

The problem is that using -replace when running apply breaks the service.

ggtisc commented 2 months ago

After some tries the result was successful without errors making attempts only with terraform apply and then with terraform apply -replace expecting the resources to be broken and not just replaced. As the user is reporting this is intermittent and there is no way to know when it is going to happen so due to the time that it has opened and the insistence I'm forwarding it for a deep review at code level.

bskaplan commented 2 months ago

The IAM policy is treated like a child resource of the service so it's cleaned up when the parent service is deleted. I don't know if there's any way to force Terraform to recreate the IAM policy any time it recreates the service.

For this and other reasons, deleting and recreating a Cloud Run Service can be a bit problematic. It would also cause you to lose previous revisions, for example, which will break any traffic configuration besides "100% to latest". We're adding deletion protection to Cloud Run services for the v6 provider release because of this.

Rather than deleting and recreating the Service, you could add a revision label with a random value to force Terraform to see a change. Unlike with deleting the Service, this would transition everything over to the new instances with 0 downtime.

johanblumenberg commented 2 months ago

Hi @bskaplan

If the google_cloud_run_v2_service resource does not support replacing, then I think that it should be very clearly documented.

I think that most users will expect replacing a resource to just work, leaving the resource in a consistent state, without first consulting the documentation, which makes it very likely that most users will not even read this documentation. The best thing then would perhaps be to fail the apply command with an error message, saying something like "replacing a cloud run service will leave the service in an inconsistent state, and is not a supported operation". This would prevent services from breaking when the user did not expect it.

It is weird though, because the 4.83.0 version of the provider has no problem with replacing a cloud run service.

johanblumenberg commented 2 months ago

It would also cause you to lose previous revisions, for example, which will break any traffic configuration besides "100% to latest".

I don't see this as a problem. If I replace a service I would expect all revisions to be removed, and all traffic to go to latest.

johanblumenberg commented 2 months ago

I don't know if there's any way to force Terraform to recreate the IAM policy any time it recreates the service.

A workaround for me would be to replace the IAM policy also, in my apply command.

But this only works when I already know that the google_cloud_run_v2_service has a problem with replace. If I don't know, that means I will first break my system once, then spend some time to figure out what the problem is, possibly by reading this thread, and then I can apply the workaround for next time.

bskaplan commented 2 months ago

It would also cause you to lose previous revisions, for example, which will break any traffic configuration besides "100% to latest".

I don't see this as a problem. If I replace a service I would expect all revisions to be removed, and all traffic to go to latest.

Sure, but my point is that if you had a config of

resource "google_cloud_run_v2_service" "my-service" {
  project  = var.project
  name     = "my-service"
  location = "us-central1"

  template {
    containers {
      image = "us-docker.pkg.dev/cloudrun/container/hello"
    }
  }
  traffic {
    type = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"
    revision = "my-service-00012-abc"
    percent = 100
  }
}

then deleting and re-creating the service would fail the traffic section lists a revision that no longer exists.

The best thing then would perhaps be to fail the apply command with an error message, saying something like "replacing a cloud run service will leave the service in an inconsistent state, and is not a supported operation". This would prevent services from breaking when the user did not expect it.

This is what the deletion_protection field we're adding to v6 does. But this is considered a breaking change so we cannot add it until the major version update.

johanblumenberg commented 2 months ago

The best thing then would perhaps be to fail the apply command with an error message, saying something like "replacing a cloud run service will leave the service in an inconsistent state, and is not a supported operation". This would prevent services from breaking when the user did not expect it.

This is what the deletion_protection field we're adding to v6 does. But this is considered a breaking change so we cannot add it until the major version update.

No, it's not the same thing, is it? It would still mean that I have to know that -replace doesn't work with this resource. If I don't know about that limitation, then I won't add deletion_protection, and I might try to do a replace operation thinking it will work. But then I will actually break my service.

If -replace is breaking my service then I would much rather have the apply cancelled and an error message informing me that using -replace breaks my service. I would not like the apply command to complete successfully, but silently put my service in an inconsistent state, not matching the configuration.

johanblumenberg commented 2 months ago

The bug here is that doing -replace on this resource leaves the service in an inconsistent state. The state the service is in is not matching the configuration, although terraform exits normally, claiming that it is.

What makes it even worse is that the outcome of a replace operation is random. Sometimes it keeps the IAM policy, but sometimes it does not.

So the options here, as I see it, are either to fix the replace operation to do what people expect it to do, e.g. not putting the service in a random and inconsistent state, or to accept the fact that replace is not supported. I don't think it is an option to just leave the service in an inconsistent state.

I think it is strange if it is impossible to fix this bug, because it works in v4 of google_cloud_run_v2_service and to my knowledge there are no other resources that have this limitation. There are quite a lot of google resources that have an IAM policy attached to them that should work in a similar manner.

johanblumenberg commented 2 months ago

  traffic {
    type = "TRAFFIC_TARGET_ALLOCATION_TYPE_REVISION"
    revision = "my-service-00012-abc"
    percent = 100
  }

If I was using this type of configuration I would not expect -replace to work on this service. Because I would know that the configuration refers to a previous apply result, and that -replace would destroy that previous apply result.

I don't think that this example is relevant in this thread? It just shows an example where using -replace would break your system because your configuration is set up in a way that doesn't allow for it to be replaced. It's not a reason for why -replace should always break the service, even if it is configured in a way that should allow for replacing it.