Usability improvements for *_iam_policy and *_iam_binding resources

hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform

https://registry.terraform.io/providers/hashicorp/google/latest/docs

Mozilla Public License 2.0

2.28k stars 1.72k forks source link

Usability improvements for _iam_policy and _iam_binding resources #8354

Open ericnorris opened 3 years ago

ericnorris commented 3 years ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment. If the issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If the issue is assigned to a user, that user is claiming responsibility for the issue. If the issue is assigned to "hashibot", a community member has claimed the issue already.

Description

I'm sure you know by now there is a decent amount of care required when using the *_iam_policy and *_iam_binding versions of IAM resources. There are a number of "be careful!" and "note" warnings in the resources that outline some of the potential pitfalls, but there are hidden dangers as well. For example, using the google_project_iam_policy resource may inadvertently remove Google's service agents' (https://cloud.google.com/iam/docs/service-agents) IAM roles from the project. Or, the dangers of using google_storage_bucket_iam_policy and google_storage_bucket_iam_binding, which may remove the default IAM roles granted to projectViewers:, projectEditors:, and projectOwners: of the containing project.

The largest issue I encounter with people running into the above situations is that the initial terraform plan does not show that anything is being removed. While the documentation for google_project_iam_policy notes that it's best to terraform import the resource beforehand, this is in fact applicable to all *_iam_policy and *_iam_binding resources. Unfortunately this is tedious, potentially forgotten, and not something that you can abstract away in a Terraform module.

On the other hand, the more authoritative forms of IAM resources are beneficial for ensuring repeatability, discoverability, and security, so I prefer them over the simpler *_iam_member resources.

I'd like to suggest that on creation the *_iam_policy and *_iam_binding resources act as strictly additive, similar to the *_iam_member resources, but remain authoritative on update. Doing this will provide feedback in the next terraform plan of IAM principles that may have accidentally been left out and would have potentially been removed.

If this is undesirable, an alternative may be to fail hard on an existing IAM policy or binding and requiring that the resource be imported, but I prefer the above suggestion.

Finally, I'd like to propose that the specific resource implementations be more opinionated in how they operate. Authoritative google_project_iam_* resources could (optionally, of course) filter out https://cloud.google.com/iam/docs/service-agents from the diffs, as there is likely very little reason to manipulate their permissions. Similarly the authoritative google_storage_bucket_* resources could, again optionally, preserve the bindings that are granted to projectViewers, etc.

This may be a harder sell than the first suggestion, but I think anything that Google can do to reduce the burden on developers understanding the implicit IAM grants that their products get would be a huge benefit to many. I could potentially see Google exposing API endpoints of "expected bindings" or similar to remove the burden from the provider itself, but that is likely beyond the scope of this issue.

New or Affected Resource(s)

google_project_iam_policy
google_project_iam_binding
google_storage_bucket_iam_policy
google_storage_bucket_iam_binding
...

Potential Terraform Configuration

No changes, though this could be a new resource name or parameter.

References

Not exhaustive, and likely has some false postives:

https://github.com/hashicorp/terraform-provider-google/issues?q=is%3Aissue+%22authoritative%22+%22IAM%22

melinath commented 3 years ago

Hi, Eric - thanks for the detailed issue! Unfortunately, there would be some significant problems with making the provider be strictly additive on create. From what I understand, that's how the provider used to work, and it caused a lot of confusion for folks. A cleaner solution would require changes to how Terraform providers work more generally; we will make the case to Hashicorp to provide an API that allows us to better support this type of resource.

ericnorris commented 3 years ago

Hey @melinath, thanks for following up!

Would it be possible for you to link some more information on what kind of problems folks had ran into in the past? If I understand correctly the google_project_iam_policy resource used to have an authoritative parameter that you could configure, which understandably could cause confusion for people.

In this case, however, I am not suggesting we make it configurable. It should only be non-authoritative during creation, and revert to the current behavior otherwise. This means that from a purely user-interacting-with-Terraform standpoint there is no difference during the initial apply, but subsequent plans would show a diff - and rightfully so, as it will provide information to the user about IAM principles that are not present in their Terraform configuration.

I'm not sure that I agree that it would be more confusing this way; from internal usage at my company it is far more confusing that people have important IAM bindings blown away without any warning, compared to any confusion that might arise from seeing a diff on your next plan. The issues caused by removed IAM bindings are incredibly hard to debug, as it surfaces to users as their subsequent Google API calls failing for reasons that they do not understand. It is only after digging through Cloud Audit logs, or by having being bit by this in the past, that you realize that the user accidentally removed IAM permissions from a critical Google-managed service account.

My fear with trying to get this into Terraform core is that it would fundamentally be a new Terraform paradigm, might not be accepted by upstream, and even if so, would be out of reach of users at my company for quite some time. Making a change to the Google provider, however, is something that could be done quickly, is relatively simple, and (as far as I can tell) is safe.

That being said, an acceptable Terraform core solution could be for resources to optionally auto-import themselves. There is only ever one google_project_iam_policy per project, so if the resource were to automatically import itself and report the changes during the first terraform plan / terraform apply cycle I think my issue would be solved.

upodroid commented 3 years ago

Authoritative google_project_iam_* resources could (optionally, of course) filter out https://cloud.google.com/iam/docs/service-agents from the diffs, as there is likely very little reason to manipulate their permissions.

FYI this is not true. Things like Shared VPC/cross project service-accounts/services routinely require you to modify IAM roles granted to Google Service Agents.

I understand the frustations with google_project_iam_policy resource and I have personally seen broken projects because people never read the warnings plastered on the docs. as melinath said there are no easy solutions right now.

This module is very helpful https://github.com/terraform-google-modules/terraform-google-iam/tree/master/modules/projects_iam

Acts like google_project_iam_policy for the roles defined in the module. Under the hood, it creates google_project_iam_binding for the specified roles. The problem is still present when new role bindings are declared in the module and not reconciled in terraform, but the blast radius is much lower.

ericnorris commented 3 years ago

FYI this is not true. Things like Shared VPC/cross project service-accounts/services routinely require you to modify IAM roles granted to Google Service Agents.

@upodroid good point! You can scratch that from my original proposal.

as melinath said there are no easy solutions right now.

Obviously I'm biased here, but I believe my initial suggestion of "additive on create" is easy to use and easy to implement. In fact, the logic is present in resource_iam_binding.go and resource_iam_member.go, so it would only need to be refactored a bit.

I fail to see what drawbacks there are to this approach - yes, there may be some confusion for folks who see IAM principles show up in a diff in their next plan, but that confusion is much simpler compared to "these IAM principles were removed and no Terraform logs will show that", which is the current case.

This module is very helpful...

Thanks for that link, wasn't aware. That being said, my company often uses roles/editor, which conflicts with a number of Google Service Agents (e.g. Container Registry, if I recall correctly). It also doesn't address one of my earlier points with google_storage_bucket_iam_policy and google_storage_bucket_iam_binding, that binding roles/storage.legacyBucketOwner will remove implicit bindings granted by Google Cloud Storage (I realize this module is meant for projects, not Cloud Storage buckets).

I agree that it reduces the blast radius, as conflicts would be limited to only roles that a human would conceivably want to use (e.g. not the roles/*serviceAgent roles), but unless we banned the use of these conflicting roles and kept track of any new conflicts with roles that Google service agents get or are implicitly bound on a product, we'd continue to see problems.

Finally, I'd like to add that I was able to achieve something close to a "safe" google_project_iam_policy resource by scraping the gcloud services list --format json output. This command will list services and their service accounts, which for some reason is not documented in the Service Usage REST API. By requiring the user to pass in a list of services they have enabled and matching it with this JSON, I am able to merge passed in role bindings with the ones that Google expects to have. Ideally there could be a google_project_services data source that would allow this to happen automatically.

I prefer my original suggestion as the above feels a lot more complex than what a module should be taking on, and it apparently relies on some undocumented API output, but I would be open to discussing other options too.

melinath commented 3 years ago

@ericnorris yeah, the issue with making it non-authoritative is exactly what you said: the initial apply goes through and then the following apply has a diff. Although this doesn't bother everyone, there are a significant number of users for whom it's a large concern, either because they are new to terraform and don't understand what's happening, or because they want terraform to make their state be exactly what they expect after a single apply and don't want to have to determine whether some resources need to be updated again.

TLDR the solution we want to propose to hashicorp is exactly what you suggested at the end of your previous message: being able to specify that a particular resource should try an auto-import if it doesn't exist in state yet. :-)

ericnorris commented 3 years ago

Understood, and it may end up being that we fundamentally disagree about the following point:

Although this doesn't bother everyone, there are a significant number of users for whom it's a large concern, either because they are new to terraform and don't understand what's happening, or because they want terraform to make their state be exactly what they expect after a single apply and don't want to have to determine whether some resources need to be updated again.

I want to emphasize the "are new to terraform and don't understand what's happening" group. With additive on create, they will at least be told what is happening. With how it is now, they are not; the Terraform logs will show nothing out of the ordinary during the plan and apply. These users are also the least equipped to be able to then trudge through Google documentation and Cloud Audit logs to determine what roles are missing after they've removed them silently. If they came to me (or you) with confusion over an unexpected diff, it would be far easier to explain what is happening.

I cannot see how the current state of affairs is better than upsetting the other group of folks, the "they want terraform to make their state be exactly what they expect after a single apply" group. This already is not the case in a number of situations - in fact, applying a google_project_iam_binding and a google_project_service resource that implicitly modifies that binding in the same run will also result in a diff on the next plan and apply.

More concretely:

resource "google_project_iam_binding" "editor" {
  role = "roles/editor"

  members = [
    # ...
  ]
}

resource "google_project_service" "containerregistry" {
  service = "containerregistry.googleapis.com"
}

...without knowing ahead of time and doing something with depends_on, will cause a service account to show up in a later diff. This is just one example, there are other situations in which you might need to run Terraform a couple times in order for things to become eventually consistent.

Even if the "auto-import" feature became a thing, again, these users will almost certainly always have to deal with the fact that they are working with abstractions over abstractions, and eventual consistency is the only thing they can count on.

At a minimum I would be happy seeing this as either a separate resource, or a parameter on the existing resource, if you wanted to preserve behavior for the latter group while allowing my company to use something that I believe is friendlier to the former group.

All that said though, I appreciate your time discussing this!

rileykarson commented 3 years ago

I wanted to weigh in with a couple of points of my own here:

Prior to this change, we've considered it a bug in almost all cases if the provider returns a diff after a terraform apply. It's fairly common for users to run Terraform in automation, and those systems would need to be adapted to run terraform apply twice. In my opinion that makes this a major breaking change, as it significantly impacts the behaviour of the resource.
There is definitely a space to improve the behaviour here, and has been for years! However I'm not aware of anything in the provider's toolbox that makes it possible today short of this proposal or the one we've taken. I can definitely see the merits of this proposed approach- we actually had this behaviour during the 1.X releases. Even though it's much safer in theory, it led to confusion when used in multi-operator scenarios. During the 2.0.0 release we felt that keeping the resource authoritative was the most consistent UX we could provide, particularly because terraform plan accurately reports the end state of the policy/binding.
I've seen some discussion around guarding the behaviour behind a boolean. That isn't as good of a solution as it feels like it should be in many cases, unfortunately. Not only is discovery difficult with resources often becoming over-configurable, I've found that there are a lot of problems introduced as users try to transition behaviour-modifying values from one state to another. This case is probably more restrained, but opt-in behaviour here wouldn't help most users keep their policies safe.

I think that carrying forward a discussion to HashiCorp about a pattern for "import on create" for a subset of resources (I personally refer to it as singleton resources, as there's exactly one and they always exist) would be the best solution here. It could be introduced in a minor version of the provider as it would be consistent with the current terraform apply behaviour, and would just amend the "before" state of terraform plan.

ericnorris commented 3 years ago

Thanks @rileykarson!

Prior to this change, we've considered it a bug in almost...

I understand why this would be the ideal, but I mentioned a case where it happens now, and it would continue to happen even with an auto-import functionality. I am sure you are well aware, but there will almost certainly always be some resources that cause delayed effects that cannot be expressed in a single apply.

As another (non-IAM) example, consider:

resource "google_project_service" "appengine" {
  service = "appengine.googleapis.com"
}

resource "google_app_engine_application" "needed_for_cloud_scheduler" {
  ...
}

resource "google_cloud_scheduler_job" "a_job" {
  ...
}

...even with depends_on, there is a bit of a race condition during the apply that will almost certainly fail the first time you try to run it. The only answer is to run plan and apply again.

There is definitely a space to improve the behaviour here...

@dossett is actually a member of my company :) I believe the confusion you are referring to here in the linked issue comes with competing IAM resources, which I think will continue to be a problem regardless of my proposal or using auto-import. That being said, if iam_binding followed my proposed behavior, @dossett would have seen a diff later, rather than having the bindings be removed in step 3 (edit: I originally stated he would have seen a diff in step 3). His confusion appears to have again come from the fact that Terraform doesn't tell you when bindings are being removed during creation.

I pointed out that there will still be problems over competing IAM resources, but I believe that it's far easier for me to establish training that explain to folks "pay attention when bindings are being removed" than "don't ever use this resource without running import". I could codify CI checks (e.g. Sentinel) that warn folks when removing IAM principles, but I can't do that if Terraform reports no diff at all.

...we actually had this behaviour during the 1.X releases

My understanding (and I could be wrong!) is that the behavior you are referring to was that it could optionally be authoritative or additive with the authoritative parameter, is that correct? I would like to emphasize that what I am proposing is not that, and I wouldn't want that either, as I can see how that would be confusing. I only think this should be additive on create, and remain authoritative otherwise.

I've seen some discussion around guarding the behaviour behind a boolean...

I understand and agree, but I was mainly proposing this as solution that would allow for preserving behavior for what I believe is a small minority of users, the ones that really want their IAM principles blown away on create. I'd prefer my suggested behavior to be the only option.

I really do think that this would be a simple change that would net a huge win for the people I work with and our ability to use the authoritative versions of resources. If Terraform ever supported auto-import, this functionality easily could be reverted, but at least we'd have something in the interim to help prevent the kinds of issues I frequently see.

Otherwise our only recourse would be to potentially "ban" anything other than iam_member, which I think would be a big loss.

Anyhow, I really do appreciate the discussion and your time. Feel free to close this issue if you'd like, and tag me in any public GitHub issues with Hashicorp so I could follow along.

rileykarson commented 3 years ago

I understand why this would be the ideal, but I mentioned a case where it happens now, and it would continue to happen even with an auto-import functionality. I am sure you are well aware, but there will almost certainly always be some resources that cause delayed effects that cannot be expressed in a single apply.

Definitely! The difference here is introducing that behaviour intentionally, which I'd like to avoid.

dossett is actually a member of my company :). the confusion ... comes with competing IAM resources

Acknowledged! Misread on my part, you're right. I've seen other cases similar sentiment to this section in isolation in the past & I think it stands:

The initial apply doesn't given any indication of a problem but a second apply with no change made at all will suddenly remove lots of permissions. In our experience with this problem, the second apply will be made by a different person trying to make an unrelated change and they will not have any context for terraform's desire to remove these permissions.

My understanding (and I could be wrong!) is that the behavior you are referring to was that it could optionally be authoritative or additive with the authoritative parameter, is that correct?

That's how _policy resources behaved with authoritative, but I believe _binding actually had your described behaviour (see https://github.com/hashicorp/terraform-provider-google/issues/2449).

I think the summary of my current take is:

This is definitely a weak point in the provider and this proposal would help (but with some tradeoffs!)
Implementing this would be a breaking change (as it would now require a second terraform apply), meaning if it gets made we'd only make it as part of 4.0.0
A breaking change across all of the IAM resources (excl. _member) would be fairly extreme, and we're unlikely to make it as a result. We felt that the reaction to 3.0.0 indicated it was too large of an upgrade.
Auto-import behaviour would get us the intended behaviour (plan accurately listing the removals it will perform) without tradeoffs and would maintain compatibility

I'm assigning to the 4.0.0 to reevaluate when we get to major version planning.

melinath commented 3 years ago

https://github.com/hashicorp/terraform/issues/19017 is tracking work to implement some kind of read-before-create to solve exactly this kind of situation; anything on our end would be blocked by that work.

kaizenlabs commented 2 years ago

Can somebody help with this? I'm trying to create a service account and assign it a role, and it succeeds but it deletes all the default service accounts, including the Google API's Service Agent. How do we preservice the service agent? (@cloudservices.gserviceaccount.com)

rileykarson commented 2 years ago

@kaizenlabs: Are you using an iam_member resource, or iam_policy?

lukeschlather commented 2 years ago

One of my coworkers just reached out to me on Slack and said:

I'm not sure how we can enforce this but I think we should basically prevent anyone from ever using google_project_iam_binding , and instead people should use google_project_iam_member.

I hesitate to say this but _binding seems simply like a misfeature. Google sometimes adds permissions independently of terraform. It's never ok for terraform to overwrite bindings it didn't add itself. Is there some simple way to make a resource type totally banned in our project?

rileykarson commented 2 years ago

We could probably use heftier docs somewhere than an individual reference page to cover how to use the IAM resources- we expect _member to be preferred for most users (and, in fact, many of those users will consider _binding or _policy misfeatures) while _binding and _policy have more limited usage. For example, for some users,_binding's primary purpose is explicitly controlling the use of high-power roles like roles/owner or roles/editor (or a custom equivalent)

You'd want to use a policy validation layer like Sentinel, gcloud terraform vet, conftest, etc. to limit using _binding resources.

eriksw commented 2 years ago

Is there some simple way to make a resource type totally banned in our project?

That would be a use case for the Sentinel feature of Terraform Cloud.

Google sometimes adds permissions independently of terraform. It's never ok for terraform to overwrite bindings it didn't add itself.

I for instance rely on Terraform to ensure the revocation of dangerous permissions granted by Google to e.g. the default Compute Engine service account. (In most cases I accomplish this by using iam_policy, but in other situations I've explicitly relied on having an empty _binding for e.g. Owner and Editor.)

upodroid commented 2 years ago

Google sometimes adds permissions independently of terraform.

You shouldn't be managing roles/*.serviceAgent bindings in terraform at all. These are created by Google to allow products to work in your project. One of the reasons why we don't recommend the usage of google_project_iam_policy.
_binding is intentionally designed to drop IAM roles not specified in terraform. It allows you to be more deliberate about permissions you explicitly grant but a small number of well-known google managed service accounts are granted roles/editor that you need to pay extra attention to. project-number@cloudservices.gserviceaccount.com is a common one.
It is a good practice to get the IAM policy object (gcloud projects get-iam-policy SOMETHING) before creating _binding or _policy resources in terraform and act accordingly.

I for instance rely on Terraform to ensure the revocation of dangerous permissions granted by Google to e.g. the default Compute Engine service account. (In most cases I accomplish this by using iam_policy, but in other situations I've explicitly relied on having an empty _binding for e.g. Owner and Editor.)

This is no longer required now. There is an organization policy constraints/iam.automaticIamGrantsForDefaultServiceAccounts that skips granting permissions to the default App Engine and Compute Engine service accounts on new projects. Please use the right set of tools to achieve what you need.

hashicorp / terraform-provider-google

Usability improvements for *_iam_policy and *_iam_binding resources #8354

Community Note

Description

New or Affected Resource(s)

Potential Terraform Configuration

References

Usability improvements for _iam_policy and _iam_binding resources #8354