pulumi / pulumi

Pulumi - Infrastructure as Code in any programming language 🚀
https://www.pulumi.com

Consider automatically refreshing on preview/update/destroy #2247

Open lukehoban opened 5 years ago

lukehoban commented 5 years ago

We support manually refreshing state from target cloud providers using pulumi refresh, but we do not run this refresh automatically. As a result, users frequently end up in a situation where pulumi preview or pulumi update presents misleading guidance, because the plan is computed against the stale state in the Pulumi checkpoint rather than the state the user knows their resources are actually in.

We have long considered whether we should move to a model where we run refresh by default prior to preview or update. This would make it more likely that proposed changes are fully accurate. Notably, though, it would likely introduce a meaningful performance penalty, since all resources would need to be refreshed before any diffs are computed, including on previews.

Dominik-K commented 4 years ago

I strongly support running a refresh by default (and disabling it with a --refresh=false flag):

  1. I think that's the more intuitive & expected behavior. Newbies then automatically learn that there's a state which Pulumi uses as its source of truth.
  2. It matches the behaviour that people coming from Terraform are already trained to expect.
  3. Refresh performance keeps improving, e.g. the recent big Kubernetes provider refresh performance improvement.

sam-cogan commented 4 years ago

I would strongly advocate for making this the default behaviour. As a new user coming from Terraform this was confusing, but more importantly, when working with some providers, namely Azure, not doing this breaks the idempotent nature of the tool and causes some resources to be removed on a second apply. This is dangerous when you're relying on the end user to make sure they apply a CLI flag.

lukehoban commented 4 years ago

not doing this breaks the idempotent nature of the tool and causes some resources to be removed on a second apply

Do you have an example of a specific resource that has this problem? Would love to understand that better.

sam-cogan commented 4 years ago

Do you have an example of a specific resource that has this problem? Would love to understand that better.

The examples I've seen are around Azure, where you can configure resources either as a sub-resource or as their own resource, for example Network Security Group rules and subnets. If you configure them as their own resources, then when you run pulumi up a second time these get removed. This isn't really Pulumi's fault; it is down to a poor implementation of the Azure API, which expects the top-level object to contain all the sub-resources yet still offers the ability to manage them as their own resources. However, I never encountered this issue with Terraform, as it always refreshes. When I started using Pulumi and just saw these resources disappear, it was very confusing.

Whilst using the -r flag will fix this, I've now got to make sure that users do this.

samirageb commented 4 years ago

@lukehoban For an example of destructive, counter-intuitive behavior see https://github.com/pulumi/pulumi-azure/issues/572

lukehoban commented 3 years ago

We are strongly considering making a change to the default here as part of Pulumi 3.0.

ncsibra commented 3 years ago

If you do so, please make it configurable via an environment variable or a config file. Most of the time we don't need this and, if possible, I don't want to type --refresh=false or something like that for every command. :)

imod commented 3 years ago

How about adding refresh as an option to Pulumi.yaml? That way we could ensure it is used consistently and wouldn't have to pass it as an argument on every run.

GoodMirek commented 3 years ago

How about adding refresh as an option to Pulumi.yaml? That way we could ensure it is used consistently and wouldn't have to pass it as an argument on every run.

This would be a nice improvement for me too. Is there any plan for implementing it?

pgavlin commented 3 years ago

How about adding refresh as an option to Pulumi.yaml? That way we could ensure it is used consistently and wouldn't have to pass it as an argument on every run.

This is very interesting. @lukehoban @infin8x what do you think? Even if we do change the default behavior to refresh by default, refresh: false in Pulumi.yaml seems like a good knob to have.

lukehoban commented 3 years ago

How about adding refresh as an option to Pulumi.yaml? That way we could ensure it is used consistently and wouldn't have to pass it as an argument on every run.

This is very interesting. @lukehoban @infin8x what do you think? Even if we do change the default behavior to refresh by default, refresh: false in Pulumi.yaml seems like a good knob to have.

Agreed - this is a great idea, and something we could add even before switching the default. We'd want to think a bit about the general mechanism for where to provide this setting; there are several other related things we're looking at adding to project and/or stack files, like backend and hierarchical config. Ideally we can design a general model that covers all of these.

jre21 commented 2 years ago

I'm honestly somewhat shocked that this is still an ongoing discussion three years after it was first raised. As a user of Pulumi, my expectation is that the tool will discover the ways in which the live environment differs from my desired state (as defined in source code) and remediate those differences. The fact that Pulumi keeps an internal cache of the state of external resources, and needs to reconcile that cache in order to do so, is, from the user's perspective, an implementation detail.

Now, don't get me wrong: the way Pulumi stores its state is obviously very important to its functioning and certainly something that people who develop the tool need to be constantly aware of. But when actually using the tool, the more I need to reason about its internal state, the greater the likelihood that I'll make a mistake that causes a production outage. Having Pulumi automatically update its internal state before performing any action helps the tool to get out of my way and present fewer surprises to the user.

My team manages a complex AWS environment spread across several dozen stacks. Although our goal is for Pulumi to be the source of truth for the environment's configuration, resources do sometimes get changed out from under Pulumi, either because there was (possibly accidental) overlap between the resources that different stacks managed, or because some other process made a change. It's not a common occurrence, but that's part of the danger; resources get changed under us just infrequently enough that we might not think to watch for it. That's why reconciling internal state against the cloud provider's source of truth needs to be automatic, and in our case the additional safety we would gain from doing so certainly outweighs the additional time needed to perform the check. When modifying databases or kubernetes node groups, it's also not uncommon for a stack update to require ten minutes or more, which dwarfs the time needed for a refresh.

I understand how disruptive changing a default can be to users who have grown to expect the existing behavior, and acknowledge that this would need to be part of a major release where such sweeping changes are expected. I would, however, ask that you prioritize the addition of a configuration flag to set this behavior, as a less brittle workaround than requiring users who want this behavior to remember to set --refresh on every invocation.

ETA: the fact that StackReference resources automatically refresh on each update also primes users to expect that other resources will behave similarly.
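
For context, a minimal TypeScript sketch of the StackReference usage being referred to (the stack and output names are illustrative, not from the thread):

```typescript
import * as pulumi from "@pulumi/pulumi";

// References another stack's outputs. Per the comment above, these cross-stack
// values are read fresh from the referenced stack on each update, unlike the
// cached state of ordinary resources.
const networking = new pulumi.StackReference("myorg/networking/prod"); // hypothetical stack
export const vpcId = networking.requireOutput("vpcId");                // hypothetical output name
```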

infin8x commented 2 years ago

@jre21 thanks for sharing your thoughts and suggestions here, and sorry it's taken us a long while to get to this. Our north star in this area is to get to auto-refresh on every preview/update/destroy, as was originally requested in this issue. To your point, we'll need to do that at a major version release since it will be a "breaking"/high-impact change from our current behavior. I've logged it in https://github.com/pulumi/pulumi/issues/7346 so we'll make sure to discuss it later this year when we begin planning for Pulumi 4.0.

In the meantime, I appreciate your suggestion to "[add] a configuration flag to set this behavior". We agree that this is a reasonable step in the right direction and will work on https://github.com/pulumi/pulumi/issues/8058 in the next few weeks. If you like, consider subscribing to that issue to be notified when we get it shipped!

komalali commented 2 years ago

Hey folks! #8058 is now closed and will be included in the next release (next week). Take a look at the accompanying PR description for details on the added option and CLI overrides.
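
For reference, the project-level knob that shipped for #8058 is set in Pulumi.yaml; a minimal sketch (the values are illustrative, and the PR description is authoritative for the exact schema and the CLI overrides):

```yaml
# Pulumi.yaml
name: my-project   # hypothetical project name
runtime: nodejs
options:
  refresh: always  # refresh state before every preview/update/destroy of this project
```

Per the PR description, the CLI flags can still override this setting on a per-run basis.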

jre21 commented 2 years ago

Looking back on this issue, I think pulumi needs significant behavioral changes before automatic refreshes could provide a truly excellent experience.

From a design perspective, pulumi acts as if its internal state is just as valuable as the live environment, and that changes to either need to be handled with equal care. As a user of pulumi, I see the environment itself as ground truth, while pulumi's state should act more like a cache. This has many implications for how refreshes during other pulumi operations should be handled.

  1. Any time pulumi performs a refresh, it asks for confirmation before writing changes to state. For most use cases, I would rather this write happen automatically. After all, the environment already changed, and a statefile which acknowledges the current environment is more useful than one which retains its own original data. In particular, pulumi preview --refresh discards the results of the refresh entirely (treating a write to the statefile as an active change) instead of retaining them (incorporating all available knowledge of the system's true state).
  2. When pulumi displays diffs (e.g., pulumi preview --diff --refresh) the visual style gives very limited indication of which changes resulted from the refresh versus which are proposed updates. I would rather see a clearly visible separator, and perhaps have a "refresh" diff use different colors than an "update" diff. While it can be useful to know when changes have previously been made, it's vital to know which changes pulumi will make.
  3. From what I've observed, pulumi up --refresh actually checks the state of external resources twice -- first during the preview (without writing to the statefile) and again at the beginning of the apply phase. I'd actually rather see pulumi only perform the first refresh (and ideally write the statefile changes even if the update is rejected). This would provide a minor time optimization, but more importantly, it provides better consistency guarantees. Once I confirm a pulumi update, I want it to perform the changes it said it would enact, and the current implementation makes me worry that, if the second refresh returned different information than the original, pulumi might take actions (on live resources) which weren't shown in the preview. And when running pulumi against a live environment, I often need to plan for the changes that occur just as much as I need to plan out the target state.

Due to all these considerations, for particularly delicate changes I've tended to start with a standalone pulumi refresh and then run the remaining operations without the --refresh flag. It gets the behavior I want, but, like any manual process, increases the risk of operator error. Pulumi would be friendlier to its users if its automated behavior were more closely matched to their view of the world. I hope you'll take these principles into consideration when moving towards auto-refresh by default.
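
One way to make that refresh-first workflow repeatable, rather than relying on an operator remembering the steps, is to script it with the Pulumi Automation API. A minimal TypeScript sketch, assuming an existing local project and a stack named "prod" (names are illustrative):

```typescript
import { LocalWorkspace } from "@pulumi/pulumi/automation";

async function refreshThenUpdate() {
    // Select an existing stack of the Pulumi project in the current directory.
    const stack = await LocalWorkspace.selectStack({
        stackName: "prod",      // hypothetical stack name
        workDir: process.cwd(),
    });

    // Step 1: reconcile Pulumi's state with the live environment up front.
    await stack.refresh({ onOutput: console.log });

    // Step 2: preview against the freshly reconciled state (no further refresh),
    // then apply. A real pipeline might gate up() on the preview summary or on
    // a human approval.
    const preview = await stack.preview();
    console.log(preview.changeSummary);

    await stack.up({ onOutput: console.log });
}

refreshThenUpdate().catch((err) => {
    console.error(err);
    process.exit(1);
});
```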

davide-romanini-milkman commented 1 year ago

I would like to call attention to the fact that sometimes refresh seems to do bad things. Take as an example aws.ec2.SecurityGroup and aws.ec2.SecurityGroupRule, with the second attached to the first. When I refresh these kinds of tightly related resources, I find that all the attached ingress/egress rules defined by aws.ec2.SecurityGroupRule become part of the state of aws.ec2.SecurityGroup! On the next pulumi up, all the rules are removed, requiring another refresh to pick up the deleted rules, then pulumi up again. This obviously causes downtime, so I would strongly advise against making refreshing the default behavior. Ideally, refresh should be done very infrequently, because manual operations should be avoided at all costs, except for emergencies, which by definition don't happen all the time.

gunzy83 commented 1 year ago

I have come across this issue while looking into drift detection methods. I would like to add a downvote to this proposal for Pulumi in its current form. Without changes to how Pulumi displays changes to the user, I see this causing a lot of confusion for people who are not the primary operators of the tool (the primary operators generally being cloud engineering/DevOps roles).

Background

I joined a small development team already using Pulumi, and it is now my favourite IaC tool. With this team I found early on that a change would be made with dev as the target, tested, PR-reviewed and then merged (with no deploy). For the main app deployment code this was fine, as the inventory of undeployed changes would not build up (deploys were automated), but on other projects for supporting infrastructure, if the author was not ready to push their change through staging or production, that inventory would build up, and every time another engineer went to deploy the project in question there would be a bunch of changes not yet deployed. I would be pulled in to work out why these extra changes were in their preview. This is partly fixed for us now via CD on merge, but adding a refresh to PR and merge Pulumi runs would cause the same confusion and panic all over again, as updates to manually changed resources and weird issues with refreshed resources creep into the normal workflow. Imagine trying to explain to a frontend engineer why their one-line change to an environment variable on their app resulted in a 7-resource preview diff on their PR.

My Thoughts

Now, I may be off base here (Pulumi's CTO raised this issue), but I have operated with this mental model for IaC in general for some time now, across multiple tools (Pulumi actually does it best):

If you always query the current real state and update the as-built state by default, you are wiping out the record of what the previous operation on that stack applied, and lumping remediation of drift in with the set of expected changes from a commit/code change, which makes root-cause analysis more difficult if there is an issue in that deployment. Not to mention the weird issues seen with some resources that @davide-romanini-milkman mentioned above (AWS Security Groups and Route Tables come immediately to mind).

I thought about this for a couple of weeks. I was dead set against this before, but after some thought I would not oppose a change to make this the default if the UX were improved to distinguish changes that are drift remediation from those that come from a code change. Even then, I think I would still use the configuration option to turn it off on some stacks and run separate scheduled drift-remediation jobs in CI. I love the idea of continual reconciliation of actual state to desired state, but I think that requires a level of maturity, and the current default provides guard rails for teams who may not be ready for it yet.

jre21 commented 1 year ago

I definitely agree that pulumi's preview needs to make a clear visual distinction between drift remediation and proposed changes, and would like to see the two types of changes rendered in entirely different colors. I also agree that switching to a refresh-by-default model without making those UX changes would cause a great deal of confusion and likely make the experience of using pulumi worse overall, rather than better.

What I do not agree with is the idea that running an update without first refreshing pulumi's state is the safer approach. There are many reasons why external processes might change resource state out from under pulumi, and pulumi cannot transform infrastructure into the correct end state if it has an incorrect understanding of the initial state.

As to the unfortunate behavior of the SecurityGroup and RouteTable resources: I don't want to minimize the pain of people who have had security group rules deleted when they weren't expecting it, but the fact of the matter is that it's the result of a poorly-designed resource (although it was written that way for valid historical reasons), not a flaw in the refresh functionality. The safer way to use security groups is to create SecurityGroupRule resources instead of using the ingress and egress attributes on SecurityGroups. Pulumi and Terraform's documentation goes to some effort to warn about the sharp edges of the latter interface, but probably not enough, and it probably would have been better for everyone if inclusion of security group rules directly in the SecurityGroup resource were explicitly deprecated (and likewise for route tables).
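
For concreteness, a minimal TypeScript sketch of the safer pattern described above: standalone aws.ec2.SecurityGroupRule resources attached to a SecurityGroup that declares no inline ingress/egress (the VPC ID and resource names are illustrative):

```typescript
import * as aws from "@pulumi/aws";

// Security group with no inline ingress/egress rules; rules are managed separately.
const webSg = new aws.ec2.SecurityGroup("web-sg", {
    description: "Web tier security group",
    vpcId: "vpc-0123456789abcdef0", // hypothetical VPC ID
});

// Standalone rule resources, each tracked individually in Pulumi's state.
const allowHttps = new aws.ec2.SecurityGroupRule("allow-https", {
    type: "ingress",
    protocol: "tcp",
    fromPort: 443,
    toPort: 443,
    cidrBlocks: ["0.0.0.0/0"],
    securityGroupId: webSg.id,
});

const allowAllEgress = new aws.ec2.SecurityGroupRule("allow-all-egress", {
    type: "egress",
    protocol: "-1",
    fromPort: 0,
    toPort: 0,
    cidrBlocks: ["0.0.0.0/0"],
    securityGroupId: webSg.id,
});
```

Mixing this pattern with inline ingress/egress blocks on the same SecurityGroup is what sets up the refresh/update tug-of-war described earlier in the thread.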

gunzy83 commented 1 year ago

I definitely agree that pulumi's preview needs to make a clear visual distinction between drift remediation and proposed changes, and would like to see the two types of changes rendered in entirely different colors. I also agree that switching to a refresh-by-default model without making those UX changes would cause a great deal of confusion and likely make the experience of using pulumi worse overall, rather than better.

100% agree.

What I do not agree with is the idea that running an update without first refreshing pulumi's state is the safer approach. There are many reasons why external processes might change resource state out from under pulumi, and pulumi cannot transform infrastructure into the correct end state if it has an incorrect understanding of the initial state.

I agree that, to be 100% pure and correct, the system performing the updates needs to know the initial state. I just don't agree that overwriting state before every deploy is correct either. It treats the currently queried state as if it were the state the program saw at the end of the last deploy, and if the deploy fails halfway through you then lose the context of the change (drift remediation vs. code change). Only applying state changes as resources actually change seems more correct to me (unless you explicitly run a refresh to purposely match state to the real world).

I just think we have a fundamental disagreement on how we view state (I am assuming this because you didn't address my thoughts on state being more than just a cache). Would love to see one of the Pulumi authors/maintainers chime in on this one.

As to the unfortunate behavior of the SecurityGroup and RouteTable resources: I don't want to minimize the pain of people who have had security group rules deleted when they weren't expecting it, but the fact of the matter is that it's the result of a poorly-designed resource (although it was written that way for valid historical reasons), not a flaw in the refresh functionality. The safer way to use security groups is to create SecurityGroupRule resources instead of using the ingress and egress attributes on SecurityGroups. Pulumi and Terraform's documentation goes to some effort to warn about the sharp edges of the latter interface, but probably not enough, and it probably would have been better for everyone if inclusion of security group rules directly in the SecurityGroup resource were explicitly deprecated (and likewise for route tables).

This is exactly the mode of operation @davide-romanini-milkman and I were talking about (using the external rule resources rather than applying them on the base resource). I just did some quick testing with updated packages, and it appears the changes rolled in from Terraform in 5.22.0 of the AWS provider fix security groups; aws.ec2.Route resources seem to work fine now as well, though I don't have time to track down where/when that got fixed.

The aws.ec2.Tag resource breaks this pattern if you also apply tags at the provider level: it results in the tag being repeatedly removed and reinstated on each refresh. Mixing provider-level and per-resource tags seems to work fine in the same stack, and tolerating additional tags applied from outside (EKS automatic discovery tags come to mind) can be done by adding ignoreTags with keyPrefixes to the provider. This, however, is best documented in the Terraform docs, not Pulumi's.
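
A rough TypeScript sketch of that provider setup, assuming the defaultTags and ignoreTags inputs bridged from the upstream Terraform provider (check the provider docs for the exact shapes):

```typescript
import * as aws from "@pulumi/aws";

// Explicit provider with stack-wide default tags, while ignoring tag keys that
// external systems (e.g. EKS automatic discovery) add outside of Pulumi.
const provider = new aws.Provider("tagged-aws", {
    region: "us-east-1",                 // hypothetical region
    defaultTags: {
        tags: {
            team: "platform",            // hypothetical default tags
            managedBy: "pulumi",
        },
    },
    ignoreTags: {
        keyPrefixes: ["kubernetes.io/"], // e.g. EKS automatic discovery tags
    },
});

// Resources created with this provider inherit the default tags.
const bucket = new aws.s3.Bucket("example-bucket", {}, { provider });
```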

There are others, particularly in Kubernetes, where you get eye-watering levels of garbage state updates on each refresh, which in my experience have resulted in changes on many subsequent ups... and let's not talk about admission controllers. I see the issue you raised; I gave it a thumbs up.

Good to see some things fixed and that we are able to find workarounds, but I still think this will be massively jarring if made the default without a concerted effort to prepare the community (documentation and comms) and improve the UX. At least the config option is there for those that want it right now, which means the change to the default does not have to be rushed and the option can be used to help get everyone ready if this change does go through.

Huffers commented 1 month ago

Just adding a note to say that I recently tried to roll out a change to an Azure Network Security Group's tag via Pulumi - a very simple, innocuous change which should not have required downtime, but Pulumi actually reverted the NSG rules to a previous version which had been (unbeknownst to me) stored inline in the NSG resource's definition in the production infrastructure's state file. This changed which ports were allowed through and blocked all requests to the application.

Investigating why requests were failing was tricky because "pulumi preview" had said it was only going to change a tag, and "pulumi deploy" claimed that it did only change a tag. Looking at the Azure Resource Group's activity log showed what pulumi had actually done, but it took a while to find that, as I wasn't expecting Pulumi to say one thing but do something else.

This wasn't caught during testing on my staging environment, because even though production and staging were both on version X and I tested out going from X to X+1 on staging (and it worked as expected), going from version X to X+1 on production did something completely different because production had these extra, "hidden" NSG rules inline in the NSG resource definition in the Pulumi state file, but not in the staging one - despite the infrastructure itself and version of the code used to deploy them being identical. This makes this kind of bug somewhere between difficult and impossible to find during testing. I eventually traced the difference in the content of the stack files as being due to the staging environment having been torn down and redeployed at some point, leading to the old copy of the inline NSG rules getting cleared from its stack file.

This has led to the conclusion that running pulumi without the "refresh" option is a dangerous operation, gaining a little speed at the cost of not knowing what pulumi will actually do or being able to trust that it has done what it says it did. As a result, I would like to always run pulumi with the refresh option; however, because it isn't the default, it seems that not all the providers are particularly well tested with it (Kubernetes in particular).

So in practice... for now you just have to remember that Pulumi might sometimes make unexpected, pseudo-random, unasked-for changes that don't show up in the preview and that it doesn't tell you it made, even if you've thoroughly tested your code and tried out deploying your updates on identical infrastructure.

Edit: Even if you only ever deploy or update your infrastructure via Pulumi.

gunzy83 commented 2 weeks ago

Investigating why requests were failing was tricky because "pulumi preview" had said it was only going to change a tag, and "pulumi deploy" claimed that it did only change a tag. Looking at the Azure Resource Group's activity log showed what pulumi had actually done, but it took a while to find that, as I wasn't expecting Pulumi to say one thing but do something else.

We had a similar issue with EKS Managed Node Groups: a version upgrade rolled out after someone had made manual changes to the desired capacity. Without a refresh, the nodes were scaled down to the value in state, because the underlying API call was made with all the parameters, including the original capacity from state, a change which Pulumi did not present in the preview or the update. This caused downstream issues with pod scheduling until we increased capacity and then updated the configuration in the codebase.

This has led to the conclusion that running pulumi without the "refresh" option is a dangerous operation

Is the person who makes changes outside of IaC not doing the actual dangerous operation? This is where detecting drift often and early is valuable IMO (we have some scheduled actions runs that do this now).
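
A rough sketch of what such a scheduled drift check could look like with the Automation API in TypeScript (not necessarily how those Actions jobs are implemented; note that refresh also persists any detected changes to the stack's state, so this is "detect and reconcile" rather than read-only):

```typescript
import { LocalWorkspace } from "@pulumi/pulumi/automation";

async function detectDrift(stackName: string) {
    const stack = await LocalWorkspace.selectStack({ stackName, workDir: process.cwd() });
    const result = await stack.refresh({ onOutput: console.log });

    // resourceChanges maps operation kinds (e.g. "same", "update", "delete") to counts.
    const changes = result.summary.resourceChanges ?? {};
    const drift = Object.entries(changes).filter(([op]) => op !== "same");
    if (drift.length > 0) {
        throw new Error(`Drift detected in ${stackName}: ${JSON.stringify(changes)}`);
    }
}

detectDrift("prod").catch((err) => { // hypothetical stack name
    console.error(err);
    process.exit(1);
});
```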

As a result, I would like to always run pulumi with the refresh option; however, because it isn't the default, it seems that not all the providers are particularly well tested with it (Kubernetes in particular).

You can turn this on per stack currently via config. You are correct that the providers are not really ready for this but I think it is improving over time.

I agree that the current situation, where changes may not be presented in the Pulumi output, is not great (mainly due to how the underlying cloud APIs work), but I still think that program-output changes are required before this default is changed. We need to ensure that a user can differentiate between their own changes (submitted via code changes and the PR process) and drift remediation.

Huffers commented 2 weeks ago

Is the person who makes changes outside of IaC not doing the actual dangerous operation?

In my example nobody had made changes outside of IaC.

Pulumi refresh is needed even if you only ever use Pulumi to make changes to your infrastructure.

I think perhaps I should explain this a bit better with an example:

Imagine you use Pulumi to create a Network Security Group with a Rule with port 80 open. After you run "pulumi up" your stack file looks like this pseudocode:

Resources:
    - Type: NetworkSecurityGroup
      Tags:
       - Name: Description
         Value: Some description
      Rules:
        - RuleType: Allow
          Port: 80
    - Type: NetworkSecurityGroupRule
      RuleType: Allow
      Port: 80

Note the rule is included twice - inline inside the NSG and separately as its own resource - even though the rule was only defined once in your Pulumi code. Then let's say you update the NSG rule to allow port 443 instead. You run "pulumi up" and pulumi only updates that rule, not the NSG. Your stack file now looks a bit like this:

Resources:
    - Type: NetworkSecurityGroup
      Tags:
       - Name: Description
         Value: Some description
      Rules:
        - RuleType: Allow
          Port: 80
    - Type: NetworkSecurityGroupRule
      RuleType: Allow
      Port: 443

Note that because only the rule was updated, the inline redundant copy of the rule still lists the old port as open. Now what happens if we update the "Description" tag on the Network Security Group and run "pulumi up"? Pulumi says it's only changing the NSG's description, and sends across the following to Azure:

    - Type: NetworkSecurityGroup
      Tags:
       - Name: Description
         Value: New Description
      Rules:
        - RuleType: Allow
          Port: 80

Thus Pulumi makes surprising changes without saying what it's going to do in the preview or reporting what it did afterwards, and without anyone or anything having made any changes to your infrastructure outside of Pulumi.
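
For reference, a hedged TypeScript sketch of the kind of program that produces this state shape, using the classic pulumi-azure provider (resource names, the resource group, and the rule values are illustrative); the rule is declared only once, as a standalone resource:

```typescript
import * as azure from "@pulumi/azure";

const rg = new azure.core.ResourceGroup("app-rg", { location: "West Europe" });

// NSG declared without inline securityRules; only the tags live on the group itself.
const nsg = new azure.network.NetworkSecurityGroup("app-nsg", {
    resourceGroupName: rg.name,
    location: rg.location,
    tags: { Description: "Some description" },
});

// The rule is defined once, as its own resource...
const allowHttp = new azure.network.NetworkSecurityRule("allow-http", {
    resourceGroupName: rg.name,
    networkSecurityGroupName: nsg.name,
    priority: 100,
    direction: "Inbound",
    access: "Allow",
    protocol: "Tcp",
    sourcePortRange: "*",
    destinationPortRange: "80",
    sourceAddressPrefix: "*",
    destinationAddressPrefix: "*",
});
// ...yet, per the walkthrough above, the NSG's state can also end up holding an
// inline copy of it, which a later tag-only update will faithfully send back to Azure.
```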

gunzy83 commented 2 weeks ago

Pulumi refresh is needed even if you only ever use Pulumi to make changes to your infrastructure.

Thanks for clarifying. You are correct that a refresh would uncover this as a misconfiguration. I see that the Azure provider docs have an explicit warning just like the AWS resources (EC2 Security Groups, Route Tables and others):

[Screenshot: warning from the Azure provider docs about defining rules both inline on the security group and as standalone rule resources]

It is a bit sad that these foot-guns still exist (in both AWS and Azure), and I think inline configuration of rules should be deprecated, as per jre21's post. I am still not convinced that overwriting state on every update is the right thing to do without output that differentiates between the two types of changes.