pulumi / pulumi-aws

An Amazon Web Services (AWS) Pulumi resource package, providing multi-language access to AWS
Apache License 2.0
445 stars 154 forks source link

SQS QueuePolicy update strategy is incorrect #1923

Open SeanSanker opened 2 years ago

SeanSanker commented 2 years ago

What happened?

I recently changed the binding of a parent/child relationship for an aws.sqs.QueuePolicy. When it was updated, it created a replacement and deleted the old one.

Turns out that they’re not distinct entities, they’re just a property of the aws.sqs.Queue.

That means creating a replacement then deleting the prior essential just deleted the property all together, thus breaking our workflow and losing a few hours of messages for the Queue.

This seems to be a bug in the update strategy.

Steps to reproduce

  1. Create a Queue and an attached QueuePolicy.
  2. Deploy them
  3. Rename the QueuePolicy and/or change the { parent } property
  4. Deploy again

Expected Behavior

The QueuePolicy is updated with the new definition, be it in-place, or by a create/delete replacement strategy.

Actual Behavior

  1. During reproduction steps 1-2, the entities are created.
  2. During reproduction step # 4
    1. The strategy chosen is to replace: create the new QueuePolicy, then delete the old one.
    2. The new QueuePolicy is created
    3. The old QueuePolicy is deleted (which is the exact same entity as the new one)
    4. The Queue ends up with no QueuePolicy at all.

Versions used

CLI Version 3.30.0 Go Version go1.18.1 Go Compiler gc

Plugins NAME VERSION nodejs unknown

Host OS darwin Version 12.3.1 Arch arm64

Additional context

The resolution ended up being:

  1. Run pulumi refresh so it detected that the policy had been deleted
  2. Run pulumi up to re-create the entity/property.

Slack discussion here: https://pulumi-community.slack.com/archives/C84L4E3N1/p1650636402241399

The suggestion from @jaxxstorm was to use an alias. See below: ...when you say you changed the parent child relationship, you mean as a resource option? If so, this is expected behaviour and you can use an alias to prevent a replacement

I don't think that using an alias is the correct solution here.

An alias should be used when we want to retain an existing entity for some intentional reason. I think that's a workaround, but not a solution to the core issue that I see: QueuePolicy has a faulty update strategy as it identifies a QueuePolicy as a separate entity rather than an attribute of a Queue

The issue is that the default contract of pulumi's functionality: changing code then performing a deploy should update the state of the environment without breaking it is broken due to this issue.

Docs on policies for SQS Queues: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-authentication-and-access-control.html#access-control Amazon SQS has its own resource-based permissions system that uses policies written in the same language used for AWS Identity and Access Management (IAM) policies. This means that you can achieve similar things with Amazon SQS policies and IAM policies.

Docs on setting the Policy attribute for a Queue: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_SetQueueAttributes.html

To confirm for yourself that it's a property run the following on a Queue with a Policy set: aws sqs get-queue-attributes --queue-url [some url with a policy] --attribute-names Policy

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

SeanSanker commented 2 years ago

I'd also be more than happy to help fix this, but I don't know where to start or how these strategies are defined/implemented. Let me know if someone wants to hop on a call and work through this with me and I'll try and resolve these kinds of issues myself (if possible) in the future.

AaronFriel commented 2 years ago

@SeanSanker digging in to this, I think I understand the surface area here and have an explanation for what happened. This first comment will just just attempt to describe the status quo, as I'm not saying the behavior you saw is desirable, it's never great to lose data.

Looking at queuePolicy, there's only one way to rename it, which would be to change that first argument to the resource, what we call a "logical name".

new aws.sqs.QueuePolicy("first-name", 
                       # ^ logical name
  { /* args */ },
  { parent: /* some parent resource */ },
);

I hope I can describe the engine's present behavior and scratch at the surface of why it occurs like this though, from the Resource Names docs:

Each resources in Pulumi has a logical name and a physical name. The logical name is how the resource is known inside Pulumi, and establishes a notion of identity within Pulumi even when the physical resource might need to change (for example, during a replacement). The physical name is the name used for the resource in the cloud provider that a Pulumi program is deploying to.

Pulumi auto-names most resource by default, using the logical name and a random suffix to construct a unique physical name for a resource. Users can provide explicit names to override this default.

Each resource also has a Uniform Resource Name (URN) which is a unique name derived from both the logical name of the resource and the type of the resource and, in the case of components, its parents.

Consider this situation:

pulumi up on this:

const queue1 = new aws.sqs.Queue("first-queue");
new aws.sqs.QueuePolicy("first-queue-policy", { 
  queueUrl: queue1.id, /* args */ 
});

Then we change the program like so:

 const queue1 = new aws.sqs.Queue("first-queue");
-new aws.sqs.QueuePolicy("first-queue-policy", {
+new aws.sqs.QueuePolicy("new-name-for-first-queue-policy", {
   queueUrl: queue1.id, /* args */ 
 });
+
+// A second queue added in the same program:
+const queue2 = new aws.sqs.Queue("second-queue");
+new aws.sqs.QueuePolicy("second-queue-policy", { queueUrl: queue2.id, /* args */ });

The question the engine has to ask the provider:

  1. Do we delete first-queue-policy
  2. Do we update first-queue-policy using the args of new-name-for-first-queue-policy
  3. Do we update first-queue-policy using the args of second-queue-policy

To support all 3 options, you could:

  1. Run the second program as-is. This is the default behavior.
  2. To rename the first queue policy's logical name without deletion, use the alias resource option:
 const queue1 = new aws.sqs.Queue("first-queue");
new aws.sqs.QueuePolicy("new-name-for-first-queue-policy", {
    queueUrl: queue1.id, /* args */ 
  }, {
   aliases: [{ name: 'first-queue-policy' }],
  });
// ... second queue as-is
  1. And likewise for the 3rd case:
new aws.sqs.QueuePolicy("second-queue-policy", { queueUrl: queue2.id, /* args */ }, {
  aliases: [{ name: 'first-queue-policy });
AaronFriel commented 2 years ago

@SeanSanker For this second comment, you asked during the community forum whether you could contribute to fix this. I'm sad to say this would be a heck of a change for a first time contributor. I'm not sure if it's impossible though, as you note this is the surprising behavior:

  1. The new QueuePolicy is created
  2. The old QueuePolicy is deleted (which is the exact same entity as the new one)
  3. The Queue ends up with no QueuePolicy at all.

That's very very tricky. "QueuePolicy" is kind of a strange resource, (EDITED) our oldest issues describe this as a "structural resources". It doesn't have a name and it shouldn't really be tracked by its name. It's something we've inherited from another, unnamed infrastructure as code tool. :) It's not really a distinct resource, it's just an alias for a property or set of properties on another resource.

I think that this comment and the many 👍 reactions to it exemplifies why these sorts of resources behave in surprising ways. One way we're fixing this in our native providers is by not modeling these pseudo-resources. As you note, a "Queue" and its "QueuePolicy" are really the same thing, and they should be managed together as a unit. We're doing that in our native provider, aws-native, documented here:

https://www.pulumi.com/registry/packages/aws-native/api-docs/sqs/queue/

We've seen this behavior before with other (EDITED) structural resources, and I wonder if we can improve on this by more eagerly deleting resources safely. I'm going to take this to the team to ask some deeper questions and get back to you.

AaronFriel commented 2 years ago

@Frassle got back to me quite quickly and confirmed what I thought. There's a deep vein of related issues we want to tackle and are mulling over the right approach. The oldest issues we've identified are these:

And more recently and in this repo:

I'm linking to these to ensure that we appropriately cross-reference this with these other issues.

t0yv0 commented 9 months ago

Hi everyone, I'm picking this up in the upcoming iteration and looking at some options.

Naively, both the Pulumi engine (embedded in Pulumi CLI) and the provider pulumi-aws jointly participate in planning how to execute the update operation. In particular the provider has this response option in the Diff method:

type DiffResponse struct {
    DeleteBeforeReplace bool                     `protobuf:"varint,3,opt,name=deleteBeforeReplace,proto3" 

I'm wondering if perhaps we could make the provider enforce that for QueuePolicy always, just based on return type.. so all replaces are indicated by the provider to need to execute in "delete, create" order.

Let me experiment and see if this can help get past the particular issue here, also curious generally if that has some adverse effects and would be undesirable (outside of this issue in particular).

t0yv0 commented 9 months ago

import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

const queue = new aws.sqs.Queue("queue");

// Fetch the configuration for the current stack
let config = new pulumi.Config();

let step = config.require("step");

const testPolicyDocument = queue.arn.apply(arn => aws.iam.getPolicyDocumentOutput({
    statements: [{
        sid: "First",
        effect: "Allow",
        principals: [{
            type: "*",
            identifiers: ["*"],
        }],
        actions: ["sqs:SendMessage"],
        resources: [arn],
        conditions: [{
            test: "ArnEquals",
            variable: "aws:SourceArn",
            values: [queue.arn],
        }],
    }],
})).json;

const parentBucket = new aws.s3.Bucket("parent-bucket", {});

const queuePolicy = new aws.sqs.QueuePolicy("first-queue-policy", {
    queueUrl: queue.id, // changing this one will replace too.
    policy: testPolicyDocument,
}, step == "initial" ? {} : {parent: parentBucket});

export const queueId = queue.id;
#!/usr/bin/env bash

set -euo pipefail

export PULUMI_CONFIG_PASSPHRASE=12345
export AWS_PROFILE=devsandbox

pulumi destroy --yes

pulumi config set step initial
pulumi up --skip-preview --yes

aws sqs get-queue-attributes --queue-url $(pulumi stack output queueId) --attribute-names Policy  # non-empty

pulumi config set step replacement
pulumi up --skip-preview --yes

aws sqs get-queue-attributes --queue-url $(pulumi stack output queueId) --attribute-names Policy  # unexpectedly empty

Still reproduces with this code on v6.13.3 aws and v3.97.0 pulumi CLI.

t0yv0 commented 9 months ago

Alright, I have some more insight on tracing this through PULUMI_DEBUG_GRPC="$PWD/log.json" logs. It turns out that in the case of a reparented QueuePolicy the Diff method is not being called on the QueuePolicy so trying to fix the Diff method in the provider is moot. Essentially in this case the URN of the QueuePolicy is changing and the engine considers these as two separate resources (before the reparenting and after reparenting), and fails to identify the one resource. So the engine does not realize it is replacing, but rather thinks it's creating a new QueuePolicy and then deleting the old one. Creating before deleting is the generic policy that maximizes uptime compared to deleting and then creating.

I don't think that using an alias is the correct solution here.

Unfortunately there is no clean solution I can think of for the provider to signal the correct resource identity to the engine. Adding an alias does seem to be an appropriate workaround for this situation. An alias makes the engine better aware of the resource identity.

Using this correction to the above code fixes the repro:

const queuePolicy = new aws.sqs.QueuePolicy("first-queue-policy", {
    queueUrl: queue.id, // changing this one will replace too.
    policy: testPolicyDocument,
}, step == "initial" ? {} : {parent: parentBucket, aliases: [{parent: pulumi.rootStackResource}]});
mikhailshilkov commented 4 months ago

@t0yv0 Could you please add a comment on what this issue is blocked on? (a link to an upstream issue or similar)

AaronFriel commented 4 months ago

@t0yv0 delete before replace is a partial mitigation. The current behavior leads to an incorrect state and durable downtime, until a refresh & up. Using delete before replace will result in momentary downtime (ideally sub-second, but... hard to say).

A fuller solution for these resources, which you've described as sidecars, likely relies on addressing structural resources.

t0yv0 commented 4 months ago

Thanks for pinging on this @mikhailshilkov I think there has been some progress but we still need design work to align on what fix is best, and possibly correlate more issues with these problems before attempting a fix. I'm removing the blocked label as we agreed we are going to reserve that for being unable to make progress through org-external dependencies.

@AaronFriel is correct in pointing out that DeleteBeforeReplace and so on are not satisfactory solutions.

To summarize current understanding, the problem is "the case of missing alias" where Pulumi is handling two distinct resources that map to the same physical underlying resource and executing Create(r1), Delete(r2) which ends up re-creating physical resource p1 and then deleting the same physical resource p1. The recommended workaround is to use Pulumi Aliases so that Pulumi can identify the resources as same. This can work around the problem but is not obvious to the user and still not entirely satisfactory.

Here's a quick summary of possible paths to resolution here:

  1. A provider-local fix would be changing Create from succeeding to re-create a resource to failing right away. This at least prevents Pulumi from making progress to the Delete, but the error message may not be guiding the user appropriately on how to use aliases. Something like this was attempted in https://github.com/pulumi/pulumi-aws/issues/2009 and creates some unfortunate consequences https://github.com/pulumi/pulumi-aws/pull/3235 - Pulumi may struggle to recreate the resource even when it needs to for good reasons.

  2. @blampe put up a proposal to change the engine behavior to avoid the Delete https://github.com/pulumi/pulumi/issues/15982 - there were some small-ish objections but perhaps this could be made to work and I suspect it could solve the problem by eliding the Delete

  3. I've put up https://github.com/pulumi/pulumi/issues/16004 proposal that would allow the provider to monitor incoming plans during preview, including planned deletes, and emit a warning if the plan involves Creating and then Deleting two resources that map to what provider believes is the same physical infrastructure. Provider could then emit a nice message suggesting aliases, but there's some significant downsides to this proposal as to whether this functionality actually belongs in the provider as it only sees a slice of the actual plan and wouldn't work reliably in case multiple provider processes are involved.

@lukehoban requested further design work to flesh out the implications here.

mikhailshilkov commented 4 months ago

Does delete-before-replace even do anything in the scenario? If not aliased, we are not replacing a resource, we are creating and new one and deleting an old one, so we’d need to run all deletions before any creations for this to work?

Frassle commented 4 months ago

so we’d need to run all deletions before any creations for this to work?

Aye, and we can't do that (see https://github.com/pulumi/pulumi/issues/2877)