
Snapshot integrity failure after updating Azure elastic pool #17511

Open ashlynshatos opened 1 week ago

ashlynshatos commented 1 week ago

What happened?

I ran a pulumi update intending to scale our existing Azure SQL elastic pool. The only change was to the pool's SKU. After the scaling operation completed, the CLI failed with the output below. I've tried pulumi refresh, which results in the same error.

Leading up to this, I ran a pulumi refresh to resolve some unrelated differences in appsettings and a pulumi update to scale two other servers; both were successful. The database mentioned in the error was moved from [hyperpool-01] to [hyperpool-03] in an update several weeks ago. I assume that has something to do with the error, but I thought the successful refresh meant I was starting from a good place.

As things sit now, our actual resources are all in the expected state. But with the CLI refusing to work with the snapshot, we can't make any further updates and I'm not sure where to go from here.

Example

CLI Output

error: The Pulumi CLI encountered a snapshot integrity error. This is a bug!

================================================================================
We would appreciate a report: https://github.com/pulumi/pulumi/issues/

Please provide all of the text below in your report.
================================================================================
Pulumi Version:    v3.135.1
Go Version:        go1.23.2
Go Compiler:       gc
Architecture:      amd64
Operating System:  windows
Command:           C:\Program Files (x86)\Pulumi\pulumi.exe update -s core.prod8 -t urn:pulumi:core.prod8::core::azure-native:sql/v20221101preview:ElasticPool::[hyperpool01]
Error:             writing snapshot: failed to save snapshot: .pulumi/stacks/core.prod8.json: snapshot integrity failure; it was already written, but is invalid (backup available at .pulumi/stacks/core.prod8.json.bak): resource urn:pulumi:core.prod8::core::azure-native:sql/v20221101preview:Database::[database]'s dependency urn:pulumi:core.prod8::core::azure-native:sql/v20221101preview:ElasticPool::[hyperpool01] comes after it

Stack Trace:

goroutine 3653 [running]:
runtime/debug.Stack()
        /opt/hostedtoolcache/go/1.23.2/x64/src/runtime/debug/stack.go:26 +0x5e
github.com/pulumi/pulumi/pkg/v3/resource/deploy.SnapshotIntegrityErrorf({0x2e0f70e?, 0x72?}, {0xc000f77d70?, 0x72?, 0x40?})
        /home/runner/work/pulumi/pulumi/pkg/resource/deploy/snapshot.go:611 +0x34
github.com/pulumi/pulumi/pkg/v3/resource/deploy.(*Snapshot).VerifyIntegrity(0xc006ff8140)
        /home/runner/work/pulumi/pulumi/pkg/resource/deploy/snapshot.go:497 +0x1053
github.com/pulumi/pulumi/pkg/v3/backend/diy.(*diyBackend).saveStack(0xc00042ab40, {0x3520e28, 0x4bf2fa0}, 0xc001252e40, 0xc006ff8140)
        /home/runner/work/pulumi/pulumi/pkg/backend/diy/state.go:315 +0x13f
github.com/pulumi/pulumi/pkg/v3/backend/diy.(*diySnapshotPersister).Save(0xc006ff8140?, 0xc007203ea0?)
        /home/runner/work/pulumi/pulumi/pkg/backend/diy/snapshot.go:35 +0x2b
github.com/pulumi/pulumi/pkg/v3/backend.(*SnapshotManager).saveSnapshot(0xc000168400)
        /home/runner/work/pulumi/pulumi/pkg/backend/snapshot.go:680 +0x82
github.com/pulumi/pulumi/pkg/v3/backend.(*SnapshotManager).unsafeServiceLoop(0xc000168400, 0xc0028ff0a0, 0xc0028ff180)
        /home/runner/work/pulumi/pulumi/pkg/backend/snapshot.go:733 +0xc5
created by github.com/pulumi/pulumi/pkg/v3/backend.NewSnapshotManager in goroutine 1
        /home/runner/work/pulumi/pulumi/pkg/backend/snapshot.go:769 +0x239

Output of pulumi about

CLI
Version      3.135.1
Go Version   go1.23.2
Go Compiler  gc

Plugins
KIND      NAME          VERSION
resource  azure-native  2.56.0
resource  azuread       5.53.3
resource  command       0.11.1
language  dotnet        unknown

Host
OS       Microsoft Windows 11 Enterprise
Version  10.0.22631 Build 22631
Arch     x86_64

This project is written in dotnet: executable='C:\Program Files\dotnet\dotnet.exe' version='8.0.400'

Backend
Name           [redacted]
URL            azblob://stacks
User           [redacted]
Organizations
Token type     personal

Dependencies:
NAME                VERSION
Pulumi              3.66.1
Pulumi.AzureAD      5.53.3
Pulumi.AzureNative  2.56.0
Pulumi.Command      0.11.1

Additional context

This is an excerpt from the snapshot of the database in question. I'm not sure why [hyperpool-01] is a dependency when the poolId is [hyperpool-03], but I'm not confident I fully understand what I'm looking at here. Could it be as simple as replacing [hyperpool-01] here with [hyperpool-03]?

{
    "urn": "urn:pulumi:core.prod8::core::azure-native:sql/v20221101preview:Database::[database]",
    "custom": true,
    "id": "[redacted]",
    "type": "azure-native:sql/v20221101preview:Database",
    "inputs": {
        "databaseName": "[database]",
        ...
    },
    "outputs": {
        "elasticPoolId": "**[hyperpool-03]**",
        "highAvailabilityReplicaCount": 0,
        "id": "[redacted]",
        "type": "Microsoft.Sql/servers/databases",
        ...
    },
    "parent": "urn:pulumi:core.prod8::core::pulumi:pulumi:Stack::core-core.prod8",
    "dependencies": [
        "[resourceGroup]",
        "**[hyperpool-01]**",
        "[server]"
    ],
    "provider": "urn:pulumi:core.prod8::core::pulumi:providers:azure-native::default_2_56_0::c9c72609-7074-47de-a1ab-c85d32fb2537",
    "propertyDependencies": {
        "elasticPoolId": [
            "**[hyperpool-01]**"
        ],
        "resourceGroupName": [
            "[resourceGroup]"
        ],
        "serverName": [
            "[server]"
        ]
    },
    "created": "2024-06-17T16:41:26.4487862Z",
    "modified": "2024-10-07T23:48:23.5811128Z",
    "ignoreChanges": [
        "availabilityZone",
        "catalogCollation",
        "collation",
        "highAvailabilityReplicaCount",
        "isLedgerOn",
        "licenseType",
        "maintenanceConfigurationId",
        "readScale",
        "sku",
        "zoneRedundant",
        "minCapacity"
    ]
}

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

justinvp commented 1 week ago

Hi @ashlynshatos, thanks for opening the issue and really sorry for the trouble! We're looking into what caused this. We may have some follow-up questions for you, as we try to root cause the underlying problem.

In the meantime, we just shipped a new command in v3.136.0 that can be used to repair your stack. Please upgrade your Pulumi CLI and try running pulumi state repair.
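For anyone who finds this later, a rough sketch of that flow, assuming the stack name from the report above (check pulumi state repair --help in your CLI version for the exact flags):

    # upgrade the CLI to v3.136.0 or newer, then confirm
    pulumi version

    # repair the snapshot for the affected stack
    pulumi state repair --stack core.prod8

As I understand the new command, it verifies the stack's snapshot, reports the integrity problems it finds (such as a dependency appearing after its dependent), and proposes an edited snapshot for you to confirm before anything is written back.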

ashlynshatos commented 1 week ago

That CLI update was fortunate timing for me. I ran pulumi state repair and it got us unblocked. It moved the [hyperpool-01] pool up before everything that references it. While browsing the state file, I noticed that every database we've moved between these pools still has the original pool listed as a dependency. It seems like this didn't come up before only because they happened to be in an order that kept the dependency valid.
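For context, an abridged sketch of what that reordering looks like in the state file, with shortened placeholder URNs (the real entries carry many more fields). The resources array has to list a resource before anything that depends on it, which is the invariant the integrity check was failing on:

    Before repair (invalid) -- the database precedes the pool it depends on:

    "resources": [
        { "urn": "...:Database::[database]", "dependencies": [ "...:ElasticPool::[hyperpool-01]" ] },
        { "urn": "...:ElasticPool::[hyperpool-01]" }
    ]

    After repair (valid) -- the pool is moved up ahead of its dependents:

    "resources": [
        { "urn": "...:ElasticPool::[hyperpool-01]" },
        { "urn": "...:Database::[database]", "dependencies": [ "...:ElasticPool::[hyperpool-01]" ] }
    ]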

I'm happy to answer any follow-ups I can. Thanks for the help!

lunaris commented 1 week ago

Hi @ashlynshatos. Great news that state repair unblocked you! Based on your previous comment, I'm wondering if it's possible to reproduce this issue again -- it sounds like if you were to create two fresh pools, A and B, and move a database from A to B, we'd expect the database to end up with references to both A and B. Is that correct? Is this something you're able to test? If not, I'll see whether there's a way I can replicate the issue myself.

ashlynshatos commented 1 week ago

Hello! I was able to reproduce this today and narrowed it down to our specific (maybe questionable) workflow. For various reasons, our team will occasionally make changes like this one manually in Azure and only catch Pulumi up after the fact. Moving a database from pool A to pool B using only pulumi update worked as expected when I tested it. But running a pulumi refresh after making a manual change doesn't fully update the database's dependencies to reflect reality.

The steps look like this (a minimal program sketch follows the list):

  1. Create 2 pools A and B with a database in pool A using Pulumi
  2. In the Azure portal (or otherwise outside of Pulumi), move the database from A to B
  3. Update the Pulumi code to reflect the move
  4. Run pulumi refresh
  5. Observe the stack file. The database has inputs:elasticPoolId referencing pool B as expected, but both dependencies and propertyDependencies:elasticPoolId still reference pool A.
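For reference, here is a minimal sketch of the kind of program behind step 1, using the azure-native .NET SDK. Resource names, the server credentials, and the SKUs are placeholders rather than our real configuration (the real code also pins the sql/v20221101preview API version, elided here):

    using Pulumi;
    using AzureNative = Pulumi.AzureNative;

    return await Deployment.RunAsync(() =>
    {
        var resourceGroup = new AzureNative.Resources.ResourceGroup("rg");

        var server = new AzureNative.Sql.Server("server", new()
        {
            ResourceGroupName = resourceGroup.Name,
            AdministratorLogin = "sqladmin",                      // placeholder
            AdministratorLoginPassword = "<placeholder-secret>",  // placeholder
        });

        // Two elastic pools, A and B, on the same server.
        var poolA = new AzureNative.Sql.ElasticPool("pool-a", new()
        {
            ResourceGroupName = resourceGroup.Name,
            ServerName = server.Name,
            Sku = new AzureNative.Sql.Inputs.SkuArgs { Name = "StandardPool", Tier = "Standard", Capacity = 50 },
        });

        var poolB = new AzureNative.Sql.ElasticPool("pool-b", new()
        {
            ResourceGroupName = resourceGroup.Name,
            ServerName = server.Name,
            Sku = new AzureNative.Sql.Inputs.SkuArgs { Name = "StandardPool", Tier = "Standard", Capacity = 50 },
        });

        // The database starts in pool A. For step 3 this is switched to poolB.Id,
        // after the move has already been made in the Azure portal (step 2).
        var database = new AzureNative.Sql.Database("db", new()
        {
            ResourceGroupName = resourceGroup.Name,
            ServerName = server.Name,
            ElasticPoolId = poolA.Id,
        });
    });
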
lunaris commented 5 days ago

Hi @ashlynshatos, thanks for this. So, what you've written sounds OK thus far:

  1. You create 2 pools A and B, with a database D in pool A using Pulumi -- all good
    • At this point, Pulumi sees (as in, in its state) A and B, with D having a dependency on A
  2. Outside of Pulumi, you move D from A to B
    • At this point, Pulumi still sees in its state A and B, with D having a dependency on A
  3. You update your program to reflect the move
    • Since you have not run Pulumi, it still has state showing A and B, with D having a dependency on A
  4. Run pulumi refresh
    • pulumi refresh doesn't look at your program, since its job is to read changes in the provider (Azure) and reflect those in the state
    • After this, Pulumi's state has A and B, with D's pool property now pointing to B
    • However, Pulumi cannot infer from the refresh alone that D's dependency has also moved from A to B (although this is obvious to us). Dependencies are updated when the program runs, so for now these will stay the same in state (see the note after this list).
  5. Observe the stack file. It is now incorrect (with respect to your program), although I believe it's still valid
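To make that last point concrete, a sketch of what I'd expect (not verified against your exact stack): after the refresh, running an ordinary, untargeted update, e.g.

    pulumi up --stack core.prod8

re-runs the program, re-registers D with its current reference to B, and rewrites D's dependencies and propertyDependencies in the state accordingly, so the stale reference to A should disappear at that point. A targeted update that doesn't include D would leave them as they are.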

The question now is -- can you reproduce the error you had previously? It looks like you then went on to perform a targeted update on the original pool A, but when I do that locally I can't seem to reproduce the error. Additionally -- were any of the steps above performed on an older version of the Pulumi CLI? (E.g. maybe you performed the refresh locally, and you are running an older copy there?) It's possible that an older bug crept in as a result of that.

Thanks again for your patience helping us debug this!