Fix dirty collection attribute refresh

VenelinMartinov commented 2 weeks ago

This fixes an issue with empty collection attributes refreshing dirty under PlanResourceChange. Note that this is an issue present in TF as well. It comes from the fact that some (most?) providers don't distinguish between null and empty collection values, for example both AWS and GCP labels have this issue.

We've opted for a targeted fix in Refresh - this maintains our TF-conformance on Create/Update. The fix is that on refresh we go over the read values and compare with the previous state. Any properties which:

Are a TF attribute
Are a collection
Are null or empty
Are different to the previous state

get "normalized" to what was in the state. This avoids us displaying excessive dirty refreshes for these values.

Note that the actual fix is from https://github.com/pulumi/pulumi-terraform-bridge/pull/2073 and due to @t0yv0 A lot of the commit history here is from https://github.com/pulumi/pulumi-terraform-bridge/pull/2072 which separated the tests from here to facilitate easier reviews. I should have merged here before deleting 🤷

Fixes https://github.com/pulumi/pulumi-terraform-bridge/issues/2047 Fixes https://github.com/pulumi/pulumi-terraform-bridge/issues/1967

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 57.79817% with 46 lines in your changes missing coverage. Please review.

Project coverage is 61.22%. Comparing base (388299d) to head (5d60418). Report is 2 commits behind head on master.

Files	Patch %	Lines
pkg/tfshim/sdk-v2/provider2.go	49.18%	28 Missing and 3 partials :warning:
pkg/tfshim/sdk-v2/walk.go	68.75%	14 Missing and 1 partial :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #2065 +/- ## ========================================== + Coverage 61.21% 61.22% +0.01% ========================================== Files 339 341 +2 Lines 45140 45247 +107 ========================================== + Hits 27633 27704 +71 - Misses 15987 16020 +33 - Partials 1520 1523 +3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

VenelinMartinov commented 2 weeks ago

I've ran downstream tests here and they are good: https://github.com/pulumi/pulumi-terraform-bridge/issues/1967#issuecomment-2161096367

VenelinMartinov commented 2 weeks ago

Note https://github.com/pulumi/pulumi/pull/16146 is relevant here and is getting merged very soon - it might fix some of the dirty refreshes - should we wait for that and see if we still need to ship this?

VenelinMartinov commented 2 weeks ago

Found a regression caused by this change: https://github.com/pulumi/pulumi-gcp/issues/2079#issuecomment-2163296305

Without the change the Firewall resource refreshes dirty: gcp:compute:Firewall ts-firewall updated (0.48s) [diff: +sourceRanges,sourceServiceAccounts,targetServiceAccounts,targetTags]

But that's what happens in TF too:

  # google_compute_firewall.py_firewall has changed
  ~ resource "google_compute_firewall" "py_firewall" {
        id                      = "projects/pulumi-development/global/firewalls/py-firewall"
        name                    = "py-firewall"
      + source_ranges           = []
      + source_service_accounts = []
      + target_service_accounts = []
      + target_tags             = []
        # (10 unchanged attributes hidden)

        # (2 unchanged blocks hidden)
    }

With the change it still refreshes dirty but in a different parameter.

t0yv0 commented 2 weeks ago

The change intentionally deviates from TF behavior so that by itself is not a problem.

With the change it still refreshes dirty but in a different parameter.

Curious any more details here?

VenelinMartinov commented 2 weeks ago

https://github.com/pulumi/pulumi-gcp/issues/2079#issuecomment-2163296305 has the Read GRPC call. It's in the allows property, the member with "icmp" - ports changes from [] to nil in the Read.

ports there is not set on the inputs but is returned in the Create state - I think we override that.

Yeah:

```json { "method": "/pulumirpc.ResourceProvider/Create", "request": { "urn": "urn:pulumi:p-it-venelins-m-webserver-69654475::webserver::gcp:compute/firewall:Firewall::ts-firewall", "properties": { "__defaults": [ "name", "priority" ], "allows": [ { "__defaults": [], "protocol": "icmp" }, { "__defaults": [], "ports": [ "22", "80" ], "protocol": "tcp" } ], "name": "ts-firewall-d80a454", "network": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "priority": 1000, "project": "pulumi-development", "sourceTags": [ "foo" ] }, "preview": true }, "response": { "properties": { "__meta": "{\"_new_extra_shim\":{},\"e2bfb730-ecaa-11e6-8f88-34363bc7c4c0\":{\"create\":1200000000000,\"delete\":1200000000000,\"update\":1200000000000}}", "allows": [ { "ports": [ "22", "80" ], "protocol": "tcp" }, { "ports": [], "protocol": "icmp" } ], "creationTimestamp": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "denies": [], "description": null, "destinationRanges": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "direction": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "disabled": null, "enableLogging": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "id": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "logConfig": null, "name": "ts-firewall-d80a454", "network": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "priority": 1000, "project": "pulumi-development", "selfLink": "04da6b54-80e4-46f7-96ec-b56ff0331ba9", "sourceRanges": null, "sourceServiceAccounts": null, "sourceTags": [ "foo" ], "targetServiceAccounts": null, "targetTags": null } }, "metadata": { "kind": "resource", "mode": "client", "name": "gcp" } } ```

VenelinMartinov commented 2 weeks ago

I guess given enough provider examples any behaviour here will be wrong somewhere - I'll try to find the exact opposite behaviour of this, you must have hit it in AWS

t0yv0 commented 2 weeks ago

I'd like to debug the GCP example. Setting this up shortly.

VenelinMartinov commented 1 week ago

Ah, I ran it with --diff:

    ~ gcp:compute/firewall:Firewall: (update)
        [id=projects/pulumi-development/global/firewalls/py-firewall-639a397]
        [urn=urn:pulumi:dev::webserver-py::gcp:compute/firewall:Firewall::py-firewall]
        [provider=urn:pulumi:dev::webserver-py::pulumi:providers:gcp::default_7_25_0::28550d5a-7d52-492a-b894-a7f38e268c2b]
        --outputs:--
      ~ allows               : [
          ~ [0]: {
                  ~ ports   : [
                      + [0]: "22"
                      + [1]: "80"
                    ]
                  ~ protocol: "icmp" => "tcp"
                }
          ~ [1]: {
                  - ports   : [
                  -     [0]: "22"
                  -     [1]: "80"
                    ]
                  ~ protocol: "tcp" => "icmp"
                }
        ]
Resources:
    ~ 1 to update
    3 unchanged

This is likely a reordering issue, not a problem with the values.

t0yv0 commented 1 week ago

I've had a look and I know what's wrong with the GCP example.

The correction kicks in with this:

  REPLACING allow.$.ports FROM cty.ListValEmpty(cty.String) TO cty.NullVal(cty.List(cty.String))

But what's unexpected here is that the state value is actually cty.ListValEmpty(cty.String) so this should not apply.

How does this happen? Turns out "github.com/hashicorp/go-cty/cty" is buggy around sets:

cty.Path apply expects only ints or strings but cty.Transform injects entire set element values into IndexStep. Min repro:

func TestSimpleSet(t *testing.T) {
    vtop := cty.SetVal([]cty.Value{
        cty.ObjectVal(map[string]cty.Value{
            "foo": cty.StringVal("bar"),
        }),
    })

    cty.Transform(vtop, func(p cty.Path, v cty.Value) (cty.Value, error) {
        vv, err := p.Apply(vtop)
        if err != nil || vv.Equals(v).False() {
            t.Logf("err %v", err)
            t.Logf("path %v", p)
        }
        return v, nil
    })
}

As a result, looking for a matching state element under the path fails to find one, and assumes it is missing or explicit null.

This is not good enough:

    matchingOldStateValue := func(t cty.Type, p cty.Path) cty.Value {
        if v, err := p.Apply(oldState); err == nil {
            return v
        }
        return cty.NullVal(t)
    }

t0yv0 commented 1 week ago

https://github.com/zclconf/go-cty/issues/180 is where it's bottoming out. As a workaround we can skip paths that contain set elements and just bail for this transformation. We can bail on error from p.Apply. Or we can rewrite p.Apply into a "better" version.

t0yv0 commented 1 week ago

We're weighing now whether to include this change or not. This introduces an intentional deviation in Read results for Pulumi bridged providers vs TF baseline to compensate for old normalizeNullValues code in TF that intentionally confuses empty and null collections leading to dirty refresh in Pulumi. With this change the Read result is modified to match the state and avoid having a diff. Under the new CLI https://github.com/pulumi/pulumi/releases/tag/v3.120.0 some (or all?) these dirty refresh scenarios may refresh cleanly.

Should we still include?

t0yv0 commented 1 week ago

Reasons to include: the change is fairly surgical and well motivated, hedge against any subsequent changes in the CLI

Reasons not to include: any change carries some unforeseen risk, having an intentional difference from TF behavior creates more maintenance

Information that could help decide:

can we confirm that existing test cases pass on the latest CLI without the change?
are there any more refresh changes to try out? (@t0yv0 can scan through refresh tickets in AWS classic to try them)
https://github.com/pulumi/pulumi/pull/16146 only changes refresh, so any non-refresh uses are unchanged - user programs might depend on certain collections being non-nil and this PR maintains the state of these after a refresh.

t0yv0 commented 1 week ago

@mjeffryes sounds like you had some input here.

mjeffryes commented 1 week ago

Sure.

https://github.com/pulumi/pulumi/pull/16146 has landed and appears to address all our test failure for "dirty refresh" and should mitigate that problem for users. It does change the outputs in state, so that means that resources that depend on that output might try to update after a "clean" refresh (assuming the null/empty distinction matters to the dependent resource).
The normalization change we've prepared would completely hide when upstream returns a nil instead of empty or vice-versa, so it would suppress updates to dependent resources even if the difference between null and empty was somehow significant.
It's honestly not clear how often this difference in behavior is actually significant, but the behavior without normalization is both more transparent to the user and more consistent with the upstream TF provider. Hence, I'm inclined to try rolling out PRC without normalization and see if users encounter any real situations where they wish we would have normalized to suppress the update.

VenelinMartinov commented 1 week ago

Running RPC downstream tests without this PR here: https://github.com/pulumi/pulumi-terraform-bridge/issues/1962#issuecomment-2167990221

Will report back if any tests need this change.

EDIT: Quite a few failures in AWS related to tags: https://github.com/pulumi/pulumi-aws/actions/runs/9515746758/job/26232504775?pr=4078

Also failures in cloudflare related to tags: https://github.com/pulumi/pulumi-cloudflare/actions/runs/9515335359/job/26229764573?pr=813

EDIT2: The AWS failures were because it hasn't yet picked up the new CLI version. After testing in https://github.com/pulumi/pulumi-aws/pull/4081, it looks good.

VenelinMartinov commented 1 week ago

Also note that under the new refresh behaviour the tests fixed in this PR no longer fail without it: https://github.com/pulumi/pulumi-terraform-bridge/pull/2100. If we were to merge it we need to add test coverage for the fix.

EDIT: I'll try to pull the downstream test failures into tests here.

t0yv0 commented 1 week ago

Yes sounds like these things got fixed in the CLI so perhaps this PR is not necessary anymore. Is that too simplistic of a take?

VenelinMartinov commented 2 days ago

Closing this as the issues this meant to fix are fixed by the new CLI refresh.

PlanResourceChange passes downstream tests without this, so let's revisit if we see any issues related to collections.

pulumi / pulumi-terraform-bridge

Fix dirty collection attribute refresh #2065

Codecov Report