Fresh GCP GKE Clusters don't match the pulumi state

ToxicCobra commented 10 months ago

What happened?

First, here's a bit of context:

We use Pulumi with Python. To build our infrastructure as code, we first define a set of Python classes as "abstract objects", with various relations between them that encode our own logic. Inside some of those objects, we create actual GCP resources. For example, we have a Python class named KubernetesCluster with a number of variables and methods that manage a few different things in our infra. Each instance of KubernetesCluster creates a GCP GKE cluster using from pulumi_gcp import container, and we define the GKE cluster like this:

self.cluster = container.Cluster(
    name,
    name=name,
    project=project_id,
    location=region,
    network=network,
    subnetwork=subnet,
    initial_node_count=1,
    remove_default_node_pool=True,
    node_config={
        "loggingVariant": "DEFAULT",
        "preemptible": False,
        "spot": False
    },
    master_auth={
        "client_certificate_config": {
            "issue_client_certificate": True
        },
    },
    addons_config={
        "http_load_balancing": {"disabled": True},
        "horizontal_pod_autoscaling": {"disabled": True},
    },
    opts=pulumi.ResourceOptions(parent=self)
)

The variables used and their values aren't relevant; the point is that we create GCP resources using Pulumi in Python.

As you can tell, we delete the default node pool when we create a GKE cluster but we create other node pools later in the code, also under an "abstract python class" called NodePool.

The problem we're having is this:

We provision the entire infrastructure containing VPCs, Subnets, GKE Clusters with their Node pools and so on using pulumi up
Everything gets created successfully
If we immediately re-run pulumi up, the state already looks out of sync, and Pulumi wants to replace all the GKE clusters.

If I run that, it destroys the GKE clusters, re-creates new ones, but then the NodePools are out of sync. So because we specify to delete the default node pool, we have a bunch of clusters with 0 nodes.

Re-running pulumi up destroys and recreates empty GKE clusters again.

The only way I can get my infra created as intended is to pulumi destroy everything and then deploy everything from scratch again. It's a huge issue because the moment I make any modification to the code, even one unrelated to GKE clusters or node pools, Pulumi wants to wipe the clusters and replace them with empty ones.

Comparing what's in the Pulumi state with the changes that pulumi up proposes, no property differences show up anywhere.

Got any idea? Need any more info from my end? Please let me know!

Example

pulumi up deploys the full infrastructure from a fresh state.
Running pulumi up again, without any changes made anywhere, wants to destroy all the GKE clusters.

Output of `pulumi about`

CLI
Version      3.89.0
Go Version   go1.21.1
Go Compiler  gc

Plugins
NAME    VERSION
gcp     6.67.0
python  unknown

Host
OS       darwin
Version  14.0
Arch     arm64

Additional context

No response

ToxicCobra commented 10 months ago

Fun discovery: running pulumi refresh refreshes the state, and the next pulumi up no longer wants to wipe all the clusters... Not sure what caused the initial issue, but at least that works for now if anyone has a similar issue.

Just make sure you refresh the state before trying to deploy.
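
The refresh-then-deploy order described in this comment could be scripted, for example in a deploy wrapper. This is only a minimal sketch, assuming the pulumi CLI is on the PATH and the script runs from the project directory. The function name refresh_then_up and its injectable run parameter are hypothetical, introduced here for illustration:

```python
import subprocess

# Commands in the order described above: reconcile the state with what
# actually exists in GCP first, then deploy.
REFRESH_THEN_UP = [
    ["pulumi", "refresh", "--yes"],  # --yes skips the refresh confirmation
    ["pulumi", "up"],                # left interactive so the preview can be reviewed
]


def refresh_then_up(run=subprocess.run):
    """Run `pulumi refresh`, then `pulumi up`, stopping at the first failure.

    `run` is injectable so the sequence can be exercised without the CLI.
    """
    for command in REFRESH_THEN_UP:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so `pulumi up` never runs if the refresh step failed.
        run(command, check=True)
```

Leaving pulumi up interactive keeps the confirmation prompt in place, which matters here because an unexpected "replace" on a cluster is exactly what this issue is about.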

Frassle commented 10 months ago

This looks like a gcp diff bug so moving repos.

mikhailshilkov commented 10 months ago

Great to hear you've unblocked yourself by a refresh.

This looks similar to https://github.com/pulumi/pulumi-gcp/issues/744 which was closed by OP after applying an ignoreChanges workaround. You can probably do the same, but I'll leave the issue open to investigate if we can do better here.
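
For readers who want to try the ignoreChanges route, here is a hedged sketch of how it might look in Python. The thread does not say which property produces the spurious diff, so node_config below is only a guess and should be replaced with whatever the preview actually reports. As far as I understand the Pulumi documentation, names passed to ignore_changes are expected in camelCase. The helper names to_camel_case and ignore_changes_for are hypothetical:

```python
def to_camel_case(snake_name):
    """Convert a snake_case property name (as written in Python) to camelCase."""
    head, *rest = snake_name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def ignore_changes_for(*snake_names):
    """Build a list suitable for pulumi.ResourceOptions(ignore_changes=...)."""
    return [to_camel_case(name) for name in snake_names]


# ASSUMPTION: node_config is a guess at the property behind the spurious
# diff; the thread does not confirm which property is responsible.
IGNORED = ignore_changes_for("node_config")

# Hypothetical use inside the KubernetesCluster class from the original post:
#
#   opts=pulumi.ResourceOptions(parent=self, ignore_changes=IGNORED)
```

Note that ignoring a property also hides intentional edits to it, so this is a workaround rather than a fix.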

mjeffryes commented 6 days ago

I suspect this may be fixed with https://github.com/pulumi/pulumi-gcp/pull/2277; if you still encounter this issue please leave a comment so we can prioritize it.
