pulumi / pulumi-azure-native

Azure Native Provider
Apache License 2.0

AKS Container Service upgrades not applying to agentpool #2094

Closed caboog closed 7 months ago

caboog commented 1 year ago

What happened?

Upgrading the Kubernetes version on a ManagedCluster instance applies to the cluster, but does not update the existing node pools. Changing the node pool's orchestratorVersion to match the cluster's Kubernetes version causes an error:

error: Code="NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged" Message="Using managed cluster api, all Agent pools' OrchestratorVersion must be all specified or all unspecified. If all specified, they must be stay unchanged or the same with control plane. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api"

Steps to reproduce

  1. Create a ManagedCluster instance with a Kubernetes version older than the latest (1.22.15 in my case) and a node pool (a minimal sketch of such a program follows below)
  2. Update the cluster to a newer Kubernetes version (I used 1.23.8) and run pulumi up
  3. Check the version on the running nodes
  4. Update the node pool's orchestratorVersion to match the new cluster version and run pulumi up
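
For reference, a minimal TypeScript sketch of this sequence (resource names, node sizes, and the identity setup are illustrative, not the exact program from this report):

import * as azure_native from "@pulumi/azure-native";

const resourceGroup = new azure_native.resources.ResourceGroup("repro-rg");

// Step 1: cluster and node pool start on the older version (1.22.15).
// Step 2: change kubernetesVersion to "1.23.8" and run pulumi up.
// Step 4: change orchestratorVersion to "1.23.8" as well and run pulumi up again,
//         which is where NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged appears.
const cluster = new azure_native.containerservice.ManagedCluster("repro-cluster", {
    resourceGroupName: resourceGroup.name,
    dnsPrefix: "repro-cluster",
    kubernetesVersion: "1.22.15",
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "agentpool",
        count: 1,
        mode: "System",
        vmSize: "Standard_B2ms",
        osType: "Linux",
        type: "VirtualMachineScaleSets",
        orchestratorVersion: "1.22.15",
    }],
});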

Expected Behavior

Nodes get recreated with the new version after running pulumi up.

Actual Behavior

Nodes are untouched

Output of pulumi about

pulumi about
CLI
Version      3.46.0
Go Version   go1.19.2
Go Compiler  gc

Plugins
NAME          VERSION
azure-native  1.85.0
nodejs        unknown

Host
OS       darwin
Version  12.6.1
Arch     arm64

This project is written in nodejs: executable='/opt/homebrew/bin/node' version='v19.0.1'

Current Stack: dev

TYPE                                          URN
pulumi:pulumi:Stack                           urn:pulumi:dev::aks-nodepool-update::pulumi:pulumi:Stack::aks-nodepool-update-dev
pulumi:providers:azure-native                 urn:pulumi:dev::aks-nodepool-update::pulumi:providers:azure-native::default_1_85_0
azure-native:resources:ResourceGroup          urn:pulumi:dev::aks-nodepool-update::azure-native:resources:ResourceGroup::boog
azure-native:containerservice:ManagedCluster  urn:pulumi:dev::aks-nodepool-update::azure-native:containerservice:ManagedCluster::boog-test

Found no pending operations associated with dev

Backend
Name           pulumi.com
URL            https://app.pulumi.com/caboog
User           caboog
Organizations  caboog, pulumi

Dependencies:
NAME                  VERSION
@pulumi/azure-native  1.85.0
@pulumi/pulumi        3.46.1

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

danielrbradley commented 1 year ago

@caboog does the behaviour work correctly when making the same change via the console?

ArgTang commented 1 year ago

@danielrbradley we have the same situation. We need to use newer APIs to leverage workload identity.

Using Pulumi.AzureNative.ContainerService => pulumi up works fine.
Using Pulumi.AzureNative.ContainerService.V20220901 and no other changes => NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged:

    ~ azure-native:containerservice/v20220901:ManagedCluster: (update)
        [id=/subscriptions/aaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa/resourcegroups/resourcegroup/providers/Microsoft.ContainerService/managedClusters/cluster]
        [urn=urn:pulumi:test::project::azure-native:containerservice/v20220901:ManagedCluster::cluster]
        [provider=urn:pulumi:test::edgecloud::pulumi:providers:azure-native::default_1_87_0::cea81ccd-462b-4405-a208-942e4d27afbd]
      ~ networkProfile: {
          ~ loadBalancerProfile: {
              ~ managedOutboundIPs: {
                  + countIPv6: 0
                }
            }
        }

  azure-native:containerservice/v20220901:ManagedCluster (aksCluster):
    error: Code="NotAllAgentPoolOrchestratorVersionSpecifiedAndUnchanged" Message="Using managed cluster api, all Agent pools' OrchestratorVersion must be all specified or all unspecified. If all specified, they must be stay unchanged or the same with control plane. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rified. If all specified, they must be stay uncest-api"

We have tried:

  - Specifying all versions explicitly.
  - Setting the managed cluster's options to new CustomResourceOptions { IgnoreChanges = new List<string> { "agentPoolProfiles" } }.
  - Updating to the latest NuGet/Pulumi: 1.87.0 / v3.48.0.
  - Updating the node pool to match the control plane before the change (current k8s version is 1.23.12).
  - Using a lower API version from earlier in 2022.

ArgTang commented 1 year ago

After a lot of time, I checked the Pulumi state file. Since we have IgnoreChanges = new List<string> { "agentPoolProfiles" }, the ManagedCluster state is not refreshed on each run, so there was a mismatch between the Pulumi state and the Azure state. Once I manually fixed the state, it worked again. This is only a cumbersome workaround. Maybe Pulumi could take the source (cloud) values for ignored properties to correct its state?

JonAStorelv commented 1 year ago

Any update on this bug?

danielrbradley commented 1 year ago

Unfortunately we haven't yet pinned down the root cause in order to address a fix. I do, however, have a hypothesis as to the cause which someone might be able to validate: when upgrading the k8s version of the ManagedCluster, the upgrade implicitly changes the agent pools' OrchestratorVersion, which is then out of sync with Pulumi's state. As @ArgTang mentioned, fixing the state then resolves the issue.

  1. Rather than editing the state manually, does using pulumi refresh pull in the updated orchestrator version value to the state and resolve the issue, similar to what @ArgTang mentioned?
  2. It would be helpful to have a full code sample which reproduces this issue so we can ensure we're reproducing it in exactly the same way, including any required related resources (e.g. role or identity setup).
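
As a small illustration of the hypothesis above (a sketch, not an official recommendation): if the control plane and the agent pools take their version from a single config value, an upgrade changes both in the same pulumi up, so the program itself cannot declare diverging versions. Names and sizes below are placeholders.

import * as pulumi from "@pulumi/pulumi";
import * as azure_native from "@pulumi/azure-native";

const config = new pulumi.Config();
// Single source of truth for the Kubernetes version; bump this value to upgrade
// the control plane and the node pool together.
const k8sVersion = config.require("kubernetesVersion");

const resourceGroup = new azure_native.resources.ResourceGroup("example-rg");

const cluster = new azure_native.containerservice.ManagedCluster("example-aks", {
    resourceGroupName: resourceGroup.name,
    dnsPrefix: "example-aks",
    kubernetesVersion: k8sVersion,
    identity: { type: "SystemAssigned" },
    agentPoolProfiles: [{
        name: "agentpool",
        count: 1,
        mode: "System",
        vmSize: "Standard_B2ms",
        osType: "Linux",
        type: "VirtualMachineScaleSets",
        orchestratorVersion: k8sVersion,
    }],
});
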
JonAStorelv commented 1 year ago

This is the code used for the managed cluster:

const managedCluster = new azure_native.containerservice.ManagedCluster("managedCluster", {
            aadProfile: {
                enableAzureRBAC: false,
                managed: true,
            },
            resourceGroupName: resgrpName,
            resourceName: `${nameprefix}-aks`,
            addonProfiles: {
                "omsagent": {
                    enabled: true,
                    config: {
                        logAnalyticsWorkspaceResourceID: WrkspID
                    },
                },
                "azurepolicy": {
                    enabled: true,
                },
                "ingressApplicationGateway": {
                    enabled: true,
                    config: {
                        applicationGatewayId: appgwid
                    }
                },
            },
            agentPoolProfiles: [
                {
                    count: 3,
                    maxPods: 20,
                    mode: "System",
                    name: "nodepool1",
                    osType: "Linux",
                    type: "VirtualMachineScaleSets",
                    vmSize: pulumiConfig.require("vmSize"),
                    vnetSubnetID: subnet.id,
                }
            ],
            enableRBAC: true,
            dnsPrefix: "AzureNativeprovider",
            kubernetesVersion: kubernetesConfig.require("kubernetesVersion"),
            servicePrincipalProfile: {
                clientId: accessApplication.applicationId,
                secret: servicePrincipalPassword.value
            },
            networkProfile: {
                networkPlugin: "azure",
                networkPolicy: "calico",
                serviceCidr: "10.10.0.0/16",
                dnsServiceIP: "10.10.0.10",
                dockerBridgeCidr: "172.17.0.1/16",
                loadBalancerProfile: {
                    outboundIPs: {
                        publicIPs: [{
                            id: pip.id
                        }],
                    }
                },
            },
        },
            {
                dependsOn: resourceGroupAssignment,
                deleteBeforeReplace: true,
                parent: this
            });
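
As a side note on the error text itself: it points at per-agent-pool operations, which azure-native exposes as a standalone containerservice.AgentPool resource. Below is a sketch of managing the pool that way, reusing the names from the snippet above; it is an illustration, not a confirmed fix for this issue, and the inline agentPoolProfiles entry would normally be removed or ignored when taking this approach.

// Hypothetical: manage "nodepool1" as its own resource so its orchestratorVersion
// can be upgraded independently of the ManagedCluster's kubernetesVersion.
const nodePool = new azure_native.containerservice.AgentPool("nodepool1", {
    resourceGroupName: resgrpName,
    resourceName: managedCluster.name,
    agentPoolName: "nodepool1",
    count: 3,
    maxPods: 20,
    mode: "System",
    osType: "Linux",
    type: "VirtualMachineScaleSets",
    vmSize: pulumiConfig.require("vmSize"),
    vnetSubnetID: subnet.id,
    orchestratorVersion: kubernetesConfig.require("kubernetesVersion"),
});
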
JonAStorelv commented 1 year ago

Example after updating the cluster through Pulumi: (screenshots attached to the original comment)

dirien commented 1 year ago

Ran into this issue too.

This is my test cluster code:

import * as resources from "@pulumi/azure-native/resources";
import * as cluster from "@pulumi/azure-native/containerservice";

const resourceGroup = new resources.ResourceGroup("my-resource-group");

const kubernetesVersion = "1.25.6"
const agentKubernetesVersion = "1.24.10"

const managedCluster = new cluster.ManagedCluster("my-cluster", {
    kubernetesVersion: kubernetesVersion,
    location: resourceGroup.location,
    resourceGroupName: resourceGroup.name,
    resourceName: "my-cluster",
    nodeResourceGroup: "my-cluster-nodes",
    dnsPrefix: resourceGroup.name,
    identity: {
        type: "SystemAssigned",
    },
    networkProfile: {
        networkPlugin: "azure",
        networkPolicy: "calico",
    },
    oidcIssuerProfile: {
        enabled: true,
    },
    agentPoolProfiles: [{
        name: "agentpool",
        count: 3,
        vmSize: "Standard_B2ms",
        osType: "Linux",
        osDiskSizeGB: 30,
        type: "VirtualMachineScaleSets",
        mode: "System",
        orchestratorVersion: agentKubernetesVersion,
    }, {
        name: "workload1",
        count: 3,
        vmSize: "Standard_B2ms",
        osType: "Linux",
        osDiskSizeGB: 30,
        type: "VirtualMachineScaleSets",
        mode: "",
        orchestratorVersion: agentKubernetesVersion,
    }],
});

I went on to try this with Bicep, and I can get it working there. You only need to put the upgrade into an extra file and run it.

Here is the Gist with the code: https://gist.github.com/dirien/b711a14e6c4fe89c86153b3eb99a0807

thomas11 commented 7 months ago

I wrote a full e2e test for this issue in #3078. It passes, though, meaning I cannot reproduce the issue. This is likely because the issue and all previous comments were submitted before azure-native v2 came out; the v2 release updated the Azure API versions we use.

Interested parties can check out my PR - first examples/azure-native-sdk-v2/go-aks/main.go, then examples/azure-native-sdk-v2/go-aks/step2, then examples/azure-native-sdk-v2/go-aks/step3 - and double-check if it replicates the scenario of this bug report. I'll close this issue for now, but feel free to re-open.