pulumi / pulumi-azure-native

Azure Native Provider
Apache License 2.0

Each update in AKS config will recreate the cluster #1808

Closed (houssems closed this issue 11 months ago)

houssems commented 2 years ago

What happened?

I've added scaling for a pool in AKS. When I run pulumi up, it recreates the cluster from scratch.

    ++azure-native:containerservice:ManagedCluster: (create-replacement)
        [id=/subscriptions/{subsid}/resourcegroups/{rgId}/providers/Microsoft.ContainerService/managedClusters/qa-odb-aks]
        [urn=urn:pulumi:odb_qa::aks::azure-native:containerservice:ManagedCluster::qa-odb-aks]
        [provider=urn:pulumi:odb_qa::aks::pulumi:providers:azure-native::default_1_64_0::8cd2140e-cbd6-4353-b546-29f72f80ff3b]
      ~ agentPoolProfiles: [
          ~ [1]: {
              + enableAutoScaling: true
              + maxPods          : 30
            }
        ]

Steps to reproduce

Update the cluster config with anything and the cluster gets recreated. In my case it was the scaling settings of a node pool.

    const managedCluster = new azurenative.containerservice.ManagedCluster(aksConfig.name, {
        addonProfiles: {
            omsAgent: {
                enabled: true,
                config: { logAnalyticsWorkspaceResourceID: workspace.id },
            },
        },
        nodeResourceGroup: rgNodeName,
        disableLocalAccounts: false,
        agentPoolProfiles: aksConfig.nodes.map(node => {
            node.vnetSubnetID = (vnet.subnets as any)[0].id;
            return node;
        }),
        dnsPrefix: aksConfig.dnsPrefix,
        enablePodSecurityPolicy: false,
        enableRBAC: true,
        identity: { type: azure_native.containerservice.ResourceIdentityType.SystemAssigned },
        kubernetesVersion: "",
        linuxProfile: {
            adminUsername: credentials.get(aksSshAdminUser) || 'azureuser',
            ssh: {
                publicKeys: [{
                    keyData: credentials.get(aksSshKeysPublicKey) || '',
                }],
            },
        },
        networkProfile: {
            networkPlugin: "azure",
            dnsServiceIP: aksConfig.networkProfile.dnsServiceIp,
            serviceCidr: aksConfig.networkProfile.serviceCidr,
            dockerBridgeCidr: aksConfig.networkProfile.dockerBridgeCidr,
            loadBalancerProfile: {
                managedOutboundIPs: { count: 1 },
            },
            loadBalancerSku: "standard",
            outboundType: "loadBalancer",
        },
        resourceGroupName: resourceGroup.name,
        resourceName: aksConfig.name,
        servicePrincipalProfile: {
            clientId: credentials.get(aksServicePrincipalId) || '',
            secret: credentials.get(aksServicePrincipalPassword),
        },
        sku: {
            name: "Basic",
            tier: "Free",
        },
        tags: context.getTags(),
    });

and the config file was changed from

    aks:
      dnsPrefix: qa-odb-aks
      loadbalancer:
        ip: 10.2.1.129
      name: qa-odb-aks
      networkProfile:
        dnsServiceIp: 10.2.2.254
        dockerBridgeCidr: 172.17.0.1/16
        serviceCidr: 10.2.2.0/24
      nodes:
        - count: 1
          enableNodePublicIP: false
          mode: System
          name: cpnodepool
          osType: Linux
          type: VirtualMachineScaleSets
          vmSize: Standard_F4s_v2
        - count: 1
          enableNodePublicIP: false
          mode: User
          name: envpool
          nodeLabels:
            environment: qa
          osType: Linux
          type: VirtualMachineScaleSets
          vmSize: Standard_F4s_v2
    vnet:
      addressPrefixes:
        - 10.2.0.0/16
      name: qa-odb-aks-vnet
      subnets:
        - addressPrefix: 10.2.1.0/24
          name: qa-odb-aks-subnet

to

    aks:
      dnsPrefix: qa-odb-aks
      loadbalancer:
        ip: 10.2.1.129
      name: qa-odb-aks
      networkProfile:
        dnsServiceIp: 10.2.2.254
        dockerBridgeCidr: 172.17.0.1/16
        serviceCidr: 10.2.2.0/24
      nodes:
        - count: 1
          enableNodePublicIP: false
          mode: System
          name: cpnodepool
          osType: Linux
          type: VirtualMachineScaleSets
          vmSize: Standard_F4s_v2
        - count: 1
          enableNodePublicIP: false
          mode: User
          name: envpool
          nodeLabels:
            environment: qa
          osType: Linux
          type: VirtualMachineScaleSets
          vmSize: Standard_F4s_v2
          enableAutoScaling: true
          minCount: 1
          maxCount: 5
          maxPods: 30
    vnet:
      addressPrefixes:
        - 10.2.0.0/16
      name: qa-odb-aks-vnet
      subnets:
        - addressPrefix: 10.2.1.0/24
          name: qa-odb-aks-subnet

Expected Behavior

Only the node pool config should be updated in Azure.

Actual Behavior

Cluster got deleted and recreated

Versions used

v3.34.1

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

lblackstone commented 2 years ago

Hi @houssems, could you add the code you used to create the cluster to make sure that we have accurate repro steps for this issue? Thanks!

houssems commented 2 years ago

Hi @lblackstone, I've updated the config files

Bj3MaS commented 2 years ago

Had a similar problem. I used this to add an additional agent pool: https://www.pulumi.com/registry/packages/azure-native/api-docs/containerservice/agentpool/. I think it's an ARM issue and not a Pulumi issue; I believe it stems from the time when AKS could only have one agent pool, but I'm not 100% certain.
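
For reference, creating a user pool with the standalone AgentPool resource could look roughly like the sketch below. This is not the commenter's actual code: it assumes the `resourceGroup` and `managedCluster` variables from the reporter's program above, and the pool name and sizing values are illustrative.

    import * as azure_native from "@pulumi/azure-native";

    // Sketch: manage a user node pool as its own resource instead of inline in
    // agentPoolProfiles, so pool changes don't touch the ManagedCluster resource.
    // Assumes `resourceGroup` and `managedCluster` exist as in the code above.
    const envPool = new azure_native.containerservice.AgentPool("envpool", {
        resourceGroupName: resourceGroup.name,
        resourceName: managedCluster.name,   // the AKS cluster this pool belongs to
        agentPoolName: "envpool",
        mode: "User",
        count: 1,
        vmSize: "Standard_F4s_v2",
        osType: "Linux",
        type: "VirtualMachineScaleSets",
        enableAutoScaling: true,
        minCount: 1,
        maxCount: 5,
        maxPods: 30,
    });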

ArgTang commented 2 years ago

@lblackstone we have the same issue

If you create an AKS cluster with the ManagedCluster resource and change elements in the agentPoolProfiles, Pulumi wants to recreate the cluster.

For any change where Azure says you have to recreate the node pool, Pulumi wants to recreate the whole cluster. IMO Pulumi should not recreate the cluster on node pool changes.

Another note: if you add a new node pool, e.g. newnodepool, and remove the config for oldnodepool, Pulumi will not detect this change correctly. Instead of deleting oldnodepool and creating newnodepool, it will try to update oldnodepool with newnodepool's settings. My guess is that Pulumi diffs the array without taking the node pool name into account; keying the diff on the name would be the better behaviour.
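
To illustrate the suspected behaviour (a hand-written sketch, not actual provider output or code):

    // Old program: [cpnodepool, oldnodepool]; new program: [cpnodepool, newnodepool].
    const oldProfiles = [
        { name: "cpnodepool", mode: "System", vmSize: "Standard_F4s_v2" },
        { name: "oldnodepool", mode: "User", vmSize: "Standard_F4s_v2" },
    ];
    const newProfiles = [
        { name: "cpnodepool", mode: "System", vmSize: "Standard_F4s_v2" },
        { name: "newnodepool", mode: "User", vmSize: "Standard_F4s_v2" },
    ];
    // A diff keyed by array index compares oldProfiles[1] with newProfiles[1] and
    // reports an in-place update of oldnodepool (renaming it to newnodepool),
    // instead of deleting oldnodepool and creating newnodepool. A diff keyed by
    // the `name` property would pair the entries correctly.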

versions: pulumi-cli v3.37.2

  <ItemGroup>
    <PackageReference Include="Pulumi" Version="3.*" />
    <PackageReference Include="Pulumi.Azure" Version="5.14.0" />
    <PackageReference Include="Pulumi.AzureAD" Version="5.26.1" />
    <PackageReference Include="Pulumi.AzureNative" Version="1.*" />
    <PackageReference Include="Pulumi.Kubernetes" Version="3.20.2" />
    <PackageReference Include="Pulumi.Tls" Version="4.6.0" />
  </ItemGroup>
danielbichuetti commented 1 year ago

We faced similar issues with AKS recreation when we refreshed the cluster state. Azure itself reorganizes the state of some properties, including the subnets, if you declare them inline in the AKS template, and Pulumi then tries to recreate the whole cluster. This also affects some agent pools, since Azure Spot pools are forced by the internal Azure system to be tagged in a special way.

Another common issue is adding node pools. When you refresh the base cluster setup, Pulumi makes the wrong correlation between the node pool used during cluster creation and any of the new ones. This leads to a node pool replacement, and then a deletion.

danielrbradley commented 11 months ago

My initial reading of this issue thread indicates there might be two separate problems being encountered:

  1. A property diff for maxPods within agentPoolProfiles, which is marked as forceNew: true, so changing maxPods forces replacement of the whole cluster.
  2. The ordering of agentPoolProfiles changing during a refresh, since the collection isn't stored as ordered in Azure but is really a set.

Problem 1 - updating maxPods

This appears to be behaving as defined by the documentation:

If you want to change various immutable settings on existing node pools, you can create new node pools to replace them. One example is to add a new node pool with a new maxPods setting and delete the old node pool.

Therefore, as it stands, this appears to be working correctly, since attempting to update this number in place would result in a failed update.

While technically correct, the non-ideal behaviour here is that we don't actually need to re-create the whole cluster; we could just recreate the node pool instead. This would happen if all node pools were modelled as independent resources, but in this instance system node pools cannot be managed that way.

Semi-Manual Solution

We should ensure that users can:

  1. Add a new profile to agentPoolProfiles with their new desired settings.
  2. Remove the old profile once the new profile has been adopted.

The simplest way of ensuring this might be to disable the forceNew: true on all agentPoolProfiles fields.
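
For illustration, the two steps above could be expressed in the reporter's TypeScript program roughly as follows. The pool names and settings are placeholders, and the sketch assumes the provider no longer forces cluster replacement on agentPoolProfiles changes.

    // Hedged sketch of rotating a node pool inside agentPoolProfiles.
    const systemPool = { name: "cpnodepool", mode: "System", count: 1, vmSize: "Standard_F4s_v2" };
    const oldUserPool = { name: "envpool", mode: "User", count: 1, vmSize: "Standard_F4s_v2" };
    const newUserPool = { ...oldUserPool, name: "envpool2", maxPods: 30 };

    // Step 1: run `pulumi up` with both the old and the new user pool present.
    const step1Profiles = [systemPool, oldUserPool, newUserPool];

    // Step 2: once workloads have moved to envpool2, remove envpool and run `pulumi up` again.
    const step2Profiles = [systemPool, newUserPool];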

Theoretically Perfect Solution

Provider changes:

  1. Remove the forceNew: true on all ManagedClusterAgentPoolProfile properties.
  2. Intercept the Update call if agentPoolProfiles has changed (see the sketch after this list):
    1. Deploy with a union of the new and old profiles.
    2. Deploy with only the new profiles to remove the old profiles.
    3. Continue with update as normal.
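
Conceptually, the interception in step 2 might sequence the Azure API calls as in the sketch below. This is illustrative TypeScript rather than the provider's actual implementation, and putManagedCluster is a made-up stand-in for the PUT against the managed cluster.

    interface PoolProfile { name: string; [key: string]: unknown; }

    // Two-phase update of agentPoolProfiles: first deploy the union of old and new
    // profiles so replacement pools exist before their predecessors are removed,
    // then deploy only the new profiles to drop the retired pools.
    async function updateAgentPoolProfiles(
        putManagedCluster: (profiles: PoolProfile[]) => Promise<void>,
        oldProfiles: PoolProfile[],
        newProfiles: PoolProfile[],
    ): Promise<void> {
        const byName = new Map<string, PoolProfile>();
        for (const p of [...oldProfiles, ...newProfiles]) {
            byName.set(p.name, p);   // new definitions win for pools kept by name
        }
        await putManagedCluster([...byName.values()]);  // phase 1: union of old and new
        await putManagedCluster(newProfiles);           // phase 2: new profiles only
    }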

Open Questions:

  1. Are there any other resources with a similar pattern of nested properties which can be independently rotated?
  2. Are there any additional mechanisms required to wait for the profile to migrate over?
  3. Can we do the addition of the new profile and the removal of the old profile in a single step, or do both need to exist together for the migration to work?
  4. How do we manage the names of these sub-resources dynamically? Do we need a variation of auto-naming to always add a random suffix? How do we make these backward compatible and allow users to override auto-naming? Do we just add an incrementing version number suffix?

Problem 2 - agentPoolProfiles ordering

The specification doesn't appear to give any hint on which field is the unique identifier, so we can infer that the array should be treated as a set.

On the agentPoolProfiles property the spec defines "x-ms-identifiers": [] - it's missing name as the identifier.

Reference: spec for x-ms-identifiers

The clearest route here would be to:

  1. Add support for x-ms-identifiers in our metadata & diffing strategy.
  2. Raise an issue with Microsoft to annotate this property correctly.
danielrbradley commented 11 months ago

Since we've disabled the forced recreation of the AKS cluster on agentPoolProfiles changes in #2774, the core of this issue is now resolved. I've logged a follow-up issue to also improve how we handle diffing arrays where there's a known key for items:

I've also logged a separate issue to investigate re-enabling sub-property resource recreation correctly: