Open kralikba opened 3 years ago
Hello,
When the set of agent pools in ManagedCluster.AgentPoolProfiles is changed, the provider sends an incorrect request to the Azure API instead of a set of requests.
@kralikba Apologies for the lack of a response; this issue slipped through. Do you still have this problem? If so, could you share your code and repro steps?
In general, any update to a resource results in a single PUT request; that's how the provider works (similarly to ARM templates).
Hello,
We have decided not to use AKS, so I can't report on this. Maybe this behaviour cannot be correctly described using the schemas used to generate the provider?
I believe I'm not wrong in generally expecting that a code change followed by running pulumi up results in a configuration of my cloud resources matching the one described in the new code, whatever the required concrete steps are.
Yes, this is the idea.
As it's not a problem for you anymore, I'll close the issue for now. If anyone is facing a similar problem and has a repro, please comment below.
I get this same issue when I try to rename an entry in agentPoolProfiles.
Using Azure Native 1.9.0.
cluster:
const managedCluster = new azure_native.containerservice.ManagedCluster("managedCluster", {
    resourceGroupName: resourceGroup.name,
    resourceName: "resourceName",
    addonProfiles: {},
    agentPoolProfiles: [{
        count: 3,
        maxPods: 110,
        mode: "System",
        name: "someName", // <-- changing this
        osDiskSizeGB: 30,
        osType: "Linux",
        type: "VirtualMachineScaleSets",
        vmSize: "Standard_D4_v3",
        vnetSubnetID: subnet.id,
    }],
    dnsPrefix: "AzureNativeprovider",
    enableRBAC: true,
    kubernetesVersion: "1.19.7",
    servicePrincipalProfile: {
        clientId: application.applicationId,
        secret: servicePrincipalPassword.value,
    },
    networkProfile: {
        networkPlugin: "azure",
        serviceCidr: "10.10.0.0/16",
        dnsServiceIP: "10.10.0.10",
        dockerBridgeCidr: "172.17.0.1/16",
    },
}, {
    dependsOn: subnetAssignment,
    deleteBeforeReplace: true,
    parent: this,
});
Error message:
`Code="BadRequest" Message="A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api" Target="agentPoolProfiles"`
@Bj3MaS yeah I think it's not directly possible to rename the system pool like this. @viveklak do you know a clear workaround?
There is a general issue with ARM and node pools.
At the AKS level, there are system pools and user pools. As long as there is at least one system pool, you can add/remove them at will.
Then there are the ARM resources: ManagedCluster has a property agentPoolProfiles, and you can also add a standalone pool with AgentPool.
Now the fun begins. If you do create a new AgentPool, the ManagedCluster will detect a change to the property agentPoolProfiles at the next refresh.
On the other hand, if you want to add a new pool to an existing cluster, you can't just edit ManagedCluster.agentPoolProfiles, because then you get the error
error: Code="BadRequest" Message="A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api" Target="agentPoolProfiles"
So the workaround is to add ignoreChanges: ["agentPoolProfiles"] to ManagedCluster, and probably just do the node pool operations in the Azure Portal.
@0x53A I think this deserves a separate issue. It's a similar kind of issue as https://github.com/pulumi/pulumi-azure-native/issues/611 but for different resources.
Reopening as it looks like there is a repro for this issue and it's still likely occurring.
While this pair of resources does follow similar nesting to #611, the behaviour observed is very different. Based on the error message:
A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api
and the documentation note on this page:
This feature enables more control over creating and managing multiple node pools and requires separate commands for create/update/delete operations. Previously, cluster operations through az aks create or az aks update used the managedCluster API and were the only options to change your control plane and a single node pool. This feature exposes a separate operation set for agent pools through the agentPool API and requires use of the az aks nodepool command set to execute operations on an individual node pool.
my reading of this is that only the first agent pool profile is managed via the parent ManagedCluster. All additional agent pool profiles must be managed as standalone resources.
This would explain the observed error: when the name is changed in the array of agentPoolProfiles, Azure interprets this as the creation of a new profile and the deletion of the old one, which then results in the error about not being able to create additional profiles.
This is a rather odd limitation of the API design so we might not be able to completely hide this implementation detail. However, I'll leave this open as perhaps we can find a way of improving the usability or error message here to help explain the failure to users.
As a next step, it would be good to have additional evidence which either proves the hypothesis that only the first profile is manageable via the ManagedCluster, or shows that we've not yet fully understood the behaviour here.
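To make that hypothesis concrete, here is a rough sketch (not a verified fix) of what split management would look like: the initial system pool stays inline on the ManagedCluster, and any additional pool is created through the standalone AgentPool resource. The names resource_group and cluster are assumed to exist already:
from pulumi_azure_native import containerservice

# Assumption for illustration: `cluster` is a ManagedCluster that declares only
# its initial system pool inline. Additional pools are separate resources.
extra_pool = containerservice.AgentPool(
    "workloads-pool",
    resource_group_name=resource_group.name,
    resource_name_=cluster.name,   # the cluster this pool attaches to
    agent_pool_name="workloads",
    count=3,
    mode="User",
    os_disk_size_gb=30,
    os_type="Linux",
    type="VirtualMachineScaleSets",
    vm_size="Standard_DS2_v2",
)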
Here's the current situation as I understand it.
I've defined a cluster like so:
cluster = containerservice.ManagedCluster(
    "cluster",
    resource_group_name=resource_group.name,
    agent_pool_profiles=[
        containerservice.ManagedClusterAgentPoolProfileArgs(
            count=3,
            mode="System",
            name="system",
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
        ),
        containerservice.ManagedClusterAgentPoolProfileArgs(
            count=3,
            mode="User",
            name="workloads",
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
        )
    ],
    enable_rbac=True,
    linux_profile=containerservice.ContainerServiceLinuxProfileArgs(
        admin_username="lbriggs",
        ssh=containerservice.ContainerServiceSshConfigurationArgs(
            public_keys=[
                {
                    "key_data": ssh_key.public_key_openssh,
                }
            ]
        ),
    ),
    dns_prefix=resource_group.name,
    service_principal_profile=containerservice.ManagedClusterServicePrincipalProfileArgs(
        client_id=ad_app.application_id,
        secret=ad_sp_password.value,
    )
)
If I remove the user agent pool, I get a diff and it deletes the pool successfully.
If I modify the user pool VM size, it causes a replace of the whole cluster.
If I try to add a new agent pool, like so:
cluster = containerservice.ManagedCluster(
    "cluster",
    resource_group_name=resource_group.name,
    agent_pool_profiles=[
        containerservice.ManagedClusterAgentPoolProfileArgs(
            count=3,
            mode="System",
            name="system",
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
        ),
        containerservice.ManagedClusterAgentPoolProfileArgs(
            count=3,
            mode="User",
            name="workloads",
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
        ),
        containerservice.ManagedClusterAgentPoolProfileArgs(
            count=3,
            mode="User",
            name="newworkloads",
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
        )
    ],
    enable_rbac=True,
    linux_profile=containerservice.ContainerServiceLinuxProfileArgs(
        admin_username="lbriggs",
        ssh=containerservice.ContainerServiceSshConfigurationArgs(
            public_keys=[
                {
                    "key_data": ssh_key.public_key_openssh,
                }
            ]
        ),
    ),
    dns_prefix=resource_group.name,
    service_principal_profile=containerservice.ManagedClusterServicePrincipalProfileArgs(
        client_id=ad_app.application_id,
        secret=ad_sp_password.value,
    )
)
I get the error above.
I do think the only way to do this is to specify only a system node pool inline.
This appears to be generally working as designed by Azure.
The AKS resource lifecycle isn't an ideal pattern for being managed declaratively, as migrating from an initial system node group to a new system node group must be managed somewhat imperatively, then re-synchronised back to code.
There are some oddities around how refresh works - it appears the provider might be actively removing parts of the diff.
Relevant discussion of the same topic relating to ARM templates: Add nodepool to an existing AKS cluster via ARM template
This is also why we've disabled forced recreation of the cluster on these sub-property changes:
Here's the base program - similar to the one by @jaxxstorm ... filling in the missing dependencies:
import pulumi
from pulumi_azure_native import resources, containerservice
from pulumi_tls import PrivateKey
from pulumi_azuread import Application, ServicePrincipal, ServicePrincipalPassword
# Create a new resource group
resource_group = resources.ResourceGroup('rg')
ssh_key = PrivateKey("ssh-key", algorithm="RSA", rsa_bits=4096)
ad_app = Application("ad-app", display_name="ad-app")
ad_sp = ServicePrincipal("ad-sp", application_id=ad_app.application_id)
ad_sp_password = ServicePrincipalPassword("ad-sp-password", service_principal_id=ad_sp.id,
                                          end_date_relative="8760h")
cluster = containerservice.ManagedCluster(
    "cluster",
    resource_group_name=resource_group.name,
    agent_pool_profiles=[
        containerservice.ManagedClusterAgentPoolProfileArgs(
            count=3,
            mode="System",
            name="system",
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
        ),
    ],
    enable_rbac=True,
    linux_profile=containerservice.ContainerServiceLinuxProfileArgs(
        admin_username="admin",
        ssh=containerservice.ContainerServiceSshConfigurationArgs(
            public_keys=[
                {
                    "key_data": ssh_key.public_key_openssh,
                }
            ]
        ),
    ),
    dns_prefix=resource_group.name,
    service_principal_profile=containerservice.ManagedClusterServicePrincipalProfileArgs(
        client_id=ad_app.application_id,
        secret=ad_sp_password.value,
    )
)
requirements.txt:
pulumi>=3.0.0,<4.0.0
pulumi-azure-native>=2.0.0,<3.0.0
pulumi-tls>=4.0.0,<5.0.0
pulumi-azuread>=5.0.0,<6.0.0
Initial deployment with the single system node group succeeded.
Note: I had to retry the ManagedCluster deployment after an error stating "service principal is not found". This appears to be because Azure's AD has some form of eventual consistency and the create request returns before the SP has fully replicated.
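As an aside, one approach that I believe sidesteps this propagation race entirely, where the scenario allows it, is to use a system-assigned managed identity instead of a service principal. This is a hedged suggestion, not something verified as part of this repro:
from pulumi_azure_native import containerservice

# Hedged alternative to the service_principal_profile above: a system-assigned
# managed identity avoids creating (and waiting on) a new Azure AD application.
identity = containerservice.ManagedClusterIdentityArgs(type="SystemAssigned")
# Pass `identity=identity` to containerservice.ManagedCluster(...) and omit
# service_principal_profile.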
Reproducing the reported issue above:
~ agentPoolProfiles: [
+ [1]: {
+ count : 3
+ mode : "User"
+ name : "workloads"
+ osDiskSizeGB: 30
+ osType : "Linux"
+ type : "VirtualMachineScaleSets"
+ vmSize : "Standard_DS2_v2"
}
]
This does not work and results in the same error message as above:
Code="BadRequest" Message="A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api" Target="agentPoolProfiles"
Next I wanted to see if it was only User mode pools which can't be added inline, so I attempted to add a second system pool using the inline property:
containerservice.ManagedClusterAgentPoolProfileArgs(
    count=3,
    mode="System",
    name="system2",
    os_disk_size_gb=30,
    os_type="Linux",
    type="VirtualMachineScaleSets",
    vm_size="Standard_DS2_v2",
)
~ agentPoolProfiles: [
+ [1]: {
+ count : 3
+ mode : "System"
+ name : "system2"
+ osDiskSizeGB: 30
+ osType : "Linux"
+ type : "VirtualMachineScaleSets"
+ vmSize : "Standard_DS2_v2"
}
]
This fails with the same error; therefore we can't create any pools inline after the initial creation. Next, I created the second system pool as a standalone AgentPool resource instead:
containerservice.AgentPool("cluster-agent-pool",
    resource_group_name=resource_group.name,
    resource_name_=cluster.name,
    agent_pool_name="system2",
    count=3,
    mode="System",
    os_disk_size_gb=30,
    os_type="Linux",
    type="VirtualMachineScaleSets",
    vm_size="Standard_DS2_v2")
This succeeded without error, creating a second agent pool.
Having deployed a second system agent pool, I removed the inline system agent pool:
- containerservice.ManagedClusterAgentPoolProfileArgs(
- count=3,
- mode="System",
- name="system",
- os_disk_size_gb=30,
- os_type="Linux",
- type="VirtualMachineScaleSets",
- vm_size="Standard_DS2_v2",
- ),
This deployment succeeded without error, but looking at the cluster in the portal showed that the original system agent pool had not been deleted. Performing a refresh afterwards didn't show any changes; the external service didn't re-add the original system agent pool to the state, which was unexpected.
This fails, as expected, as at least one inline agent pool is required.
Starting again from a clean deployment of the initial program above, I then deleted the inline agent pool after the first deployment. The update succeeds, refreshing shows no changes, and the pool has not been deleted.
Re-adding the inline pool and deploying succeeds. No additional agent pool is added in the portal. Refreshing shows no changes.
Repeating steps 4 and 5 but then also manually deleting the original node pool via the UI.
The manual deletion completes successfully. When performing a refresh, we now see:
~ agentPoolProfiles : [
~ [0]: {
count : 3
currentOrchestratorVersion: "1.26.6"
enableFIPS : false
kubeletDiskType : "OS"
maxPods : 110
mode : "System"
~ name : "system" => "system2"
nodeImageVersion : "AKSUbuntu-2204gen2containerd-202309.06.0"
orchestratorVersion : "1.26.6"
osDiskSizeGB : 30
osDiskType : "Ephemeral"
osSKU : "Ubuntu"
osType : "Linux"
powerState : {
code: "Running"
}
provisioningState : "Succeeded"
+ scaleDownMode : "Delete"
type : "VirtualMachineScaleSets"
vmSize : "Standard_DS2_v2"
}
]
The next deployment after refresh shows the agent pool profiles being removed again but succeeds without error, and without deleting the actual underlying agent pool.
~ agentPoolProfiles: [
- [0]: {
- count : 3
- enableFIPS : false
- kubeletDiskType : "OS"
- maxPods : 110
- mode : "System"
- name : "system2"
- orchestratorVersion: "1.26.6"
- osDiskSizeGB : 30
- osDiskType : "Ephemeral"
- osSKU : "Ubuntu"
- osType : "Linux"
- powerState : {
- code: "Running"
}
- scaleDownMode : "Delete"
- type : "VirtualMachineScaleSets"
- upgradeSettings : {}
- vmSize : "Standard_DS2_v2"
}
]
When exporting the ARM template via the portal, all visible node groups are included in the agentPoolProfiles property of the exported ManagedCluster resource.
Following on from yesterday's investigations, I believe the ideal behaviour here would be to use only the inline agent pool profiles to manage these sub-resources.
It's impossible to manage all pools as external AgentPool resources because at least one pool is required to be defined at the point of creation.
In order to keep a purely declarative model where we're just describing the desired state (rather than an initial state plus migrations), we should therefore model all pools inline.
I can't currently see a way to infer this mixed behaviour (defining all pools inline but deferring modifications to the separate endpoints) directly from the specification.
There is also an ongoing parallel issue for ARM templates; we could wait to see how Microsoft resolves it, as their resolution will almost certainly impact us too:
Proposal
Phase 1: Add additional docs to explain the oddities in the ARM behaviour.
Phase 2:
- […] agentPoolProfiles are included.
- […] agentPoolProfiles and delegate to the default implementation.
- […] agentPoolProfiles against state and make individual calls to create, then delete as required.
This would require all agent pools to be defined inline and not as separate resources. Any agent pools defined as a separate resource would be deleted the next time the ManagedCluster was deployed. We could add a deprecation notice to the AgentPool resource highlighting this potential issue and recommend that people update their code with all pools defined inline.
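To illustrate the Phase 2 idea, here is a small, purely illustrative sketch of the diffing step, assuming the provider compares the desired inline profiles with the last-known state as simple name-keyed dictionaries. The helper name plan_pool_operations and its inputs are hypothetical, not provider APIs:
# Purely illustrative: compute per-pool operations from desired vs. current
# agentPoolProfiles, as suggested for Phase 2. Not provider code.
from typing import Dict, List, Tuple

def plan_pool_operations(
    desired: Dict[str, dict], current: Dict[str, dict]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (to_create, to_update, to_delete) agent pool names."""
    to_create = [name for name in desired if name not in current]
    to_update = [
        name for name in desired
        if name in current and desired[name] != current[name]
    ]
    to_delete = [name for name in current if name not in desired]
    return to_create, to_update, to_delete

# Example: renaming a pool shows up as one create plus one delete, each of which
# would be sent to the per-agent-pool API rather than a single cluster PUT.
desired = {"system": {"mode": "System", "count": 3}, "workloads2": {"mode": "User", "count": 3}}
current = {"system": {"mode": "System", "count": 3}, "workloads": {"mode": "User", "count": 3}}
print(plan_pool_operations(desired, current))  # (['workloads2'], [], ['workloads'])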