Closed by NotMyFault 7 months ago
Trying to manually trigger an arm64 node in the node pool, it seems that the autoscaler fails.
Once the first node has spawned, the autoscaler works:
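One way to see what the autoscaler decided in this situation (a diagnostic sketch, not from the original thread; the pod name is illustrative) is to inspect the events on a pending pod, where the cluster autoscaler records `TriggeredScaleUp` or `NotTriggerScaleUp`:

```shell
# Diagnostic sketch: list pending pods, then inspect scheduler/autoscaler
# events on one of them. "my-arm64-agent" is a hypothetical pod name.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
kubectl describe pod my-arm64-agent \
  | grep -E 'TriggeredScaleUp|NotTriggerScaleUp|FailedScheduling'
```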
I opened an issue with Azure/Microsoft.
The details of my ticket are not in the email, but the title was:
Issue Definition: autoscaling not working from 0 but does from 1 node
The first feedback was:
Hi Stephane,
I hope you're doing well.
Starting the scaling at 0 may prevent the autoscaling from working as expected. The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints. If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation. It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.
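For context, a node pool intended to autoscale from zero would typically be created along these lines (a sketch only; the resource group, cluster name, pool name, and VM size are illustrative assumptions, not taken from the thread):

```shell
# Hypothetical sketch of the spot arm64 node pool being discussed.
# All names and the VM size are illustrative assumptions.
az aks nodepool add \
  --resource-group rg-jenkins-infra \
  --cluster-name privatek8s \
  --name arm64spot \
  --priority Spot \
  --node-vm-size Standard_D4pds_v5 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 3 \
  --node-count 0
```

With `--min-count 0` the autoscaler is expected to scale the pool up from zero when a pod pending on it appears, which is the behavior under dispute here.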
If my understanding is correct, I probably explained the problem badly, so I replied:
Hi All,
I hope you're doing well.
`The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.` This is exactly the problem: arm64 pods cannot be scheduled because there is no node in the node pool, so I expect the autoscaler to spawn one.
Starting the scaling from 1 means spending more money, with a node doing nothing for part of the time.
Stéphane
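The scheduling constraint described above can be illustrated with a minimal pod spec (a sketch with hypothetical names; the toleration uses the standard AKS spot node pool taint `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`):

```shell
# Sketch: a pod that can only land on an arm64 spot node. With zero nodes
# in the pool it stays Pending, which should trigger a scale-up.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: arm64-agent-example   # illustrative name
spec:
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  containers:
    - name: agent
      image: jenkins/inbound-agent:latest
EOF
```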
We had similar problems with spot instances, if we had it set to 0 it didn't work, whereas 1 did.
Thanks @timja, I mentioned it to the Azure support team during the video call; they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.
Given we can't afford (in the current subscription) to use non-spot instances, WDYT about starting work on a new AKS cluster dedicated to the infra.ci.jenkins.io and release.ci.jenkins.io Kubernetes agents in the new "sponsored" subscription?
Multiple achievements for us:
If it makes sense, I propose to close this issue and track the solution above in a new one (ping @smerle33 if you don't mind writing it).
The new issue would need to mention:
"arm64" as part of https://github.com/jenkins-infra/helpdesk/issues/3823
=> Let's scope the initial implementation to only infra.ci.jenkins.io agents, and only one "non system" node pool of type linux/arm64,
so we can start switching workloads out of privatek8s
Is there an issue to just leave 1 spot instance?
Cost 😅 (but might be OK though @smerle33 do you mind checking the cost?)
I think it's quite acceptable ... less than $15/month.
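As a back-of-the-envelope check (the $0.02/hour spot price below is an illustrative assumption, not a quoted Azure price), a single always-on spot node for a full month lands around that figure:

```shell
# Rough monthly cost of one always-on spot node.
# price_per_hour is an assumed illustrative figure, not a real Azure quote.
price_per_hour=0.02
hours_per_month=730   # ~ 24 * 365 / 12
monthly=$(awk -v p="$price_per_hour" -v h="$hours_per_month" 'BEGIN { printf "%.2f", p * h }')
echo "Estimated monthly cost: \$${monthly}"   # prints "Estimated monthly cost: $14.60"
```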
The last answer from Microsoft confirms our choice:
According to the Microsoft guidelines,
One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.
We don't have any official documents for spot instances.
New answer from Microsoft:
One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.
Isn't it exactly the same response as last week minus the last phrase? ^^
Service(s)
infra.ci.jenkins.io
Summary
The build queue counts 30+ items at the time of writing, but the executor status is stuck in the launching state:
ref https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/view/change-requests/job/PR-4886/ and other PRs
Reproduction steps
No response