Agents aren't spawning on infra.ci

NotMyFault commented 8 months ago

Service(s)

infra.ci.jenkins.io

Summary

The build queue holds 30+ items at the time of writing, but the executors are stuck in the launching state.

[Screenshot: 2024-01-21 at 16:06:27]

ref https://infra.ci.jenkins.io/job/kubernetes-jobs/job/kubernetes-management/view/change-requests/job/PR-4886/ and other PRs

Reproduction steps

No response

smerle33 commented 8 months ago

Trying to manually trigger an arm64 node in the node pool, it seems that the autoscaling fails.

smerle33 commented 8 months ago

Once the first node has spawned, the autoscaling works.

smerle33 commented 8 months ago

I opened an issue with Azure/Microsoft.

The details of my ticket are not in the email, but the title was: Issue Definition: autoscalling not working from 0 but does from 1 node

The first feedback is:

Hi Stephane,

I hope you're doing well.

Starting the scaling at 0 may prevent the autoscaling from working as expected. The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints. If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation. It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

If my understanding is correct, I probably explained the problem badly, so I replied:

Hi All,

I hope you're doing well.

`The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.` This is exactly the problem: arm64 pods cannot be scheduled because there is no node in the node pool, so I expect the autoscaler to spawn one node.
Starting the scaling from 1 means spending more money, with a node sitting idle part of the time.

Stéphane
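
To make the expectation in this reply concrete, here is a minimal sketch of the scale-up decision as described above. It is a toy model, not the real cluster autoscaler: the `NodePool` class, the `desired_count` function, and the pool name `arm64-agents` are all hypothetical and only illustrate that a pending arm64 pod should trigger a scale-up even when the pool is at 0 nodes.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    # Hypothetical model of a node pool; not an Azure or Kubernetes API object.
    name: str
    arch: str
    min_count: int
    max_count: int
    current: int


def desired_count(pool: NodePool, pending_pod_archs: list[str]) -> int:
    """Return the node count a scale-up decision would target (toy model)."""
    # Pending pods that can only run on this pool's architecture.
    unschedulable = [a for a in pending_pod_archs if a == pool.arch]
    if unschedulable and pool.current < pool.max_count:
        # Expected behavior: add a node, even if the pool is currently empty.
        return pool.current + 1
    # Nothing pending for this pool: stay put, but never go below the minimum.
    return max(pool.current, pool.min_count)


pool = NodePool(name="arm64-agents", arch="arm64", min_count=0, max_count=10, current=0)
print(desired_count(pool, ["arm64"]))  # 1: one pending arm64 pod, empty pool
print(desired_count(pool, []))         # 0: nothing pending, pool stays empty
```

The issue reported here is that the observed behavior did not match the first case when the pool started at 0.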

timja commented 8 months ago

We had similar problems with spot instances, if we had it set to 0 it didn't work, whereas 1 did.

smerle33 commented 8 months ago

> We had similar problems with spot instances, if we had it set to 0 it didn't work, whereas 1 did.

Thanks @timja, I told the Azure support team I had on a video call, and they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.

dduportal commented 8 months ago

> > We had similar problems with spot instances, if we had it set to 0 it didn't work, whereas 1 did.
>
> Thanks @timja, I told the Azure support team I had on a video call, and they will look into it. I will keep you informed here, but it sounds right that the spot instances are the trouble here.

Given we can't afford (in the current subscription) to use non-spot instances, WDYT about starting work on a new AKS cluster dedicated to infra.ci.jenkins.io and release.ci.jenkins.io Kubernetes agents in the new "sponsored" subscription?

Multiple achievements for us:

- Less consumption in the current subscription, and we consume sponsored credits
- Separation of concerns between controllers and agents
- No problem using non-spot instances until MS solves the problem

If it makes sense, I propose to close this issue and track the solution above in a new one (ping @smerle33 if you don't mind writing it).

The new issue would need to mention:

- "Using subscription credits instead of CDF money" following https://github.com/jenkins-infra/helpdesk/issues/3818 and https://github.com/jenkins-infra/helpdesk/issues/3913
- "Continue work about infra.ci on arm64" as part of https://github.com/jenkins-infra/helpdesk/issues/3823 => Let's scope the initial implementation to only infra.ci.jenkins.io agents, and only 1 "non system" nodepool of type linux/arm64 so we can start switching workloads out of privatek8s.

timja commented 8 months ago

Is there an issue to just leave 1 spot instance?

dduportal commented 8 months ago

> Is there an issue to just leave 1 spot instance?

Cost 😅 (but might be OK though @smerle33 do you mind checking the cost?)

smerle33 commented 8 months ago

> > Is there an issue to just leave 1 spot instance?
>
> Cost 😅 (but might be OK though @smerle33 do you mind checking the cost?)

I think it's quite acceptable... less than $15/month
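
As a rough sanity check on that figure, the sketch below converts the monthly bound into an hourly price for a single always-on spot node. The 730 hours per month is an assumed average (8760 / 12), and the variable names are illustrative. Actual spot pricing varies by region and VM size and is not taken from this thread.

```python
HOURS_PER_MONTH = 730        # assumed average hours in a month (8760 / 12)
monthly_budget_usd = 15.0    # upper bound quoted above

# Highest hourly spot price that keeps one always-on node under the bound.
max_hourly_price = monthly_budget_usd / HOURS_PER_MONTH
print(f"{max_hourly_price:.4f} USD/hour")  # 0.0205 USD/hour
```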

smerle33 commented 7 months ago

The last answer from Microsoft confirms our choice:

According to the microsoft guidelines,
One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

We don't have any official documents for the spot instance.

smerle33 commented 7 months ago

New answer from Microsoft:

One node should be present where there are unscheduled pods, so that cluster can further autoscale based on those unscheduled pods.
Starting the scaling at 0 may prevent the autoscaling from working as expected.
The cluster autoscaler component watches for pods that can't be scheduled due to resource constraints.
If there are no unscheduled pods, the autoscaler may not trigger the scale-up operation.
It is recommended to start the scaling at 1 to ensure proper functioning of the autoscaling mechanism.

lemeurherve commented 7 months ago

Isn't it exactly the same response as last week minus the last phrase? ^^
