Closed: sergii-auctane closed this issue 1 month ago
This issue is currently awaiting triage.
If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Chart Version: 1.0.0, but karpenter image version is 1.0.6
The commit in your logs, 5bdf9c3, indicates that you're still running v1.0.0, not v1.0.6. How did you go about updating the controller?
@jmdeal By setting the image tag to 1.0.6:
spec:
  ...
  containers:
    - env:
        ...
        - name: NEW_RELIC_METADATA_KUBERNETES_CONTAINER_IMAGE_NAME
          value: public.ecr.aws/karpenter/controller:1.0.6@sha256:1eb1073b9f4ed804634aabf320e4d6e822bb61c0f5ecfd9c3a88f05f1ca4c5c5
        ...
      image: public.ecr.aws/karpenter/controller:1.0.6@sha256:1eb1073b9f4ed804634aabf320e4d6e822bb61c0f5ecfd9c3a88f05f1ca4c5c5
      imagePullPolicy: IfNotPresent
UPD: I just realized a digest, sha256:1eb1073b9f4ed804634aabf320e4d6e822bb61c0f5ecfd9c3a88f05f1ca4c5c5, is set as well, and it hasn't changed.
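For anyone hitting the same thing: when a digest is part of the rendered image reference, the container runtime resolves the image by digest and the tag is effectively ignored, so bumping only the tag while keeping the chart's default digest leaves the old build running. A minimal values sketch, assuming the chart's controller.image.repository/tag/digest keys:

```yaml
# Sketch only: overriding just the controller image via Helm values (keys assumed
# from the karpenter chart's values.yaml). Leaving the chart-default digest in
# place renders <repository>:1.0.6@sha256:<old digest>, and the digest wins.
controller:
  image:
    repository: public.ecr.aws/karpenter/controller
    tag: "1.0.6"
    digest: ""   # or null, as below; otherwise set the digest that matches 1.0.6
```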
I set controller.image.digest to null. I assume we can close this then.
Thanks
I would recommend upgrading the chart and not just the image. The versions are coupled and there can be changes to the chart on patch versions (we've made a few updates on 1.0 w.r.t. the conversion webhooks). This may work fine, but I'd upgrade the entire chart as a best practice.
Thanks. I wasn't able to find it initially, but then realised it was released not under the tag 1.0.6 or the main branch, but in the release-v1.0.6 branch.
Description
We have 5 different node pools, 2 of them are for jobs, and we run almost all of our jobs in those node pools. So when there are no job pods, we don't have any nodes in those node pools; thus, these node pools see continuous rotation of nodes. Karpenter can create up to 20 nodes, then terminate them, create another 10 nodes within 5 minutes and terminate 5 of them, then create another 10, and so on.
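For context, a rough sketch of what one of the job pools looks like; the names, taint, and requirements below are hypothetical and only illustrate the scale-from-zero / consolidate-to-zero pattern that drives the churn.

```yaml
# Hypothetical jobs NodePool (illustrative names): it scales up from zero when
# job pods appear and, with WhenEmpty consolidation, scales back to zero when
# they finish, which is what produces the constant NodeClaim churn.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: jobs
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # hypothetical EC2NodeClass name
      taints:
        - key: workload-type     # hypothetical taint keeping non-job pods off this pool
          value: jobs
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```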
Observed Behavior: Sometimes, Karpenter gets stuck and stops creating nodes because of broken NodeClaims.
It starts creating new nodes after I manually remove the finalizers on the broken NodeClaim. I run Karpenter in 5 clusters, and this issue appears randomly in some of them. I hoped that the fix mentioned in this issue would help, but it didn't.
As a temporary fix, I have a job that removes finalizers from the broken NodeClaims, but this job does not take into account NodeClaims with a null value in the NODE field. I did this intentionally, as new NodeClaims do not have a node name in the NODE field. So I might need to extend it to also remove finalizers from NodeClaims older than 10 minutes without a node assigned.
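Roughly, the cleanup job looks like the sketch below; the names, schedule, and the filter for what counts as "broken" are illustrative assumptions, and it needs a ServiceAccount with get/list/patch permissions on nodeclaims.karpenter.sh.

```yaml
# Illustrative cleanup CronJob: strips finalizers from NodeClaims that are stuck
# deleting and already have a node assigned; claims with an empty NODE field are
# skipped, as described above. The "broken" filter here is an assumption.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nodeclaim-finalizer-cleanup
  namespace: kube-system
spec:
  schedule: "*/10 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: nodeclaim-cleanup   # hypothetical SA with RBAC on nodeclaims
          restartPolicy: Never
          containers:
            - name: cleanup
              image: bitnami/kubectl:1.29          # any image with kubectl works
              command:
                - /bin/sh
                - -c
                - |
                  kubectl get nodeclaims --no-headers \
                    -o custom-columns=NAME:.metadata.name,NODE:.status.nodeName,DELETING:.metadata.deletionTimestamp |
                  while read -r name node deleting; do
                    [ "$deleting" = "<none>" ] && continue   # not being deleted
                    [ "$node" = "<none>" ] && continue       # no node assigned: skip
                    kubectl patch nodeclaim "$name" --type=merge -p '{"metadata":{"finalizers":null}}'
                  done
```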
Expected Behavior: I expected that I would not need the job that removes finalizers from broken NodeClaims and that Karpenter would do it instead, or that it would at least continue creating new nodes when the issue happens.
Reproduction Steps (Please include YAML): I don't know how to reproduce it; the most obvious way is to create a cluster with a node pool and run several CronJobs that bring 2k-9k pods at minutes 0, 15, and 30 to force node rotation.
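Something along these lines is the closest I have to reproduction YAML; the name, image, and sizes are made up, and only the shape (thousands of short-lived pods every 15 minutes) matches the load described above. A toleration or nodeSelector for the jobs pool would be needed if that pool is tainted.

```yaml
# Illustrative burst-load CronJob: each run launches ~2k short-lived pods so
# Karpenter has to scale the jobs pool up from zero and then consolidate it
# back down, forcing continuous NodeClaim creation and deletion.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: burst-load
spec:
  schedule: "0,15,30 * * * *"
  jobTemplate:
    spec:
      parallelism: 2000
      completions: 2000
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: work
              image: busybox:1.36
              command: ["sh", "-c", "sleep 120"]
              resources:
                requests:
                  cpu: 500m
                  memory: 256Mi
```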
Versions:
Chart Version: 1.0.0, but karpenter image version is 1.0.6
Kubernetes Version (kubectl version): 1.29 (EKS)

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment