project-akri / akri

A Kubernetes Resource Interface for the Edge
https://docs.akri.sh/
Apache License 2.0

Pods with unready Containers exist on this node, we can't clean the slots yet #681

Open rpieczon opened 11 months ago

rpieczon commented 11 months ago

Describe the bug

The Akri agent daemonset keeps reporting the following error whenever any pod running on the cluster is not ready.

[2023-11-16T13:44:46Z TRACE agent::util::slot_reconciliation] reconcile - Pods with unready Containers exist on this node, we can't clean the slots yet

In my case, the failing pod doesn't use USB resources.

Output of kubectl get pods,akrii,akric -o wide

lpfe04@f1725b929a:~$ kubectl get pod,akrii,akric -n akri
NAME                                              READY   STATUS    RESTARTS   AGE
pod/akri-agent-daemonset-9gwl2                    1/1     Running   0          10m
pod/akri-controller-deployment-7c6455f79-zt779    1/1     Running   0          11m
pod/akri-udev-discovery-daemonset-2d9hp           1/1     Running   0          10m
pod/akri-webhook-configuration-7bf6656b45-mclth   1/1     Running   0          11m

NAME                                  CONFIG        SHARED   NODES            AGE
instance.akri.sh/gsm-dongle-6e977d    gsm-dongle    false    ["f1725b929a"]   10m
instance.akri.sh/wifi-dongle-254c38   wifi-dongle   false    ["f1725b929a"]   10m
instance.akri.sh/wifi-dongle-ac917e   wifi-dongle   false    ["f1725b929a"]   10m

NAME                                CAPACITY   AGE
configuration.akri.sh/gsm-dongle    1          11h
configuration.akri.sh/wifi-dongle   1          11h

Kubernetes Version: [e.g. Native Kubernetes 1.19, MicroK8s 1.19, Minikube 1.19, K3s]

kubernetes: v1.26.8+rke2r1

Expected behavior

I would expect the reconciliation process to continue if the failing pod is outside the USB usage/management context.

kate-goldenring commented 10 months ago

@rpieczon just to clarify, are you saying if any pod (even if unassociated with Akri) is unready, it causes this slot reconciliation error? From what i remember slot reconciliation should only check pods with an expected annotation.

bfjelds commented 10 months ago

i lose track a little, but the annotations are on the container, not the pod i think ... and it might be that an unready pod is considered a potential place where an annotated container could eventually exist. might be worth looking at the resource requests to limit where this early exit happens.

bfjelds commented 10 months ago

might be hard to check for the resource though. if the pod isn't ready and the container doesn't exist, there isn't much context to check the instances against.
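One way to picture the idea of limiting the early exit by resource requests is the rough Python sketch below. It is not Akri's actual (Rust) reconciliation code. It assumes that Akri resources are requested under names starting with `akri.sh/`, and that a pod's spec still lists its containers' resource limits even when those containers are not ready or not yet created. The function names and the sample pods are hypothetical.

```python
from typing import Any, Dict, List

# Assumption: Akri-managed resources are requested under this prefix.
AKRI_RESOURCE_PREFIX = "akri.sh/"

Pod = Dict[str, Any]


def requests_akri_resource(pod: Pod) -> bool:
    """Return True if any container in the pod spec asks for an akri.sh/* resource."""
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources") or {}
        for section in ("limits", "requests"):
            names = resources.get(section) or {}
            if any(name.startswith(AKRI_RESOURCE_PREFIX) for name in names):
                return True
    return False


def has_unready_container(pod: Pod) -> bool:
    """Return True if a container is reported unready or has no status yet."""
    expected = len(pod.get("spec", {}).get("containers", []))
    statuses = pod.get("status", {}).get("containerStatuses") or []
    # Fewer statuses than containers means some container was never created.
    if len(statuses) < expected:
        return True
    return any(not status.get("ready", False) for status in statuses)


def should_defer_slot_cleanup(pods: List[Pod]) -> bool:
    """Defer cleanup only for unready pods that could actually hold an Akri slot."""
    return any(requests_akri_resource(p) and has_unready_container(p) for p in pods)


# Hypothetical pods: an unready Prometheus pod with no Akri request,
# and an unready pod that requests one of the dongle instances.
prometheus_pod: Pod = {
    "spec": {"containers": [{"name": "prometheus",
                             "resources": {"limits": {"cpu": "500m"}}}]},
    "status": {"containerStatuses": [{"name": "prometheus", "ready": False}]},
}
dongle_pod: Pod = {
    "spec": {"containers": [{"name": "modem",
                             "resources": {"limits": {"akri.sh/gsm-dongle-6e977d": "1"}}}]},
    "status": {},  # container not created yet, so no statuses are reported
}

print(should_defer_slot_cleanup([prometheus_pod]))              # unrelated pod only
print(should_defer_slot_cleanup([prometheus_pod, dongle_pod]))  # Akri pod pending
```

With a filter like this, an unready pod that never asks for an `akri.sh/` resource would no longer block slot cleanup, while a pending pod that does ask for one still would.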

rpieczon commented 10 months ago

> @rpieczon just to clarify, are you saying if any pod (even if unassociated with Akri) is unready, it causes this slot reconciliation error? From what i remember slot reconciliation should only check pods with an expected annotation.

Exactly. In my case, I have a failing Prometheus pod that has no requirements related to USB allocation.

rpieczon commented 8 months ago

Any update on it?