waggle-sensor / edge-scheduler

Waggle Edge Scheduler

StartError not notified to the scheduler when running Plugin #117

gemblerz closed this issue 1 month ago

gemblerz commented 9 months ago

A plugin container may fail during the initial phase, while volumes are being mounted and the plugin image is being prepared. For example:

    State:          Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/kubelet/pods/b74d16e6-eced-4932-81a7-c0e45a78b960/volume-subpaths/waggle-data-config/panda-rosbag-lid/1" to rootfs at "/run/waggle/data-config.json": mount /var/lib/kubelet/pods/b74d16e6-eced-4932-81a7-c0e45a78b960/volume-subpaths/waggle-data-config/panda-rosbag-lid/1:/run/waggle/data-config.json (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown

The Pod then goes into the StartError state:

panda-rosbag-lid-1622      1/2     StartError   0          6d9h

The scheduler's watcher does not receive this event, so the scheduler believes the Pod is running. We need to improve the watch mechanism so that it catches these errors and restarts the Pod properly.

Another example of this problem occurs when the plugin's container image needs to be pulled before the plugin can run. The scheduler misses this event as well, so both the scheduler and users believe the plugin is already running.
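To illustrate the kind of check the watcher is missing, here is a minimal Go sketch that derives a state from a container's status instead of assuming the Pod is running. It is not the scheduler's actual code. The structs are simplified, hypothetical stand-ins for the Kubernetes container status types, and `Phase` and `classify` are names made up for this example. The waiting reasons `ErrImagePull` and `ImagePullBackOff` are the ones the kubelet normally reports for image pull failures.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes container status
// types; only the fields this sketch needs are modeled.
type ContainerStateWaiting struct{ Reason, Message string }

type ContainerStateTerminated struct {
	Reason, Message string
	ExitCode        int32
}

type ContainerStatus struct {
	Name       string
	Running    bool
	Waiting    *ContainerStateWaiting
	Terminated *ContainerStateTerminated
}

// Phase is a hypothetical scheduler-side view of a plugin container.
type Phase string

const (
	PhaseStarting  Phase = "starting"
	PhaseRunning   Phase = "running"
	PhaseCompleted Phase = "completed"
	PhaseFailed    Phase = "failed"
)

// classify maps a container status to a phase plus a short detail string.
// A container counts as running only when it actually reports Running.
func classify(cs ContainerStatus) (Phase, string) {
	switch {
	case cs.Terminated != nil:
		// StartError means the container never started (e.g. a mount failure).
		if cs.Terminated.Reason == "StartError" || cs.Terminated.ExitCode != 0 {
			return PhaseFailed, cs.Terminated.Message
		}
		return PhaseCompleted, ""
	case cs.Waiting != nil:
		switch cs.Waiting.Reason {
		case "ErrImagePull", "ImagePullBackOff":
			return PhaseFailed, cs.Waiting.Message
		}
		// Any other waiting reason (e.g. still pulling the image) is "not yet running".
		return PhaseStarting, cs.Waiting.Reason
	case cs.Running:
		return PhaseRunning, ""
	}
	return PhaseStarting, ""
}

func main() {
	statuses := []ContainerStatus{
		{Name: "start-error", Terminated: &ContainerStateTerminated{Reason: "StartError", Message: "no such file or directory"}},
		{Name: "pulling", Waiting: &ContainerStateWaiting{Reason: "ContainerCreating"}},
		{Name: "healthy", Running: true},
	}
	for _, cs := range statuses {
		phase, detail := classify(cs)
		fmt.Printf("%s: %s %q\n", cs.Name, phase, detail)
	}
}
```

In this sketch, a container that is still waiting on its image is reported as starting rather than running, and a StartError is surfaced as a failure the scheduler could act on.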

gemblerz commented 1 month ago

#121 added some if statements to capture those errors coming from the Kubernetes informer. When one is seen, the scheduler marks the run as failed and moves on. This does not, however, resolve errors related to system health, such as "no space left", "contained error", etc., but it will at least inform users of those failures and let the scheduler continue.
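The behavior described in this comment could look roughly like the Go sketch below. This is a guess at the general shape and not the code from the change itself. `Run`, `markFailed`, and `systemHealthHints` are hypothetical names, and the only system-health hint listed is the "no space left" string quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// Run is a hypothetical record of one plugin execution tracked by the scheduler.
type Run struct {
	Plugin string
	Status string
	Reason string
}

// systemHealthHints lists message fragments that point at a node problem
// rather than a plugin problem (an assumption made for this sketch).
var systemHealthHints = []string{"no space left"}

// markFailed records the failure on the run so that users can see it and the
// scheduler can move on. It returns true when the message looks like a
// system-health error, which marking the run as failed does not resolve.
func markFailed(r *Run, reason, message string) bool {
	r.Status = "failed"
	r.Reason = reason + ": " + message
	lower := strings.ToLower(message)
	for _, hint := range systemHealthHints {
		if strings.Contains(lower, hint) {
			return true
		}
	}
	return false
}

func main() {
	run := &Run{Plugin: "panda-rosbag-lid", Status: "running"}
	systemHealth := markFailed(run, "StartError", "no such file or directory")
	fmt.Println(run.Status, run.Reason, systemHealth)
}
```

Returning the system-health flag separately keeps the two concerns apart: the run is always marked as failed, while node-level problems can be reported to whoever maintains the node.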