A plugin container may fail during the initial phase, while volumes are being mounted and the plugin image is being prepared. For example:
State: Terminated
Reason: StartError
Message: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/kubelet/pods/b74d16e6-eced-4932-81a7-c0e45a78b960/volume-subpaths/waggle-data-config/panda-rosbag-lid/1" to rootfs at "/run/waggle/data-config.json": mount /var/lib/kubelet/pods/b74d16e6-eced-4932-81a7-c0e45a78b960/volume-subpaths/waggle-data-config/panda-rosbag-lid/1:/run/waggle/data-config.json (via /proc/self/fd/6), flags: 0x5001: no such file or directory: unknown
The Pod then goes into the StartError state:
panda-rosbag-lid-1622 1/2 StartError 0 6d9h
The scheduler's watcher does not receive this event, so the scheduler believes the Pod is running. We need to improve the watch mechanism so that it catches these errors and properly restarts the Pod.
Another example of this problem occurs when the plugin's container image needs to be pulled before the plugin can run. The scheduler misses this event as well, so both the scheduler and users believe the plugin is running.
121 added some IF statements to capture these errors when they come from the Kubernetes informer. When one is seen, the scheduler marks the run as failed and moves on. This does not resolve errors related to system health, such as "no space left", "contained error", etc., but it at least informs users of those failures and lets the scheduler continue.
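As a rough illustration of that kind of check (not the code from 121), the sketch below inspects a Pod object in the JSON shape that `kubectl get pod -o json` returns and reports the first container that failed to start. The function name `find_start_failure`, the two reason sets, the `sidecar` container name, and the shortened message are assumptions made for this example. The actual scheduler may match on different reasons.

```python
from typing import Optional

# Reasons treated as "the container will not start on its own".
# Both sets are assumptions for illustration, not the scheduler's actual list.
FATAL_TERMINATED_REASONS = {"StartError"}
FATAL_WAITING_REASONS = {
    "ErrImagePull",
    "ImagePullBackOff",
    "CreateContainerError",
    "CreateContainerConfigError",
}


def find_start_failure(pod: dict) -> Optional[str]:
    """Return "<container>: <reason>: <message>" for the first container
    that failed to start, or None if no such failure is visible."""
    status = pod.get("status") or {}
    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for cs in statuses:
        state = cs.get("state") or {}
        # A container that died during init shows up under state.terminated.
        terminated = state.get("terminated")
        if terminated and terminated.get("reason") in FATAL_TERMINATED_REASONS:
            return f'{cs.get("name")}: {terminated["reason"]}: {terminated.get("message", "")}'
        # Image pull and container creation problems show up under state.waiting.
        waiting = state.get("waiting")
        if waiting and waiting.get("reason") in FATAL_WAITING_REASONS:
            return f'{cs.get("name")}: {waiting["reason"]}: {waiting.get("message", "")}'
    return None


# Example mirroring the 1/2 StartError Pod above (message shortened by hand).
example_pod = {
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "sidecar", "state": {"running": {}}},
            {
                "name": "panda-rosbag-lid",
                "state": {
                    "terminated": {
                        "reason": "StartError",
                        "message": "no such file or directory",
                    }
                },
            },
        ],
    }
}
failure = find_start_failure(example_pod)
```

With a check like this applied to each Pod update from the informer, a non-None result could be used to mark the run as failed even though the Pod phase itself still reads as running.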