waggle-sensor / edge-scheduler

Waggle Edge Scheduler

pending plugins do not get removed after deleting the job #112

Closed gemblerz closed 9 months ago

gemblerz commented 11 months ago

A plugin can be in the Pending state for many reasons, e.g. when its resource request exceeds 100%. If the plugin's job is removed while the plugin is in that state, the node scheduler does not remove the pending plugin.

I think this is a timing issue: the job gets removed from the scheduler while the scheduler is reattempting to connect to the plugin's event watcher. The scheduler removes the Go function that watches the plugin's events, but the actual Pod is left unattended.

gemblerz commented 10 months ago

This problem occurred on W05E:

INFO: 2023/12/18 23:34:28 resourcemanager.go:1348: attemping to re-connect for pod "panda-dbaserh-1936"
ERROR: 2023/12/18 23:34:38 http.go:185: Failed to subscribe "/api/v1/goals/W05E/stream": Streaming encountered EOF or an error and considered as closed
INFO: 2023/12/18 23:34:43 http.go:191: Retrying to connect to "/api/v1/goals/W05E/stream" in 5 seconds...
ERROR: 2023/12/18 23:37:58 resourcemanager.go:1663: Failed on watcher "waggle-plugin-scheduler-goals": Watcher is closed
INFO: 2023/12/18 23:38:03 nodescheduler.go:319: The goal imagesampler-dawn exists and no changes in the goal. Skipping adding the goal
ERROR: 2023/12/18 23:40:27 resourcemanager.go:1333: Watcher of the plugin panda-dbaserh-agg-1937 is unexpectedly closed. 
INFO: 2023/12/18 23:40:27 resourcemanager.go:1348: attemping to re-connect for pod "panda-dbaserh-agg-1937"
ERROR: 2023/12/18 23:49:58 resourcemanager.go:1333: Watcher of the plugin panda-calib-1938 is unexpectedly closed. 
INFO: 2023/12/18 23:49:58 resourcemanager.go:1348: attemping to re-connect for pod "panda-calib-1938"
ERROR: 2023/12/18 23:50:26 http.go:185: Failed to subscribe "/api/v1/goals/W05E/stream": Streaming encountered EOF or an error and considered as closed
INFO: 2023/12/18 23:50:31 http.go:191: Retrying to connect to "/api/v1/goals/W05E/stream" in 5 seconds...
ERROR: 2023/12/18 23:51:00 resourcemanager.go:1298: Plugin "panda-dbaserh-1936" has failed
ERROR: 2023/12/18 23:51:00 resourcemanager.go:1302: failed to get plugin "panda-dbaserh-1936" container "plugin-controller" log: the server could not find the requested resource ( pods/log panda-dbaserh-1936)
ERROR: 2023/12/18 23:51:00 nodescheduler.go:222: Could not get goal to update plugin status: "The goal ID f362e53d-077b-4677-4a12-4ac9417a08c6 does not exist"
ERROR: 2023/12/18 23:51:48 resourcemanager.go:1333: Watcher of the plugin panda-rosbag-radenv-1939 is unexpectedly closed. 
INFO: 2023/12/18 23:51:48 resourcemanager.go:1348: attemping to re-connect for pod "panda-rosbag-radenv-1939"

The appropriate action would be to remove those Pods once the scheduler notices that their associated goal no longer exists.
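
To illustrate the proposed behavior, here is a minimal sketch of the check in Go. It is not the scheduler's actual code: the `Pod` type, the `orphanedPods` function, and the goal IDs are hypothetical, and the sketch only decides which Pods should be removed, leaving the actual deletion to the resource manager.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified view of a plugin Pod: its name and the
// ID of the goal it was created for.
type Pod struct {
	Name   string
	GoalID string
}

// orphanedPods returns the sorted names of Pods whose goal ID is not present
// in activeGoals, i.e. Pods whose goal no longer exists.
func orphanedPods(pods []Pod, activeGoals map[string]bool) []string {
	var orphans []string
	for _, p := range pods {
		if !activeGoals[p.GoalID] {
			orphans = append(orphans, p.Name)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	// Illustrative values only; the goal IDs are placeholders.
	pods := []Pod{
		{Name: "panda-dbaserh-1936", GoalID: "goal-removed"},
		{Name: "panda-calib-1938", GoalID: "goal-active"},
	}
	active := map[string]bool{"goal-active": true}

	for _, name := range orphanedPods(pods, active) {
		fmt.Println("would delete orphaned Pod:", name)
	}
}
```

Running such a check whenever a status update fails with "goal does not exist" would catch Pods that outlive their goal, regardless of which watcher was reconnecting at the time.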

gemblerz commented 9 months ago

https://github.com/waggle-sensor/edge-scheduler/pull/114 resolves this issue. The root cause was that a plugin's Pod name carries its JobID as a suffix, so the cleanUpGoal function did not have the correct name with which to clean up the plugin.
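
The name mismatch can be sketched in Go as follows. This is an illustration, not the code from the pull request: `podName` and `removePod` are hypothetical helpers, and the `<plugin name>-<JobID>` format is an assumption inferred from Pod names in the log such as "panda-dbaserh-1936".

```go
package main

import "fmt"

// podName builds a Pod name under the assumed format "<plugin name>-<JobID>".
func podName(pluginName, jobID string) string {
	return fmt.Sprintf("%s-%s", pluginName, jobID)
}

// removePod deletes name from the set of running Pods and reports whether a
// Pod with exactly that name was found.
func removePod(running map[string]bool, name string) bool {
	if !running[name] {
		return false
	}
	delete(running, name)
	return true
}

func main() {
	running := map[string]bool{"panda-dbaserh-1936": true}

	// Looking the Pod up by the bare plugin name misses it.
	fmt.Println(removePod(running, "panda-dbaserh")) // false

	// Including the JobID suffix finds and removes it.
	fmt.Println(removePod(running, podName("panda-dbaserh", "1936"))) // true
}
```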