Description
When running Volcano Scheduler, I encounter repeated "podgroup not found" errors in the SyncQueue action. The logs show the controller syncing the queue but failing to find the referenced podgroup.scheduling.volcano.sh resource, so the request is dropped with an "OutOfSync" event. Here are the relevant logs:
I1108 02:53:40.178456 1 queue_controller.go:245] Begin execute SyncQueue action for queue default, current status Open
I1108 02:53:40.178479 1 queue_controller_action.go:35] Begin to sync queue default.
I1108 02:53:40.178495 1 queue_controller_action.go:48] End sync queue default.
I1108 02:53:40.178503 1 queue_controller.go:227] Finished syncing queue default (59.49µs).
I1108 02:53:40.178525 1 queue_controller.go:269] Dropping queue request Queue: default, Job: /, Task:, Event:OutOfSync, ExitCode:0, Action:SyncQueue, JobVersion: 0 out of the queue for sync queue default failed for podgroup.scheduling.volcano.sh "u-28-1217-829wu6ok5s2xpvtc9dw90lao7" not found, event is OutOfSync, action is SyncQueue.
This error appears consistently, with different podgroup IDs, after jobs complete. Our setup includes a cleanup script that runs every 3 seconds and automatically deletes jobs once they finish.
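For context, here is a minimal Go sketch of what that cleanup loop does. This is an illustrative approximation, not our exact script; the volcano.sh/apis clientset usage and the hard-coded "default" namespace are assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"

	batchv1alpha1 "volcano.sh/apis/pkg/apis/batch/v1alpha1"
	vcclientset "volcano.sh/apis/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	vcClient := vcclientset.NewForConfigOrDie(cfg)

	// Every 3 seconds, delete Volcano jobs that have reached the Completed phase.
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		jobs, err := vcClient.BatchV1alpha1().Jobs("default").List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			log.Printf("list jobs: %v", err)
			continue
		}
		for _, job := range jobs.Items {
			if job.Status.State.Phase != batchv1alpha1.Completed {
				continue
			}
			if err := vcClient.BatchV1alpha1().Jobs(job.Namespace).Delete(context.TODO(), job.Name, metav1.DeleteOptions{}); err != nil {
				log.Printf("delete job %s/%s: %v", job.Namespace, job.Name, err)
			}
		}
	}
}
```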
Steps to reproduce the issue
Run a queue in Volcano with multiple jobs submitted to it (a minimal job sketch is included after these steps).
Allow the jobs to complete, triggering automatic job deletion every 3 seconds.
Monitor the controller logs for SyncQueue actions.
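For step 1, a rough sketch of the kind of job submission involved. The job name, image, and resource values are placeholders rather than our real workload, and the clientset usage assumes the volcano.sh/apis Go bindings.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	batchv1alpha1 "volcano.sh/apis/pkg/apis/batch/v1alpha1"
	vcclientset "volcano.sh/apis/pkg/client/clientset/versioned"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	vcClient := vcclientset.NewForConfigOrDie(cfg)

	// A single-task job bound to the "default" queue, scheduled by Volcano.
	job := &batchv1alpha1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-job", Namespace: "default"},
		Spec: batchv1alpha1.JobSpec{
			Queue:        "default",
			MinAvailable: 1,
			Tasks: []batchv1alpha1.TaskSpec{{
				Name:     "worker",
				Replicas: 1,
				Template: corev1.PodTemplateSpec{
					Spec: corev1.PodSpec{
						SchedulerName: "volcano",
						RestartPolicy: corev1.RestartPolicyNever,
						Containers: []corev1.Container{{
							Name:    "main",
							Image:   "busybox",
							Command: []string{"sh", "-c", "sleep 10"},
							Resources: corev1.ResourceRequirements{
								Requests: corev1.ResourceList{
									corev1.ResourceCPU: resource.MustParse("100m"),
								},
							},
						}},
					},
				},
			}},
		},
	}
	if _, err := vcClient.BatchV1alpha1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("submitted job demo-job to queue default")
}
```

Submitting several such jobs and letting the 3-second cleanup delete them as they complete reliably reproduces the log noise above.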
Describe the results you received and expected
Received: after completed jobs are deleted, the queue controller repeatedly logs "podgroup not found" and drops the SyncQueue request with an OutOfSync event. Expected: the controller syncs the queue without these errors after jobs complete and are cleaned up.
What version of Volcano are you using?
v1.10.0
Any other relevant information
After the cluster has been running for some time, some podgroups stay Pending forever, with an event indicating "queue resource quota insufficient", even though the cluster actually has enough free resources. Restarting the scheduler lets them run normally again, which only temporarily resolves the issue.
Could this "queue resource quota insufficient" problem be related to the repeated SyncQueue errors, i.e., to outdated or inconsistent resource accounting in the queue status?