volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Controller reports "podgroup not found" repeatedly for SyncQueue action #3805

Open lixajh opened 6 days ago

lixajh commented 6 days ago

Description

When running Volcano Scheduler, I encountered repeated "podgroup not found" errors in the SyncQueue action. The logs indicate that the controller is trying to sync the queue, but it cannot find the referenced podgroup.scheduling.volcano.sh, resulting in an "OutOfSync" event. Here are the relevant logs:

I1108 02:53:40.178456 1 queue_controller.go:245] Begin execute SyncQueue action for queue default, current status Open
I1108 02:53:40.178479 1 queue_controller_action.go:35] Begin to sync queue default.
I1108 02:53:40.178495 1 queue_controller_action.go:48] End sync queue default.
I1108 02:53:40.178503 1 queue_controller.go:227] Finished syncing queue default (59.49µs).
I1108 02:53:40.178525 1 queue_controller.go:269] Dropping queue request Queue: default, Job: /, Task:, Event:OutOfSync, ExitCode:0, Action:SyncQueue, JobVersion: 0 out of the queue for sync queue default failed for podgroup.scheduling.volcano.sh "u-28-1217-829wu6ok5s2xpvtc9dw90lao7" not found, event is OutOfSync, action is SyncQueue.

This error appears consistently for different podgroup IDs after job completion. Our setup includes a script that automatically deletes jobs upon completion, running every 3 seconds.
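
To see how many distinct podgroups are affected, the controller log can be tallied with a small script. The following is only a sketch: the regular expression is derived from the single "Dropping queue request" line quoted above, and the function name `missing_podgroups` is made up for illustration.

```python
import re
from collections import Counter

# Pattern based only on the "Dropping queue request" log line quoted in this issue.
NOT_FOUND = re.compile(r'podgroup\.scheduling\.volcano\.sh "([^"]+)" not found')


def missing_podgroups(log_lines):
    """Count how often each podgroup name appears in a 'not found' error."""
    counts = Counter()
    for line in log_lines:
        for name in NOT_FOUND.findall(line):
            counts[name] += 1
    return counts


# Two lines copied from the logs above: one normal, one with the error.
sample = [
    "I1108 02:53:40.178495 1 queue_controller_action.go:48] End sync queue default.",
    "I1108 02:53:40.178525 1 queue_controller.go:269] Dropping queue request "
    "Queue: default, Job: /, Task:, Event:OutOfSync, ExitCode:0, Action:SyncQueue, "
    "JobVersion: 0 out of the queue for sync queue default failed for "
    'podgroup.scheduling.volcano.sh "u-28-1217-829wu6ok5s2xpvtc9dw90lao7" not found, '
    "event is OutOfSync, action is SyncQueue.",
]

for name, count in missing_podgroups(sample).items():
    print(name, count)
```

Feeding the full controller log into `missing_podgroups` (for example, the lines of a saved log file) would show whether each podgroup is reported once or repeatedly.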

Steps to reproduce the issue

  1. Run a queue in Volcano with multiple jobs.
  2. Allow the jobs to complete, triggering automatic job deletion every 3 seconds.
  3. Monitor the controller logs for SyncQueue actions.
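
The actual cleanup script is not included in this report. For anyone trying to reproduce the behavior, a loop along the following lines may approximate it. This is a hypothetical sketch: it assumes `kubectl` is on the PATH, that Volcano jobs are listed as `jobs.batch.volcano.sh`, and that a finished job reports `status.state.phase` equal to `Completed`. The function names are invented for illustration.

```python
import json
import subprocess
import time


def completed_jobs(job_list_json):
    """Return (namespace, name) pairs for jobs whose phase is Completed.

    Assumes the JSON shape produced by `kubectl get ... -o json` and that
    the phase lives at status.state.phase (an assumption, see lead-in).
    """
    items = json.loads(job_list_json).get("items", [])
    done = []
    for item in items:
        phase = item.get("status", {}).get("state", {}).get("phase")
        if phase == "Completed":
            meta = item["metadata"]
            done.append((meta.get("namespace", "default"), meta["name"]))
    return done


def cleanup_loop(interval_seconds=3):
    """Hypothetical loop: list jobs, delete the completed ones, sleep, repeat."""
    while True:
        listing = subprocess.run(
            ["kubectl", "get", "jobs.batch.volcano.sh", "-A", "-o", "json"],
            check=True, capture_output=True, text=True,
        ).stdout
        for namespace, name in completed_jobs(listing):
            subprocess.run(
                ["kubectl", "delete", "jobs.batch.volcano.sh", name, "-n", namespace],
                check=True,
            )
        time.sleep(interval_seconds)
```

Calling `cleanup_loop()` starts the loop with the 3-second interval described above; it is deliberately not invoked here, since it deletes resources.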

Describe the results you received and expected

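Received: the controller logs repeated "podgroup not found" errors for the SyncQueue action after jobs complete and are deleted.

Expected: the controller should sync the queue without these errors once the jobs have completed and been cleared.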

What version of Volcano are you using?

v1.10.0

Any other relevant information

After the cluster has been running for some time, I observe that certain podgroups stay pending forever, with an event indicating "queue resource quota insufficient" even though the cluster actually has enough resources available. Restarting the scheduler lets them run normally again, which resolves the issue temporarily. Could this "queue resource quota insufficient" issue be related to the repeated SyncQueue errors, and to potentially outdated or inconsistent resource status within the queue?

Monokaix commented 1 day ago

Hi, there is a related PR doing the refactor that will probably solve your problem: https://github.com/volcano-sh/volcano/pull/3751