volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

[Proposal] Add podgroup statistics doc #3750

Closed: JesseStutler closed this 3 weeks ago

JesseStutler commented 1 month ago

Fixes #3597

Background

Each time a podgroup's state changes, the controller updates the per-state podgroup statistics in the queue's status. At the end of each scheduling session, the Volcano scheduler also updates the allocated field in the queue's status to record the amount of resources allocated. Both components use the UpdateStatus API to write the queue status, so their writes can race and cause conflict errors. When the controller encounters such an error, it calls AddRateLimited to push the podgroup back into the work queue, and the requeued items accumulate into a memory leak. See issue #3597.

Alternative

Currently the per-state podgroup statistics are only used for display by vcctl, so there is no need to persist them in the queue's status. When users run vcctl queue get -n [name] or vcctl list to display queues and the number of podgroups in each state, vcctl should calculate the podgroup statistics on the client side and then display them. These per-state podgroup statistics can also be exported as metrics.
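A minimal sketch of the client-side calculation, assuming vcctl has already listed the podgroups: it groups them by phase for one queue and renders the counts as Prometheus text-exposition lines. The `PodGroup` struct is a hypothetical simplification of the real PodGroup type, and the metric name `queue_podgroup_count` is illustrative only, not the name used by Volcano.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// PodGroup is a hypothetical, simplified view of a Volcano PodGroup holding
// only the fields needed for the statistics.
type PodGroup struct {
	Name  string
	Queue string
	Phase string // e.g. "Pending", "Inqueue", "Running", "Unknown", "Completed"
}

// CountByPhase counts, on the client side, how many podgroups of the given
// queue are in each phase, instead of reading counters from the queue status.
func CountByPhase(pgs []PodGroup, queue string) map[string]int {
	counts := make(map[string]int)
	for _, pg := range pgs {
		if pg.Queue == queue {
			counts[pg.Phase]++
		}
	}
	return counts
}

// FormatMetrics renders the counts as Prometheus text-exposition lines,
// sorted by phase for stable output. The metric name is hypothetical.
func FormatMetrics(queue string, counts map[string]int) string {
	phases := make([]string, 0, len(counts))
	for p := range counts {
		phases = append(phases, p)
	}
	sort.Strings(phases)
	var b strings.Builder
	for _, p := range phases {
		fmt.Fprintf(&b, "queue_podgroup_count{queue=%q,phase=%q} %d\n", queue, p, counts[p])
	}
	return b.String()
}

func main() {
	pgs := []PodGroup{
		{"pg-a", "default", "Running"},
		{"pg-b", "default", "Pending"},
		{"pg-c", "default", "Running"},
		{"pg-d", "other", "Inqueue"},
	}
	fmt.Print(FormatMetrics("default", CountByPhase(pgs, "default")))
	// queue_podgroup_count{queue="default",phase="Pending"} 1
	// queue_podgroup_count{queue="default",phase="Running"} 2
}
```

Because the counts are derived from the podgroup list at display time, nothing needs to be written back to the queue, so this path cannot produce UpdateStatus conflicts.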

Implementation

JesseStutler commented 1 month ago

/cc @hwdef @Monokaix @lowang-bh Please take a look~ I will push the other related PRs later.

Monokaix commented 1 month ago

Should also add that the queue's pg statistics are still maintained in the queue cache, because this data is needed when closing a queue.
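A minimal sketch of why the cache still matters, assuming a close request moves a queue that still holds podgroups to Closing and an empty queue to Closed. The `QueueCache` type and its methods are hypothetical; they only illustrate keeping per-queue podgroup membership in memory instead of persisting counters in the queue status.

```go
package main

import "fmt"

// QueueCache is a hypothetical in-memory cache kept by the controller: it
// tracks which podgroups belong to each queue without persisting counters.
type QueueCache struct {
	podGroups map[string]map[string]struct{} // queue -> set of podgroup keys
}

func NewQueueCache() *QueueCache {
	return &QueueCache{podGroups: make(map[string]map[string]struct{})}
}

// AddPodGroup records that a podgroup belongs to a queue.
func (c *QueueCache) AddPodGroup(queue, pg string) {
	if c.podGroups[queue] == nil {
		c.podGroups[queue] = make(map[string]struct{})
	}
	c.podGroups[queue][pg] = struct{}{}
}

// DeletePodGroup removes a podgroup; deleting from a missing queue is a no-op.
func (c *QueueCache) DeletePodGroup(queue, pg string) {
	delete(c.podGroups[queue], pg)
}

// StateAfterClose decides the queue state for a close request: a queue that
// still holds podgroups can only become Closing; an empty one becomes Closed.
func (c *QueueCache) StateAfterClose(queue string) string {
	if len(c.podGroups[queue]) > 0 {
		return "Closing"
	}
	return "Closed"
}

func main() {
	c := NewQueueCache()
	c.AddPodGroup("q1", "ns/pg-a")
	fmt.Println(c.StateAfterClose("q1")) // Closing
	c.DeletePodGroup("q1", "ns/pg-a")
	fmt.Println(c.StateAfterClose("q1")) // Closed
}
```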

JesseStutler commented 3 weeks ago


I added a notice item to record this; please review it again.

Monokaix commented 3 weeks ago

/lgtm /approve

volcano-sh-bot commented 3 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix
