volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.11k stars 949 forks source link

Garbage Collector recalculates `job.Status.RunningDuration` after restarting controller #3715

Closed matbme closed 3 days ago

matbme commented 2 weeks ago

Description

When restarting a volcano-controllers-xxxx pod, the Garbage Collector seems to check if previously completed jobs are ready for cleanup and "kills" them again, thus resulting in it updating the job.Status.RunningDuration variable, even though the job had already been completed before.

I use the job duration variable to indicate to a user how long a task has taken after it was completed, which ends up changing every time the controller is restarted due to some issue.

Controller logs ``` I0910 13:24:49.411220 1 job_controller.go:269] worker 2 start ...... I0910 13:24:49.411222 1 job_controller.go:269] worker 1 start ...... I0910 13:24:49.411236 1 job_controller.go:269] worker 0 start ...... I0910 13:24:49.411234 1 job_controller.go:320] Try to handle request I0910 13:24:49.411234 1 job_controller.go:320] Try to handle request I0910 13:24:49.411296 1 job_controller.go:337] Execute on Job in by <*state.finishedState>. I0910 13:24:49.411295 1 job_controller.go:337] Execute on Job in by <*state.finishedState>. I0910 13:24:49.411306 1 job_controller_actions.go:48] Killing Job , current version 8 I0910 13:24:49.411307 1 job_controller_actions.go:48] Killing Job , current version 8 I0910 13:24:49.411351 1 job_controller_actions.go:122] Running duration is 20h3m37.41134596s I0910 13:24:49.411351 1 job_controller_actions.go:122] Running duration is 21h8m12.411345963s W0910 13:24:49.421976 1 warnings.go:70] unknown field "spec.tasks[0].template.metadata.creationTimestamp" I0910 13:24:49.423087 1 garbagecollector.go:124] Updating job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-log-spam I0910 13:24:49.423102 1 garbagecollector.go:132] Add job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-log-spam to cleanup I0910 13:24:49.423117 1 garbagecollector.go:191] Checking if Job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-log-spam is ready for cleanup W0910 13:24:49.423088 1 warnings.go:70] unknown field "spec.tasks[0].template.metadata.creationTimestamp" I0910 13:24:49.423129 1 garbagecollector.go:290] Found Job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-log-spam finished at 2024-09-09 17:23:44 +0000 UTC, remaining TTL 51h58m54.57687801s since 2024-09-10 13:24:49.42312199 +0000 UTC, TTL will expire at 2024-09-12 17:23:44 +0000 UTC I0910 13:24:49.423814 1 garbagecollector.go:124] Updating job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-workload-4 I0910 13:24:49.423828 1 garbagecollector.go:132] Add job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-workload-4 to cleanup I0910 13:24:49.423837 1 garbagecollector.go:191] Checking if Job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-workload-4 is ready for cleanup I0910 13:24:49.423846 1 garbagecollector.go:290] Found Job c774da7d-aff4-42cb-9eb0-f59912acecbb/test-workload-4 finished at 2024-09-09 16:19:24 +0000 UTC, remaining TTL 50h54m34.576158806s since 2024-09-10 13:24:49.423841194 +0000 UTC, TTL will expire at 2024-09-12 16:19:24 +0000 UTC ```

Steps to reproduce the issue

  1. Create a vcjob and wait for it to complete
  2. Observe Running Duration in kubectl describe vcjob ...
  3. Restart the Volcano controller
  4. Observe Running Duration in kubectl describe vcjob ... again, the value has changed even though the job was already completed

Describe the results you received and expected

If I understand it correctly, Running Duration should indicate how long a job took from start until completion, so this variable shouldn't change after the controller is restarted.

What version of Volcano are you using?

1.8.0

Any other relevant information

I tracked down the issue to this line in the Garbage Collector which should probably use the job's end time instead of the current time.

However, I'm not entirely sure killJob should be called at all given that it has already been killed before.

Monokaix commented 2 weeks ago

Seems killJob truly calculates a wrong time duration, I think the right way is using the time in complete condition - the creationtime.

Monokaix commented 2 weeks ago

/good-first-issue

volcano-sh-bot commented 2 weeks ago

@Monokaix: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/volcano-sh/volcano/issues/3715): >/good-first-issue Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.