volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0
4.06k stars 936 forks source link

Garbage Collector is supposed to clean up Aborted volcanojobs #3604

Closed Antsypc closed 1 month ago

Antsypc commented 1 month ago

What happened: garbagecollector controller does not clean up any Aborted volcanojobs where the .status.state.phase is Aborted.

What you expected to happen: Aborted jobs should be cleaned up.

How to reproduce it (as minimally and precisely as possible):

  1. create a volcanojob with .spec.ttlSecondsAfterFinished specified.
  2. abort this job.
  3. kubectl get vj xxx, you will see that thee job's .status.state.phase is Aborted.
  4. wait for ttlSecondsAfterFinished seconds.
  5. however, this job still exists.

Anything else we need to know?: The bug seems to occur in the following code. https://github.com/volcano-sh/volcano/blob/v1.10.0-alpha.0/pkg/controllers/garbagecollector/garbagecollector.go#L266

func isJobFinished(job *v1alpha1.Job) bool {
    return job.Status.State.Phase == v1alpha1.Completed ||
        job.Status.State.Phase == v1alpha1.Failed ||
        job.Status.State.Phase == v1alpha1.Terminated
}

The condition job.Status.State.Phase == v1alpha1.Aborted should not be omitted.

Environment:

Monokaix commented 1 month ago

I think it's reasonable: )

Monokaix commented 1 month ago

/good-first-issue

volcano-sh-bot commented 1 month ago

@Monokaix: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.

In response to [this](https://github.com/volcano-sh/volcano/issues/3604): >/good-first-issue Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
kyooosukedn commented 1 month ago

Would like to try on this :)

googs1025 commented 1 month ago

/assign

googs1025 commented 1 month ago

Would like to try on this :)

@kyooosukedn Oh! Do you want to try this topic? Then I'll give it to you first. I can help review it.

googs1025 commented 1 month ago

/assign @kyooosukedn

kyooosukedn commented 1 month ago

3637

googs1025 commented 1 month ago

I think the key question is: how do we distinguish between "Aborted", "Terminated", and "Failed"? Are these three states defined as the final state of the Job? I can't easily distinguish the difference between these three states of the Job in the code. Why do we need to divide them into three?

googs1025 commented 1 month ago

So we must confirm one thing, which states are the final states. And these final states can be recycled by Garbage Collector

Monokaix commented 1 month ago

@Antsypc Can you describe that under what will you abort a job,and can the job be resumed by command vcctl job resume?

Monokaix commented 1 month ago
image

Aha, aborted job can be resumed: )

googs1025 commented 1 month ago
image

Aha, aborted job can be resumed: )

Based on this result, I think the aborted state is not a final state.

Monokaix commented 1 month ago
image

Aha, aborted job can be resumed: )

Based on this result, I think the aborted state is not a final state.

Yeah!

Monokaix commented 1 month ago

@Antsypc Aborted is not a final state and can be resumed so we will close this.

Monokaix commented 1 month ago

/close

volcano-sh-bot commented 1 month ago

@Monokaix: Closing this issue.

In response to [this](https://github.com/volcano-sh/volcano/issues/3604#issuecomment-2261758131): >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.