kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0
110.54k stars 39.51k forks source link

Stateful Indexed Job #115066

Open ahg-g opened 1 year ago

ahg-g commented 1 year ago

What would you like to be added?

An automated way of creating PVCs for the individual pods of Indexed Job. This is similar to PersistentVolumeClaimTemplate in StatefulSets.

Why is this needed?

Many long running training workloads use checkpointing to recover from failures. In some cases, the individual workers create node-local checkpoints. To ensure that on failure the replacement pod restarts fast, it should be scheduled on the same node as the failed pod. At the same time, if the node itself fails and gone, the replacement pod is free to schedule on any available node.

/wg batch /sig storage /sig apps

k8s-ci-robot commented 1 year ago

@ahg-g: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
ahg-g commented 1 year ago

/cc @msau42 @mattcary

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kubernetes/issues/115066#issuecomment-1588116042): >The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. > >This bot triages issues according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue with `/reopen` >- Mark this issue as fresh with `/remove-lifecycle rotten` >- Offer to help out with [Issue Triage][1] > >Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community). > >/close not-planned > >[1]: https://www.kubernetes.dev/docs/guide/issue-triage/ Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
ahg-g commented 1 year ago

/re-open

ahg-g commented 1 year ago

/reopen

k8s-ci-robot commented 1 year ago

@ahg-g: Reopened this issue.

In response to [this](https://github.com/kubernetes/kubernetes/issues/115066#issuecomment-1640710199): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
msau42 commented 1 year ago

/remove-lifecycle rotten

tenzen-y commented 10 months ago

An automated way of creating PVCs for the individual pods of Indexed Job.

@ahg-g Which does this mean, a. "Create multiple PVCs per Index" or b. "Create a single PVC for all indexes"? Anyway, I believe it would be useful if we could select PVC creation mode a or b.

The b is useful for storing large datasets and pre-trained models and then sharing data to all indexes.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tenzen-y commented 7 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

tenzen-y commented 4 months ago

/remove-lifecycle stale

hunjixin commented 2 months ago

want this feature too