kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Pod name using generated name #215

Closed yowenter closed 1 year ago

yowenter commented 1 year ago

If a job is recreated with the same name as a deleted job while the deleted job's pods are still in the Terminating state, creating the new job's pods will fail. So we'd better use generated pod names. I found that the Kubernetes Job implementation also uses generated pod names.
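
For context, a minimal sketch of the difference (pod and job names here are made up, this is not kubeflow/common code): with a fixed `Name`, recreating the pod collides with one that is still Terminating; with `GenerateName`, the API server appends a random suffix, so the new pod never collides.

```go
// Hedged sketch only, not kubeflow/common code: fixed Name vs GenerateName.
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// fixedNamePod uses a deterministic name; recreating it while a Pod with the
// same name is still Terminating fails with an AlreadyExists error.
func fixedNamePod() *corev1.Pod {
	return &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "myjob-worker-0"}}
}

// generatedNamePod asks the API server to append a random suffix
// (e.g. "myjob-worker-0-x7k2p"), so it never collides with a Terminating Pod.
func generatedNamePod() *corev1.Pod {
	return &corev1.Pod{ObjectMeta: metav1.ObjectMeta{GenerateName: "myjob-worker-0-"}}
}
```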

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/common/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

tenzen-y commented 1 year ago

/cc

gaocegege commented 1 year ago

Will it break the existing code? E.g. the TFConfig generation.

yowenter commented 1 year ago

> Will it break the existing code? E.g. the TFConfig generation.

TensorFlow has overridden the ReconcilePods func, and the pod name there is not generated, so it will be OK. However, the MXNet job has a dependency on the pod name; let me see how to fix it.

yowenter commented 1 year ago

Hi @gaocegege, I found that the MXNet implementation relies on the pod host:

```go
// genClusterSpec will generate ClusterSpec.
func genClusterSpec(mxjob *kubeflowv1.MXJob) (ClusterSpec, error) {
    clusterSpec := make(ClusterSpec)

    for rtype, spec := range mxjob.Spec.MXReplicaSpecs {
        rt := strings.ToLower(string(rtype))
        replicaNames := make([]UrlPort, 0, *spec.Replicas)

        port, err := getPortFromMXJob(mxjob, rtype)
        if err != nil {
            return nil, err
        }
        for i := int32(0); i < *spec.Replicas; i++ {
            host := UrlPort{
                Url:  common.GenGeneralName(mxjob.Name, rt, fmt.Sprintf("%d", i)),
                Port: int(port),
            }
            replicaNames = append(replicaNames, host)
        }

        clusterSpec[rt] = replicaNames
    }

    return clusterSpec, nil
}
```

So, the MXNet job pod names must be specified explicitly. Maybe I need to add an MXNet-specific implementation of createNewPod. What do you think about it?
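
A rough sketch of what an MXNet-specific override could look like; `newMXNetPod` and `genGeneralName` here are hypothetical stand-ins, not the actual kubeflow/common or mxnet-operator API. The idea is to keep the deterministic name that `genClusterSpec` writes into the ClusterSpec, while the other job kinds move to GenerateName:

```go
// Hypothetical sketch only; the real override point and signatures may differ.
package example

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// genGeneralName mirrors the deterministic <job>-<replicaType>-<index> naming
// that genClusterSpec above relies on for hostnames.
func genGeneralName(jobName, rtype, index string) string {
	return fmt.Sprintf("%s-%s-%s", jobName, rtype, index)
}

// newMXNetPod keeps a fixed, deterministic Name so the Pod's hostname matches
// the ClusterSpec, instead of switching to GenerateName like the other kinds.
func newMXNetPod(jobName, namespace, rtype string, index int32, template corev1.PodTemplateSpec) *corev1.Pod {
	pod := &corev1.Pod{
		ObjectMeta: *template.ObjectMeta.DeepCopy(),
		Spec:       *template.Spec.DeepCopy(),
	}
	pod.Name = genGeneralName(jobName, rtype, fmt.Sprintf("%d", index))
	pod.Namespace = namespace
	return pod
}
```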

gaocegege commented 1 year ago

/cc @kubeflow/wg-training-leads

tenzen-y commented 1 year ago

> Will it break the existing code? E.g. the TFConfig generation.
>
> TensorFlow has overridden the ReconcilePods func, and the pod name there is not generated, so it will be OK. However, the MXNet job has a dependency on the pod name; let me see how to fix it.

We are considering fully consolidating the tf-operator into the training-operator. So, this change will affect the TFJob.

https://github.com/kubeflow/training-operator/issues/1727

tenzen-y commented 1 year ago

Either way, we can improve handling of terminating Pods once we introduce the batch/job API.

https://github.com/kubeflow/training-operator/issues/1718
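
For illustration, a minimal sketch assuming the upstream batch/v1 Go types (not training-operator code): newer Kubernetes releases added a `podReplacementPolicy` field to the batch/v1 Job spec that controls whether replacement Pods are created while old ones are still Terminating, which is exactly the behavior being discussed here.

```go
// Minimal sketch, assuming a Kubernetes release with the JobPodReplacementPolicy
// feature (alpha in v1.28, after this discussion); names are illustrative only.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func failedOnlyReplacementJob() *batchv1.Job {
	// Only create replacement Pods once old ones are fully Failed,
	// never while they are still Terminating.
	policy := batchv1.Failed
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-job"},
		Spec: batchv1.JobSpec{
			PodReplacementPolicy: &policy,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers:    []corev1.Container{{Name: "main", Image: "busybox"}},
				},
			},
		},
	}
}
```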

yowenter commented 1 year ago

> Either way, we can improve handling of terminating Pods once we introduce the batch/job API.
>
> kubeflow/training-operator#1718

@tenzen-y It would be good if the training operator reused the Kubernetes batch Job API. By the way, when will the refactored job feature be released?

tenzen-y commented 1 year ago

> By the way, when will the refactored job feature be released?

As the training operator needs the elastic Indexed Job feature, available since K8s v1.27, we will introduce batch/job after K8s v1.26 reaches EOL.
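
For reference, a minimal sketch (assuming the upstream batch/v1 Go types, not training-operator code) of the Indexed Job shape this would build on; with the ElasticIndexedJob feature in K8s v1.27+, `parallelism` and `completions` of such a Job can be scaled together after creation.

```go
// Minimal sketch, assuming batch/v1; names like "workers" are illustrative only.
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func indexedWorkerJob(replicas int32) *batchv1.Job {
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "workers"},
		Spec: batchv1.JobSpec{
			CompletionMode: &mode,
			// With ElasticIndexedJob (K8s v1.27+), these two fields can be
			// mutated together to scale the job up or down.
			Parallelism: &replicas,
			Completions: &replicas,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					// Each Pod reads its index from the JOB_COMPLETION_INDEX
					// env var injected by the Job controller.
					Containers: []corev1.Container{{Name: "worker", Image: "busybox"}},
				},
			},
		},
	}
}
```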

tenzen-y commented 1 year ago

Also, I'm working on adding a success policy feature, similar to the TFJob's success policy, to batch/job. However, I'm not sure when the success policy feature will graduate to beta. So, as a first step for introducing batch/job, it would be good to implement the feature the same way TFJob currently does on the training-operator side.

Moreover, I'm thinking of introducing the JobSet API instead of batch/job, although I think we need to discuss whether we should introduce the JobSet API.

yowenter commented 1 year ago

> Also, I'm working on adding a success policy feature, similar to the TFJob's success policy, to batch/job. However, I'm not sure when the success policy feature will graduate to beta. So, as a first step for introducing batch/job, it would be good to implement the feature the same way TFJob currently does on the training-operator side.
>
> Moreover, I'm thinking of introducing the JobSet API instead of batch/job, although I think we need to discuss whether we should introduce the JobSet API.

Good, I'm closing this PR for now.