kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

[feature] Can we use one headless service for one job? #1030

Open gaocegege opened 5 years ago

gaocegege commented 5 years ago

A TFJob has ps, worker, and chief replicas, and we currently create one headless service per replica. I think we could use a single headless service per job instead, which would be easier to use.

After that, we could use `{tfjob_name}-{replica_type}-{index}.{service_name}.svc.cluster.local` in the code.
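
To make the proposed naming concrete, here is a minimal sketch of how such an address could be assembled. The helper `podFQDN` and its parameters are hypothetical and not part of training-operator. It also assumes the usual Kubernetes form for a pod behind a headless service, `<hostname>.<subdomain>.<namespace>.svc.<cluster-domain>`, so it includes a namespace segment that the pattern above leaves out.

```go
package main

import (
	"fmt"
	"strings"
)

// podFQDN is a hypothetical helper (not an existing training-operator function).
// It builds the DNS name of one replica pod behind a single per-job headless
// service, assuming the pod's hostname is "<job>-<replica type>-<index>" and
// its subdomain is the service name.
func podFQDN(jobName, replicaType string, index int, service, namespace, clusterDomain string) string {
	host := fmt.Sprintf("%s-%s-%d", jobName, strings.ToLower(replicaType), index)
	return fmt.Sprintf("%s.%s.%s.svc.%s", host, service, namespace, clusterDomain)
}

func main() {
	// Example values only; the job, service, and namespace names are made up.
	fmt.Println(podFQDN("mnist", "Worker", 0, "mnist", "default", "cluster.local"))
}
```

With one service per job, every replica type shares the same `service` argument, so only the hostname part changes between ps, worker, and chief pods.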

WDYT @johnugeorge @richardsliu

issue-label-bot[bot] commented 5 years ago

Issue-Label Bot is automatically applying the label improvement/enhancement to this issue, with a confidence of 0.70. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!


jtfogarty commented 4 years ago

/area engprod /priority p2

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 1 year ago

/reopen

We should take this to improve cluster performance.

google-oss-prow[bot] commented 1 year ago

@tenzen-y: Reopened this issue.

In response to [this](https://github.com/kubeflow/training-operator/issues/1030#issuecomment-1640940083):

> /reopen
> We should take this to improve cluster performance.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

tenzen-y commented 1 year ago

I realized we need this after reading Aldo's comment.

cc: @kubeflow/wg-training-leads

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 1 year ago

/lifecycle frozen

kannon92 commented 6 months ago

@tenzen-y brought this up in brainstorming around jobset/kubeflow.

We have implemented a few ways to customize network names.

kannon92 commented 6 months ago

```go
type Network struct {
    // EnableDNSHostnames allows pods to be reached via their hostnames.
    // Pods will be reachable using the fully qualified pod hostname:
    // <jobSet.name>-<spec.replicatedJob.name>-<job-index>-<pod-index>.<subdomain>
    // +optional
    EnableDNSHostnames *bool `json:"enableDNSHostnames,omitempty"`

    // Subdomain is an explicit choice for a network subdomain name
    // When set, any replicated job in the set is added to this network.
    // Defaults to <jobSet.name> if not set.
    // +optional
    Subdomain string `json:"subdomain,omitempty"`
}
```

This is what we used to control service creation for the JobSet.
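
As a rough illustration of the naming rule described in the struct comments, the sketch below derives a pod hostname from a JobSet name and a `Network` value. It is not JobSet's actual implementation: the local `Network` copy and the helpers `subdomainFor` and `podHostname` are assumptions made for this example, and only the defaulting rule (subdomain falls back to the JobSet name) and the hostname pattern are taken from the comments above.

```go
package main

import "fmt"

// Network is a local copy of the fields quoted above, kept here so the
// example compiles on its own.
type Network struct {
	EnableDNSHostnames *bool
	Subdomain          string
}

// subdomainFor (hypothetical) applies the documented default:
// use the explicit Subdomain if set, otherwise the JobSet name.
func subdomainFor(jobSetName string, n *Network) string {
	if n != nil && n.Subdomain != "" {
		return n.Subdomain
	}
	return jobSetName
}

// podHostname (hypothetical) follows the pattern from the struct comment:
// <jobSet.name>-<replicatedJob.name>-<job-index>-<pod-index>.<subdomain>
func podHostname(jobSetName, replicatedJob string, jobIndex, podIndex int, n *Network) string {
	return fmt.Sprintf("%s-%s-%d-%d.%s",
		jobSetName, replicatedJob, jobIndex, podIndex, subdomainFor(jobSetName, n))
}

func main() {
	// No explicit subdomain: falls back to the JobSet name.
	fmt.Println(podHostname("train", "worker", 0, 1, nil))
	// Explicit subdomain shared by all replicated jobs in the set.
	fmt.Println(podHostname("train", "ps", 0, 0, &Network{Subdomain: "train-net"}))
}
```

Because every replicated job resolves under the same subdomain, a single headless service can cover the whole set, which is the behavior this issue asks for in TFJob.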

gaocegege commented 6 months ago

The suffix may differ from `.svc.cluster.local` depending on the cluster settings. Maybe we could add a CLI parameter to configure it.
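
One possible shape for such a parameter, sketched with Go's standard `flag` package. The flag name `--cluster-domain` and the helper `parseClusterDomain` are hypothetical; training-operator may expose this differently or not at all.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseClusterDomain (hypothetical) reads a --cluster-domain flag from args,
// defaulting to "cluster.local" when the flag is not given.
func parseClusterDomain(args []string) (string, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	domain := fs.String("cluster-domain", "cluster.local",
		"DNS suffix of the cluster, used when building pod addresses")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *domain, nil
}

func main() {
	domain, err := parseClusterDomain(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	// Example address only; the pod, service, and namespace names are made up.
	fmt.Printf("my-job-worker-0.my-job.default.svc.%s\n", domain)
}
```

Keeping the default at `cluster.local` would preserve current behavior on clusters that do not override the domain.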