kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Add DispatchServer and WorkerServer to TFJob #1529

Open lukepfister opened 2 years ago

lukepfister commented 2 years ago

The TFJob operator currently supports PS, Chief, Worker, and Evaluator.

Is there any appetite for adding new process types to cover DispatchServer and WorkerServer? I'm happy to contribute a PR.
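
For context, here is a rough sketch of what the request might look like on the manifest side. It assumes the two names refer to TensorFlow's experimental tf.data service servers, and that they would appear as extra keys under `tfReplicaSpecs` next to the existing replica types. Neither key is part of the TFJob API today. The job name, image, and the helpers `replica_spec` and `build_tfjob` are hypothetical, made up for illustration.

```python
import json

# Replica types the issue says TFJob supports today.
SUPPORTED_TYPES = ("PS", "Chief", "Worker", "Evaluator")
# Replica types proposed in this issue. ASSUMPTION: hypothetical keys,
# not part of the TFJob API.
PROPOSED_TYPES = ("DispatchServer", "WorkerServer")


def replica_spec(image, replicas=1):
    """Build one replica spec with a single container named "tensorflow"."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {"containers": [{"name": "tensorflow", "image": image}]}
        },
    }


def build_tfjob(name, image, replica_counts):
    """Return a TFJob manifest as a dict; reject unknown replica types."""
    allowed = SUPPORTED_TYPES + PROPOSED_TYPES
    unknown = [t for t in replica_counts if t not in allowed]
    if unknown:
        raise ValueError(f"unknown replica types: {unknown}")
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                rtype: replica_spec(image, count)
                for rtype, count in replica_counts.items()
            }
        },
    }


manifest = build_tfjob(
    "example-job",                 # hypothetical job name
    "example.com/trainer:latest",  # hypothetical image
    {"Chief": 1, "Worker": 2, "DispatchServer": 1, "WorkerServer": 2},
)
print(json.dumps(manifest, indent=2))
```

Applying a manifest like this to a cluster would presumably be rejected by the current operator, since only the four existing replica types are recognized; the sketch only shows the shape of the proposed change.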

zw0610 commented 2 years ago

That would be very helpful. Please file a PR for this feature at your convenience, and let us know if you need any help.

terrytangyuan commented 2 years ago

I am a little concerned that these are still experimental. They could be removed at any time.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 1 year ago

/lifecycle frozen