kubeflow / website

Kubeflow's public website
Creative Commons Attribution 4.0 International

How do you add other machines? This example just created replicas within a single machine? Is Kubeflow not capable of adding other machines? #3706

Open warmbasket opened 3 months ago

warmbasket commented 3 months ago

How do you add other machines? This example just created replicas within a single machine? Is Kubeflow not capable of adding other machines?

warmbasket commented 3 months ago

https://www.kubeflow.org/docs/components/training/overview/ "create a TFJob/PyTorchJob with required number PSs, workers, and GPUs using Training Operator Python SDK." But it doesn't seem possible to add workers or PSs from other machines using the Training Operator Python SDK, only replicas within a single machine?

andreyvelich commented 3 months ago

@akrupien Could you please explain what you mean by "add workers, or Ps from other machines"? When you add more workers, the Training Operator creates more Kubernetes pods, and those pods are scheduled onto appropriate Kubernetes nodes. You can also specify a Pod node selector if you want Pods to be assigned to a specific Kubernetes node (machine).
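A minimal sketch of the manifest this describes, written as a plain Python dict so it does not depend on any particular SDK version. The job name, image, and the `pool: gpu` node label are hypothetical, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster. Each worker replica becomes one pod, and the scheduler decides which node each pod lands on, restricted here by `nodeSelector`.

```python
import json


def build_tfjob(name, image, workers, gpus_per_worker, node_selector=None):
    """Return a TFJob manifest (apiVersion kubeflow.org/v1) as a plain dict."""
    pod_spec = {
        "containers": [
            {
                # TFJob expects the training container to be named "tensorflow".
                "name": "tensorflow",
                "image": image,
                # Assumption: GPUs are exposed through the NVIDIA device plugin.
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
            }
        ]
    }
    if node_selector:
        # Only nodes carrying all of these labels are eligible for the pod.
        pod_spec["nodeSelector"] = dict(node_selector)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": workers,  # one pod per replica
                    "restartPolicy": "OnFailure",
                    "template": {"spec": pod_spec},
                }
            }
        },
    }


# Hypothetical values for illustration only.
job = build_tfjob(
    name="mwms-demo",
    image="example.com/my-training:latest",
    workers=3,
    gpus_per_worker=4,
    node_selector={"pool": "gpu"},
)
print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML. Note that a node selector only narrows the set of eligible nodes; by itself it does not guarantee that each worker pod ends up on a different machine.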

warmbasket commented 3 months ago

@andreyvelich Thank you, I think you sort of read my mind. I want each pod to be assigned to a single machine/computer. In my case, I want each pod to have an entire machine/computer to itself.

By "add workers, or Ps from other machines", I mean that I have multiple computers/machines, each with multiple GPUs, and each computer/machine should be its own worker.

When I create a TFJob with the required number of workers using the Training Operator, I'd expect it to match the TF config in my TensorFlow distributed training. So I should be able to add my individual computers/machines as workers?

I am using MultiWorkerMirroredStrategy in my TensorFlow distributed training with multiple computers/machines. Each computer/machine is its own worker.
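For context on how the worker list reaches the training code: the Training Operator sets a `TF_CONFIG` environment variable in every TFJob pod, and MultiWorkerMirroredStrategy reads the cluster layout from it. The sketch below only parses that variable with the standard library, so it runs without TensorFlow. The example value is illustrative: the host names and port are assumptions about what a two-worker job might receive, not output captured from a real cluster.

```python
import json
import os

# Illustrative TF_CONFIG for a two-worker job; host names and port are assumed.
EXAMPLE_TF_CONFIG = json.dumps(
    {
        "cluster": {
            "worker": [
                "mwms-demo-worker-0.default.svc:2222",
                "mwms-demo-worker-1.default.svc:2222",
            ]
        },
        "task": {"type": "worker", "index": 1},
    }
)


def describe_tf_config(raw):
    """Summarize the cluster layout described by a TF_CONFIG JSON string."""
    cfg = json.loads(raw)
    cluster = cfg.get("cluster", {})
    task = cfg.get("task", {})
    task_type = task.get("type")
    task_index = task.get("index")
    # Without a dedicated "chief" entry, worker 0 takes the chief role.
    is_chief = task_type == "chief" or (
        "chief" not in cluster and task_type == "worker" and task_index == 0
    )
    return {
        "num_workers": len(cluster.get("worker", [])),
        "task_type": task_type,
        "task_index": task_index,
        "is_chief": is_chief,
    }


# Inside a TFJob pod the variable is already set; fall back to the example otherwise.
summary = describe_tf_config(os.environ.get("TF_CONFIG", EXAMPLE_TF_CONFIG))
print(summary)
```

Each address in the `worker` list corresponds to one pod, so when the pods run on different nodes, the strategy sees one worker per machine without any hand-written TF config.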

https://www.kubeflow.org/docs/components/training/tftraining/ The TFReplicaSpec in the Training Operator SDK doesn't seem to provide an option for adding individual machines as workers, only replicas within a single machine? But maybe I'm missing something under Spec, if TFReplicaSpec is not required for a TFJob.

Your link seems to assign a pod to a node. Is it possible in my situation to use pod affinity to add my multiple workers/computers/machines to a TFJob?
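One possible way to get "one worker pod per machine" is pod anti-affinity rather than pod affinity: a required anti-affinity rule keyed on `kubernetes.io/hostname` stops two pods with the same label from sharing a node. The sketch below adds such a rule to a pod template dict. The `app: mwms-worker` label is hypothetical, and the sketch assumes that labels set on the replica's pod template are carried through to the pods the operator creates.

```python
import copy
import json


def one_worker_per_node(pod_template, label_key="app", label_value="mwms-worker"):
    """Return a copy of pod_template whose pods refuse to share a node.

    The label key and value are hypothetical; any label unique to the job works.
    """
    tpl = copy.deepcopy(pod_template)
    # Label the pods so the anti-affinity selector below can match them.
    tpl.setdefault("metadata", {}).setdefault("labels", {})[label_key] = label_value
    tpl.setdefault("spec", {})["affinity"] = {
        "podAntiAffinity": {
            # "required" makes this a hard constraint: a pod stays Pending
            # if no node without a matching pod is available.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {label_key: label_value}},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
    return tpl


# Hypothetical worker template; the image name is a placeholder.
template = {
    "spec": {
        "containers": [
            {"name": "tensorflow", "image": "example.com/my-training:latest"}
        ]
    }
}
spread = one_worker_per_node(template)
print(json.dumps(spread, indent=2))
```

The resulting dict would go under `tfReplicaSpecs.Worker.template` in the TFJob. With a hard rule like this, the number of worker replicas cannot exceed the number of eligible nodes, or the extra pods will not be scheduled.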

In my situation, a node is a machine, which is an individual computer, which runs its own single pod.

I am essentially asking how to use the Training Operator to add my workers/computers/machines as pods, each on its own node.

Ideally, my entire cluster would be a single pod, but that doesn't seem possible.