kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

gang schedule bug #186

Closed zlcnju closed 5 years ago

zlcnju commented 5 years ago

Currently, the PyTorch operator does not create the worker pods until the master pod is running. However, kube-batch does not schedule anything until all pods in the gang have been created. So when gang scheduling is used, the master pod stays in the Pending state forever.

A quick fix is to exclude the master pod from the batch. That way the master is created first, and the PodGroup and the worker pods are created only after the master pod is running. With this solution there is a possibility that the master pods consume all of the resources.

Another solution is to add an init container to the worker pods. In this solution, all pods (master and workers) are created together, but each worker's init container waits until the master pod is ready (a rough sketch is below). This solution is better, but it brings in an extra init container.
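For illustration, here is a minimal sketch of what solution 2 could look like on the worker side. The pod and service names and the busybox image are assumptions for the example, not what the operator actually generates:

```yaml
# Hypothetical worker pod fragment illustrating solution 2; names and images are examples.
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-job-worker-0           # assumed worker pod name
spec:
  initContainers:
    - name: wait-for-master
      image: busybox                    # assumed helper image
      # Poll until the master's Service name resolves (i.e. the master pod is up);
      # only then does the worker's main container start.
      command:
        - sh
        - -c
        - until nslookup pytorch-job-master-0; do echo waiting for master; sleep 2; done
  containers:
    - name: pytorch
      image: my-pytorch-image           # assumed training image
```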

Any other ideas? I will work on this once the discussion is settled.

zlcnju commented 5 years ago

@johnugeorge

johnugeorge commented 5 years ago

@zlcnju Thanks for finding this issue.

Currently, the PyTorch operator does not create the worker pods until the master pod is running. However, kube-batch does not schedule anything until all pods in the gang have been created. So when gang scheduling is used, the master pod stays in the Pending state forever.

A quick fix is to exclude the master pod from the batch. That way the master is created first, and the PodGroup and the worker pods are created only after the master pod is running. With this solution there is a possibility that the master pods consume all of the resources.

Another solution is to add an init container to the worker pods. In this solution, all pods (master and workers) are created together, but each worker's init container waits until the master pod is ready. This solution is better, but it brings in an extra init container.

Instead of the worker pods waiting to be created until the master pod is running, you propose to create the worker pods together with the master pod and have them wait until the master is running (via init containers that check master pod readiness). Did I get it right?

Any other ideas? I will work on this once the discussion is settled.

Solution 2 looks OK to me. I would wait for @gaocegege's and @richardsliu's comments.

johnugeorge commented 5 years ago

One other quick fix would be to skip the wait for the master when gang scheduling is enabled, so that workers are created immediately. (https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/pod.go#L60)

gaocegege commented 5 years ago

Thanks for your issue

the PyTorch operator does not create the worker pods until the master pod is running.

Is it a limitation of PyTorch?

johnugeorge commented 5 years ago

Thanks for your issue

the PyTorch operator does not create the worker pods until the master pod is running.

Is it a limitation of PyTorch?

Yes. PyTorch workers will crash if the master is not up. Reference: https://github.com/kubeflow/pytorch-operator/issues/125

zlcnju commented 5 years ago

One other quick fix would be to skip the wait for the master when gang scheduling is enabled, so that workers are created immediately. (https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/pod.go#L60)

If we skip the wait when gang scheduling is enabled, the startup order is no longer guaranteed. The quick fix I mean is to exclude the master from the batch and not add the kube-batch annotation to the master pod (see the sketch below): https://github.com/kubeflow/pytorch-operator/blob/6aa39a411eb9fe5e62b54b0a3cf09225abe7d551/pkg/controller.v1/pytorch/controller.go#L440
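To make the "exclude the master from the batch" idea concrete, here is a hedged sketch. It assumes the scheduling.k8s.io/group-name annotation that kube-batch typically reads, and the pod and PodGroup names are made up; the real pods are generated by the operator:

```yaml
# Master pod: no PodGroup annotation, so it is scheduled on its own and can
# start immediately, independent of the gang.
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-job-master-0            # assumed master pod name
spec:
  containers:
    - name: pytorch
      image: my-pytorch-image            # assumed training image
---
# Worker pod: carries the PodGroup annotation, so kube-batch gang-schedules
# all workers together once they have been created.
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-job-worker-0             # assumed worker pod name
  annotations:
    scheduling.k8s.io/group-name: pytorch-job
spec:
  schedulerName: kube-batch
  containers:
    - name: pytorch
      image: my-pytorch-image            # assumed training image
```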

zlcnju commented 5 years ago

Instead of the worker pods waiting to be created until the master pod is running, you propose to create the worker pods together with the master pod and have them wait until the master is running (via init containers that check master pod readiness). Did I get it right?

@johnugeorge Exactly, you got it right.

gaocegege commented 5 years ago

The quick fix I mean is to exclude the master from the batch and not add the kube-batch annotation to the master pod.

Then how can we get the benefit from gang scheduling?

gaocegege commented 5 years ago

Solution 2 looks ok to me.

johnugeorge commented 5 years ago

One other quick fix would be to skip the wait for the master when gang scheduling is enabled, so that workers are created immediately. (https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/pod.go#L60)

If we skip the wait when gang scheduling is enabled, the startup order is no longer guaranteed.

What do you mean by "the order is not guaranteed"? I meant that if gang scheduling is enabled, we don't do the explicit check that the master is running before creating the worker pods, i.e. we start all pods immediately (like the initial implementation).

The quick fix I mean is to exclude the master from the batch and not add the kube-batch annotation to the master pod.

https://github.com/kubeflow/pytorch-operator/blob/6aa39a411eb9fe5e62b54b0a3cf09225abe7d551/pkg/controller.v1/pytorch/controller.go#L440

I think this would be a hack. Solution 2 is the better approach.

zlcnju commented 5 years ago

The quick fix I mean is to exclude the master from the batch and not add the kube-batch annotation to the master pod.

Then how can we get the benefit from gang scheduling?

This way, only the worker pods get the benefit of gang scheduling.

zlcnju commented 5 years ago

What do you mean by "the order is not guaranteed"? I meant that if gang scheduling is enabled, we don't do the explicit check that the master is running before creating the worker pods, i.e. we start all pods immediately (like the initial implementation).

Whether or not gang scheduling is enabled, the worker pods should only be created after the master pod is running.

zlcnju commented 5 years ago

If there are no objections, I will work on solution 2.

johnugeorge commented 5 years ago

Sure.

YesterdayxD commented 5 years ago

@zlcnju I have the same problem, but I don't understand your solution 2. Can you describe what the init container does? How do I write my YAML?

zlcnju commented 5 years ago

@zlcnju I have the same problem, but I don't understand your solution 2. Can you describe what the init container does? How do I write my YAML?

In solution 2, all pods start up at the same time, and the ordering between master and workers is ensured by the init container. With the PR, your YAML stays the same. If you want a custom init container, just modify the pytorch-operator ConfigMap; a rough sketch of the kind of check involved is below.
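For reference, this is a hypothetical template fragment for such a custom init container. The image, the MASTER_ADDR variable (assumed here to be injected by the operator), and the way the template is keyed inside the pytorch-operator ConfigMap are assumptions to adapt to your cluster:

```yaml
# Hypothetical init-container template fragment; adapt names and keys to the
# actual pytorch-operator ConfigMap format in your cluster.
initContainers:
  - name: init-pytorch
    image: alpine:3.10                  # assumed image
    command:
      - sh
      - -c
      # Keep polling until the master address (assumed to be available as
      # MASTER_ADDR) resolves, then let the worker's main container start.
      - until getent hosts "$MASTER_ADDR"; do echo waiting for master; sleep 2; done
```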