Open bharathappali opened 6 months ago
Is the master not up? random-exp-jw6qxmrm-master-0 doesn't resolve.
Yes, the master pod is not getting scheduled. The workers fail during init and show CrashLoopBackOff.
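To confirm the symptom described above, a minimal sketch like the following could be run from a pod in the same namespace. It only checks whether the master hostname quoted in this thread resolves through DNS. The helper name `resolves` and its `lookup` parameter are illustrative assumptions, not part of the training operator or Katib.

```python
import socket


def resolves(hostname, lookup=socket.getaddrinfo):
    """Return True if `hostname` resolves to at least one address.

    `lookup` is injectable (hypothetical parameter) so the check can be
    exercised without real DNS.
    """
    try:
        return len(lookup(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


if __name__ == "__main__":
    # Master hostname quoted in this thread; it is only resolvable from
    # inside the cluster, in the namespace where the job runs.
    host = "random-exp-jw6qxmrm-master-0"
    print(host, "resolves" if resolves(host) else "does not resolve")
```

If the name does not resolve, that is consistent with the master pod never having been created, since the hostname only exists once the master pod and its service do.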
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@bharathappali Sorry for the late reply. Can you try creating your PyTorchJob without a Katib Experiment?
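As a rough sketch of what a standalone test could look like, the snippet below builds a minimal PyTorchJob manifest with one master and one worker and prints it as JSON. The job name `standalone-test` and the image `registry.example.com/pytorch-mnist:latest` are placeholders and assumptions, not values from this issue; substitute the image used in the Experiment's trial template.

```python
import json


def minimal_pytorchjob(name, image, workers=1):
    """Build a minimal PyTorchJob manifest as a plain dict."""

    def replica(count):
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    # The training container in a PyTorchJob is named "pytorch".
                    "containers": [{"name": "pytorch", "image": image}]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }


if __name__ == "__main__":
    # Name and image are hypothetical placeholders.
    job = minimal_pytorchjob(
        "standalone-test", "registry.example.com/pytorch-mnist:latest"
    )
    print(json.dumps(job, indent=2))
```

The JSON output can be piped to `kubectl apply -f -`. If the master pod starts in this case, the problem is more likely in how the Experiment generates the job than in the training operator itself.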
I'm trying to run the Training Operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods are created, but for some reason the master pod never starts.
Here is the events log of the worker pod:
I changed the init container image because of Docker pull rate limits.
Here is the pod log:
Here is the PyTorch experiment I'm deploying: