kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Evaluate usage of upstream Job API #1303

Open alculquicondor opened 3 years ago

alculquicondor commented 3 years ago

Currently, the training operators manage plain Pods. Managing plain Pods is arguably difficult, as exemplified by this long-standing upstream issue: kubernetes/kubernetes#28486

For this reason, Kubernetes provides a series of higher-level workload APIs such as Deployment, StatefulSet, DaemonSet, and Job. Controllers are encouraged to create these workloads instead of plain Pods. However, the existing workload APIs have feature gaps when it comes to training.

A recent advance in the Job API might change this situation: Indexed Jobs give each Pod in a Job an associated index, exposed as an annotation. The index is also part of the Pod name and hostname, making it easier to address each Pod. See https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode and https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job#design-details
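
For reference, a minimal sketch of what creating such an Indexed Job could look like with client-go (the job name, image, worker count, and kubeconfig wiring here are purely illustrative, and this assumes a cluster recent enough to support `completionMode: Indexed`):

```go
package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Illustrative client setup from the local kubeconfig; a controller would
	// normally use in-cluster config instead.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	indexed := batchv1.IndexedCompletion
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "tf-workers"}, // hypothetical name
		Spec: batchv1.JobSpec{
			CompletionMode: &indexed,
			Completions:    int32Ptr(4), // worker-0 .. worker-3
			Parallelism:    int32Ptr(4),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "busybox",
						// Each Pod receives its index through the
						// batch.kubernetes.io/job-completion-index annotation and
						// the JOB_COMPLETION_INDEX env var; its hostname takes the
						// form <job-name>-<index>.
						Command: []string{"sh", "-c", "echo running as index $JOB_COMPLETION_INDEX"},
					}},
				},
			},
		},
	}

	created, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created job", created.Name)
}
```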

I would like to open a discussion about which other features the Job API is missing for the kubeflow operators to delegate Pod management to it. One feature I can think of is the ability to resize the number of workers of an Indexed Job; this should be possible in 1.23. Which other features do you see missing?

gaocegege commented 3 years ago

Hi @alculquicondor, we discussed this before. There are some features that force us to manage Pods directly. If we can implement them with Jobs, I think we should definitely use Job.

For example, restart behavior based on the specific exit code:

ExitCode means the restart behavior is dependent on the exit code of the tensorflow container as follows:

- Exit code 0 indicates the process completed successfully and will not be restarted.
- The following exit codes indicate a permanent error and the container will not be restarted:
  - 1: general errors
  - 2: misuse of shell builtins
  - 126: command invoked cannot execute
  - 127: command not found
  - 128: invalid argument to exit
  - 139: container terminated by SIGSEGV (invalid memory reference)
- The following exit codes indicate a retryable error and the container will be restarted:
  - 130: container terminated by SIGINT (keyboard Control-C)
  - 137: container received a SIGKILL
  - 143: container received a SIGTERM
- Exit code 138 corresponds to SIGUSR1 and is reserved for user-specified retryable errors.
- Other exit codes are undefined and there is no guarantee about the behavior.

For background information on exit codes, see the GNU guide to termination signals and the Linux Documentation Project.
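
As a sketch only, the classification quoted above maps to a small helper like the following (the package, type, and function names are illustrative, not the actual tf-operator code):

```go
package restartpolicy

// RestartDecision mirrors the ExitCode restart policy quoted above.
type RestartDecision int

const (
	NoRestart RestartDecision = iota // success or permanent error
	Restart                          // retryable error
	Undefined                        // behavior is not guaranteed
)

// Classify maps a container exit code to a restart decision.
func Classify(exitCode int32) RestartDecision {
	switch exitCode {
	case 0:
		return NoRestart // completed successfully
	case 1, 2, 126, 127, 128, 139:
		return NoRestart // permanent errors: do not restart
	case 130, 137, 143:
		return Restart // SIGINT, SIGKILL, SIGTERM: retryable
	case 138:
		return Restart // SIGUSR1: reserved for user-specified retryable errors
	default:
		return Undefined
	}
}
```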

alculquicondor commented 3 years ago

Thanks @gaocegege. This is the kind of information I'm looking for. Anything else you can think of? Do you agree that the feature I mentioned is also needed?

alculquicondor commented 3 years ago

Also, please clarify how OnExitCode is implemented. Plain Pods don't support that, so you can't support it at the container level, correct? This is only valid for Pod recreation.

gaocegege commented 3 years ago

Also, please clarify how OnExitCode is implemented. Plain Pods don't support that, so you can't support it at the container level, correct? This is only valid for Pod recreation.

Yes, some dirty tricks here.

https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/pod.go#L128

gaocegege commented 3 years ago

Do you agree that the feature I mentioned is also needed?

Yeah, we need the Indexed Job to run different replicas (e.g. ps-0, ps-1, ..., ps-n). We also need:

- Policies to determine if we think the job is succeeded or failed
- Dynamic autoscaler according to the metrics (scale subresources according to HPA)

alculquicondor commented 3 years ago

Yes, some dirty tricks here.

It seems that you do a Pod deletion. I suppose the next sync would create a new Pod again. I don't think we have to support this directly in the Job API. You could do the same (inspect the exit code and delete the Pod) and let the Job controller create the replacement for it. The good thing is that the Job controller will keep track of the number of restarts even if you delete the Pod (starting in k8s 1.23)
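
A sketch of that approach, reusing the illustrative Classify helper from the earlier sketch: inspect the training container's exit code on a failed Pod and, if it is retryable, delete the Pod so that the Job controller creates a replacement. The function name and client wiring are assumptions, not the actual operator code.

```go
package restartpolicy

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// MaybeRecreate deletes a failed Pod whose training container exited with a
// retryable code; the owning Job controller then creates a replacement Pod
// for the same completion index on its next sync.
func MaybeRecreate(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, containerName string) error {
	if pod.Status.Phase != corev1.PodFailed {
		return nil
	}
	for _, s := range pod.Status.ContainerStatuses {
		if s.Name != containerName || s.State.Terminated == nil {
			continue
		}
		if Classify(s.State.Terminated.ExitCode) == Restart {
			return client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
		}
	}
	return nil
}
```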

Policies to determine if we think the job is succeeded or failed

How is this different from all the Pods marked as Succeeded?

Dynamic autoscaler according to the metrics (scale subresources according to HPA)

What are the rules for downscaling? Do you always remove the pods with the highest indexes first?

alculquicondor commented 3 years ago

I don't think we have to support this directly in the Job API.

Actually, we might need to do something. The Job controller will always create a new Pod for an index when the previous one fails (up to the configured number of retries). So we either need to restrict recreation altogether or implement the ExitCode approach.

gaocegege commented 3 years ago

It seems that you do a Pod deletion. I suppose the next sync would create a new Pod again. I don't think we have to support this directly in the Job API. You could do the same (inspect the exit code and delete the Pod) and let the Job controller create the replacement for it. The good thing is that the Job controller will keep track of the number of restarts even if you delete the Pod (starting in k8s 1.23)

We decide whether to create a new Pod according to the exit code. Consider this case: if the exit code is 0, we will not restart it.

I am not sure if we can or should do this with Job or Indexed Job.

gaocegege commented 3 years ago

How is this different from all the Pods marked as Succeeded?

For example, in some cases, the job should be considered succeeded if 80% of the workers have succeeded.

gaocegege commented 3 years ago

What are the rules for downscaling? Do you always remove the pods with the highest indexes first?

It depends. It's still being designed.

gaocegege commented 3 years ago

Hi, can you please provide the code or the docs for Indexed Job? I am not quite familiar with it.

alculquicondor commented 3 years ago

We decide whether to create a new Pod according to the exit code. Consider this case: if the exit code is 0, we will not restart it.

In this case, the Pod is considered Succeeded, so it wouldn't be recreated. But any other exit code causes a recreation, until the maximum number of retries is reached. We would need to change this to either pay attention to other exit codes or to restrict recreation to other signals, such as the Pod being deleted (although this is not reliable, because the garbage collector might remove the Pod).

For example, in some cases, the job should be considered succeeded if 80% of the workers have succeeded.

That is interesting. Can you elaborate on the reasons for this? What if one of the workers is still writing something? If there is documentation on this, please point me to it.

Hi, can you please provide the code or the docs for Indexed Job? I am not quite familiar with it.

The official documentation is here https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode (but it's not updated for the changes in 1.22 yet). The design doc is here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job

gaocegege commented 3 years ago

That is interesting. Can you elaborate on the reasons for this? What if one of the workers is still writing something? If there is documentation on this, please point me to it.

When the model training replicas update weights asynchronously, we can consider the job succeeded if x% of the workers have succeeded. It is related to domain-specific logic, I think.

gaocegege commented 3 years ago

The design doc is here: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/2214-indexed-job

Cool, when will it be released in K8s?

alculquicondor commented 3 years ago

Indexed Job is graduating to beta in k8s 1.22.

gaocegege commented 3 years ago

OK, I will look through the code to see if we can use it. Thanks.

Note to self: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go

alculquicondor commented 3 years ago

I'm not sure how useful it would be for you to look at the code. The API is well documented at https://kubernetes.io/docs/concepts/workloads/controllers/job and https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/job-v1/

As already discussed, there are some features that we would need to add for the API to be useful in training. That's what I want to focus this thread on.

gaocegege commented 3 years ago

OK, thanks.

ahg-g commented 3 years ago

When the model training replicas update weights asynchronously, we can consider the job succeeded if x% of the workers have succeeded. It is related to domain-specific logic, I think.

For example, in some cases, the job should be considered succeeded if 80% of the workers have succeeded.

Is this an optimization to accelerate processing or a fundamental functional requirement (e.g., are there cases where some pods get stuck and never complete and so you are trying to have more replicas to ensure that enough of them complete)?

alculquicondor commented 3 years ago

An update from the last k8s SIG Apps meeting: in general, the community is on board with adding features to the Job API to enable more use cases. In particular, they don't have major concerns about the following features:

Now, this is not a commitment: both features need someone to work on them, go through API reviews, and so on. If the kubeflow community is on board, it would be great if someone could get involved in those tasks. This will help ensure that the features fully satisfy kubeflow's requirements.

The third feature mentioned here (success policies based on percentage of success) was flagged as needing more details. I noticed that the current API has 2 modes: declare succeeded if worker-0 finished or declare succeeded if all workers finished. Could you confirm if I understood this correctly? If so, what is special about worker-0? What makes users choose one mode or the other?

gaocegege commented 3 years ago

Now, this is not a commitment: both features need someone to work on them, go through API reviews, and so on. If the kubeflow community is on board, it would be great if someone could get involved in those tasks. This will help ensure that the features fully satisfy kubeflow's requirements.

Sure, I think we can discuss the details further and, of course, we can get involved in the tasks. 🍻

gaocegege commented 3 years ago

When the model training replicas update weights asynchronously, we can consider the job succeeded if x% of the workers have succeeded. It is related to domain-specific logic, I think.

For example, in some cases, the job should be considered succeeded if 80% of the workers have succeeded.

Is this an optimization to accelerate processing or a fundamental functional requirement (e.g., are there cases where some pods get stuck and never complete and so you are trying to have more replicas to ensure that enough of them complete)?

It's sort of an optimization.

gaocegege commented 3 years ago

The third feature mentioned here (success policies based on percentage of success) was flagged as needing more details. I noticed that the current API has 2 modes: declare succeeded if worker-0 finished or declare succeeded if all workers finished. Could you confirm if I understood this correctly? If so, what is special about worker-0? What makes users choose one mode or the other?

Right now, only the two policies you listed are supported. worker-0 is the master worker; in some cases, we consider the job succeeded if the master worker has succeeded.

Users should choose the right policy according to their training mode.
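
To make the modes concrete, here is a sketch of how a controller might evaluate a success policy against per-index results; the policy names and the fraction parameter are illustrative, not an existing API.

```go
package restartpolicy

// SuccessPolicy sketches the policies discussed above: the job succeeds when
// the master (worker-0) succeeds, when all workers succeed, or (proposed)
// when a fraction of the workers succeed.
type SuccessPolicy string

const (
	SuccessPolicyMaster   SuccessPolicy = "MasterSucceeded" // worker-0 only
	SuccessPolicyAll      SuccessPolicy = "AllWorkers"
	SuccessPolicyFraction SuccessPolicy = "Fraction" // proposed, e.g. 80%
)

// JobSucceeded reports whether the job is considered successful.
// succeeded[i] is true if worker-i finished successfully.
func JobSucceeded(policy SuccessPolicy, succeeded []bool, fraction float64) bool {
	count := 0
	for _, ok := range succeeded {
		if ok {
			count++
		}
	}
	switch policy {
	case SuccessPolicyMaster:
		return len(succeeded) > 0 && succeeded[0]
	case SuccessPolicyAll:
		return count == len(succeeded)
	case SuccessPolicyFraction:
		return float64(count) >= fraction*float64(len(succeeded))
	default:
		return false
	}
}
```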

gaocegege commented 3 years ago

As you know, the feature is alpha in 1.21, but we also need to support 1.14-1.18, so I think the process will be long.

Maybe we can implement a PoC of using Indexed Job. At the same time, we can try to contribute some features that we care about to the upstream Job controller.

WDYT?

alculquicondor commented 3 years ago

I agree that the process will take a long time. If we were to implement the remaining features now, they would only be available as beta in 1.24, and even that is optimistic. It's still good to have this as a long-term goal.

Maybe we can implement a PoC of using Indexed Job. At the same time, we can try to contribute some features that we care about to the upstream Job controller.

Yes to both questions. I can't promise to get to those now, as my priority is the mpi-operator. But if there is a contributor willing to get their hands on this, I can assist.

ahg-g commented 3 years ago

When the model training replicas update weights asynchronously, we can consider the job succeeded if x% of the workers have succeeded. It is related to domain-specific logic, I think.

For example, in some cases, the job should be considered succeeded if 80% of the workers have succeeded.

Is this an optimization to accelerate processing or a fundamental functional requirement (e.g., are there cases where some pods get stuck and never complete and so you are trying to have more replicas to ensure that enough of them complete)?

It's sort of an optimization.

Any idea how critical this optimization is?

It is great that you are interested in contributing to upstream k8s; I think as a batch community in the k8s ecosystem it is in everyone's best interest to try and continuously improve the core Jobs API; we understand that the process is sometimes long, but that is a price we pay for production readiness and long term support.

gaocegege commented 3 years ago

Any idea how critical this optimization is?

It is a nice-to-have optimization. I do not think it is a blocker.

gaocegege commented 3 years ago

It is great that you are interested in contributing to upstream k8s; I think as a batch community in the k8s ecosystem it is in everyone's best interest to try and continuously improve the core Jobs API; we understand that the process is sometimes long, but that is a price we pay for production readiness and long term support.

Personally, I also suggest using the upstream Job API if it works, because managing Pods directly is hard.

alculquicondor commented 3 years ago

Created a proposal in k8s to count the number of Ready Pods (a stronger condition than Running): kubernetes/enhancements#2880

alculquicondor commented 3 years ago

@gaocegege would you be willing to work on a KEP to handle different exit codes? Or perhaps you could recommend another member of this community to do so. I can provide guidance.

You would have to document the expected behavior when setting failure exit codes. Some of the design you did for tf-operator could probably be reused directly. Note that the exit codes would have to be configurable, because other operators might use different ones for non-retriable failures. Another good point of reference is this old k8s discussion on the topic: https://github.com/kubernetes/kubernetes/issues/17244.
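
Purely as a strawman for that discussion (this is not an existing Kubernetes or kubeflow API), configurable exit-code handling could be expressed roughly as:

```go
package restartpolicy

// ExitCodePolicy is a hypothetical API shape sketched only to anchor the KEP
// discussion above; it is not part of batch/v1.
type ExitCodePolicy struct {
	// ContainerName selects which container's exit code is inspected.
	ContainerName string `json:"containerName,omitempty"`
	// NonRetriableExitCodes marks the failure as permanent: the Pod is not
	// recreated and the failure counts toward the Job's failure condition.
	NonRetriableExitCodes []int32 `json:"nonRetriableExitCodes,omitempty"`
	// RetriableExitCodes always leads to the Pod being recreated, without
	// counting against backoffLimit.
	RetriableExitCodes []int32 `json:"retriableExitCodes,omitempty"`
}
```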

gaocegege commented 3 years ago

Sure, I can help with it. Besides this and your KEP about the Ready status, are there any other missing features we need to implement?

alculquicondor commented 3 years ago

There are 2 more features, but they need further clarification of use cases, expected behavior and API:

I can also assist with them, but I admit that I don't understand the requirements well enough.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ahg-g commented 2 years ago

/remove-lifecycle stale

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y commented 1 year ago

/lifecycle frozen