kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Work with DeepSpeed for large scale training #611

Open kuizhiqing opened 6 months ago

kuizhiqing commented 6 months ago

DeepSpeed is an excellent framework for training LLMs on a large scale, while the mpi-operator is the ideal tool to facilitate this within the Kubernetes ecosystem.

I'm planning to submit a series of PRs to make this project more ready-to-use for very large scale training with the DeepSpeed/mpi-style training framework.

The upcoming features may include the following modifications:

Support for IP-style hostfile: this is for performance efficiency and to prevent the environment variable length from exceeding its limit when using svc names, for users who want to wrap the host list into an environment variable.

Support for fault tolerance and elasticity: this is quasi-fault tolerance, since NCCL communication must always be recreated when an error occurs. It is still worth implementing, though, because recreating pods can be costly at very large scale.

Configuration decoupling: some requirements are currently left to the Docker image maker to handle, such as ssh_config and sshd_config. Perhaps the operator could manage all of these.

There are also some minor changes under consideration. Please feel free to share your thoughts on this topic.

alculquicondor commented 6 months ago

cc @aojea @vsoch

alculquicondor commented 6 months ago

Support for IP-style hostfile

How do you plan to obtain the IPs? The current model uses a ConfigMap for the hostfile, which would have pretty decent size limits. Also, it's all transparent to the user.

kuizhiqing commented 6 months ago

Support for IP-style hostfile

How do you plan to obtain the IPs? The current model uses a ConfigMap for the hostfile, which would have pretty decent size limits. Also, it's all transparent to the user.

The idea is to update the hostfile with IPs after all containers (not just pods) are ready.

Files have no length limit, but environment variables do, AFAIK. One may want to convert the node list into env vars or args for execution, and using svc names may exceed the length limit at very large scale.

There are two main reasons to do this: for very large scale jobs, svc is not performance efficient, and it may not be stable enough since it depends on CoreDNS or alternatives.

BTW, a ConfigMap for the hostfile is a good idea since it supports runtime updates, which makes quasi-elasticity possible.
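
Roughly, the controller-side flow I have in mind looks like this (just a sketch: the label selector, slots value, and helper names are illustrative, not the actual mpi-operator code):

```go
// Sketch of the proposal: once every worker container is ready, render an
// IP-style hostfile instead of DNS names. The label selector, slots value,
// and helper names here are illustrative, not mpi-operator internals.
package hostfile

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// BuildIPHostfile lists the worker pods and returns "<pod IP> slots=N" lines,
// erroring out until all containers (not just pods) report ready.
func BuildIPHostfile(ctx context.Context, c kubernetes.Interface, ns, selector string, slots int) (string, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return "", err
	}
	var lines []string
	for i := range pods.Items {
		p := &pods.Items[i]
		if !allContainersReady(p) || p.Status.PodIP == "" {
			return "", fmt.Errorf("worker %s is not ready yet", p.Name)
		}
		lines = append(lines, fmt.Sprintf("%s slots=%d", p.Status.PodIP, slots))
	}
	return strings.Join(lines, "\n") + "\n", nil
}

func allContainersReady(p *corev1.Pod) bool {
	if len(p.Status.ContainerStatuses) == 0 {
		return false
	}
	for _, cs := range p.Status.ContainerStatuses {
		if !cs.Ready {
			return false
		}
	}
	return true
}
```

The ConfigMap holding the hostfile would then be updated whenever a worker is replaced, which is what makes the quasi-elastic case possible.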

vsoch commented 6 months ago

That was a design I chose for the Flux Operator before switching to a headless service; it would be interesting to see how it works for you! I can point you to old code if you want an example of how I did that - it was basically designed exactly as you stated.

I am not familiar with DeepSpeed, but is ssh bootstrap the best / only way available?

aojea commented 6 months ago

The current model uses a ConfigMap for the hostfile

Does this mean it creates a ConfigMap with the /etc/hosts content and mounts it as a volume at /etc/hosts?

Files have no length limit, but environment variables do, AFAIK. One may want to convert the node list into env vars or args for execution, and using svc names may exceed the length limit at very large scale.

What environment variable limitation and length limit are you referring to here?

With a Headless service each pod has a well known record https://github.com/kubernetes/dns/blob/master/docs/specification.md#241---aaaaa-records
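
That is, per the spec, each worker is reachable at a well-known name of roughly this form (assuming the default cluster domain):

```
<pod-hostname>.<service-name>.<namespace>.svc.cluster.local  ->  Pod IP
```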

alculquicondor commented 6 months ago

Ok, just to get rid of unnecessary feature requests, let's forget about passing a hostfile via an environment variable.

There are two main reasons to do this: for very large scale jobs, svc is not performance efficient, and it may not be stable enough since it depends on CoreDNS or alternatives.

But instead, you will have to wait for all worker pods to be ready before being able to create the launcher Pod. This means that you will have to wait for container images to download. That time will probably be more significant than DNS name resolution.

Also, if anything happens to a worker Pod, then you will have to start over and recreate the hostfile.

BTW, a ConfigMap for the hostfile is a good idea since it supports runtime updates, which makes quasi-elasticity possible.

How does elasticity work in deepspeed? Is it similar to horovod?

Does this mean it creates a ConfigMap with the /etc/hosts content and mounts it as a volume at /etc/hosts?

No, a hostfile is a requirement from MPI (https://www.open-mpi.org/faq/?category=running#mpirun-hostfile). The mpi-operator populates it using the domain name for each Pod. If we were to wait for the pods to be created, we could put the Pod IPs instead, which is what the OP is proposing. That sounds like we would be re-implementing DNS.
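
For illustration (the job and namespace names below are made up, and the exact name suffix depends on the operator version), today's hostfile entries look roughly like the first two lines, and the proposal would swap in pod IPs as in the last two:

```
# current: hostnames resolved through the workers' headless service
myjob-worker-0.myjob-worker.default.svc slots=8
myjob-worker-1.myjob-worker.default.svc slots=8
# proposed: plain pod IPs, written once all workers are ready
10.0.1.17 slots=8
10.0.2.23 slots=8
```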

vsoch commented 6 months ago

Also, if anything happens to a worker Pod, then you will have to start over and recreate the hostfile.

This was the main point that made me pause on the design. However, I do think the length of the hostnames is an issue, especially when you get into indexed Jobs. With a regular Job, it's usually fine with a reasonable (and somewhat short) service and job name. However, when you go into JobSet (if you do), I found that I had to make everything tiny - the service name often 2 letters, and the different jobs also indicated by letters. The design of how the JobSet members map to the names didn't allow for the same lengths as a regular Job.

When we were debugging networking issues, at least for small numbers of pods I did compare writing the hostfile to using the headless service, and the first didn't help much - the means / quartiles were very much overlapping. https://github.com/kubernetes/kubernetes/issues/117819#issuecomment-1550800170

kuizhiqing commented 6 months ago

@vsoch Great, thx. I've already completed the implementation and put those features into production; what I'm wondering about is merging some of those features back here.

kuizhiqing commented 6 months ago

Also, if anything happens to a worker Pod, then you will have to start over and recreate the hostfile.

This was the main point that made me pause on the design. However, I do think the length of the hostnames is an issue, especially when you get into indexed Jobs. With a regular Job, it's usually fine with a reasonable (and somewhat short) service and job name. However, when you go into JobSet (if you do), I found that I had to make everything tiny - the service name often 2 letters, and the different jobs also indicated by letters. The design of how the JobSet members map to the names didn't allow for the same lengths as a regular Job.

When we were debugging networking issues, at least for small numbers of pods I did compare writing the hostfile to using the headless service, and the first didn't help much - the means / quartiles were very much overlapping. kubernetes/kubernetes#117819 (comment)

Well, since NCCL does not support fault tolerance, perfect recovery is impossible in the GPU training scenario. What we can do is replace one pod at a time without recreating hundreds or thousands of pods. In my current design, the failed node triggers an internal restart and the pod's IP is updated in the hostfile.

alculquicondor commented 6 months ago

If the hostnames are too long, maybe we can make this optional https://github.com/kubeflow/mpi-operator/issues/453 (via a command line flag, for example). In most scenarios, we don't actually need the service name.

kuizhiqing commented 6 months ago

When I say env length limit, I'm referring to https://informix.hcldoc.com/14.10/help/index.jsp?topic=%2Fcom.ibm.sqlr.doc%2Fids_sqr_405.htm.

It happens when users run `mpirun --host $hosts` or `pdsh -w $hosts` or something similar.

kuizhiqing commented 6 months ago

Well, I will hold off on the PR process and welcome more discussion on the feature requirements.

More ideas are really appreciated, @alculquicondor @vsoch @aojea @willb.

Thx

alculquicondor commented 6 months ago

It happens when users run `mpirun --host $hosts` or `pdsh -w $hosts` or something similar.

We shouldn't support that. It feels unnecessary.

alculquicondor commented 6 months ago

Regarding using IPs instead of DNS, I would like to hear from @aojea, as TL for Kubernetes SIG Network.

My gut feeling is that improving DNS building and lookup should be done by k8s or providers, and we shouldn't re-implement it in mpi-operator.

tenzen-y commented 6 months ago

Support for IP-style hostfile: this is for performance efficiency and to prevent the environment variable length from exceeding its limit when using svc names, for users who want to wrap the host list into an environment variable.

I'm wondering if we should support this feature since, as other folks say, it sounds like re-implementing DNS.

Support for fault tolerance and elasticity: this is quasi-fault tolerance, since NCCL communication must always be recreated when an error occurs. It is still worth implementing, though, because recreating pods can be costly at very large scale.

I believe that restarting the training process and reconnecting to the other workers are the application side's responsibility. Also, what are the costs of recreating pods at very large scale? Maybe we should get rid of those costs instead of implementing retry logic that depends on DeepSpeed.

Configuration decoupling: some requirements are currently left to the Docker image maker to handle, such as ssh_config and sshd_config. Perhaps the operator could manage all of these.

If it is possible, it would be great.

aojea commented 6 months ago

Regarding using IPs instead of DNS, I would like to hear from @aojea, as TL for Kubernetes SIG Network.

Do you have more details on how the workloads are deployed?

I do think that using hostnames is the way to go, as it is more human friendly, and a headless service solves this problem since each Pod gets a resolvable name. If we just want to avoid the DNS problems, we can always add a simple sidecar or something that adds the hostname IP entries to the /etc/hosts file in the Pods.
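
Something along these lines would do (a rough sketch only; the worker names and polling interval below are placeholders, not a proposed implementation):

```go
// Rough sketch of the sidecar idea: resolve each worker hostname through the
// headless service and append the result to /etc/hosts, so later lookups in
// this pod no longer depend on DNS. Worker names and the polling interval
// are placeholders.
package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"time"
)

func main() {
	// Normally these would come from the job spec or the downward API.
	workers := []string{
		"myjob-worker-0.myjob-worker",
		"myjob-worker-1.myjob-worker",
	}

	f, err := os.OpenFile("/etc/hosts", os.O_APPEND|os.O_WRONLY, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	for _, host := range workers {
		// Retry until the headless service publishes the record.
		var ips []net.IP
		for {
			ips, err = net.LookupIP(host)
			if err == nil && len(ips) > 0 {
				break
			}
			time.Sleep(2 * time.Second)
		}
		if _, err := fmt.Fprintf(f, "%s %s\n", ips[0].String(), host); err != nil {
			log.Fatal(err)
		}
	}
}
```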

vsoch commented 6 months ago

we can always add a simple sidecar or something that adds the hostname IP entries to the /etc/hosts file in the Pods

That's a really cool idea! I love sidecars (and init containers), so much you can do with them.

@kuizhiqing if you wind up trying that out, can you point me to it so I can take a look? If not, I'd like to try prototyping it - it seems useful for other cases (I'm thinking of JobSet, where it's more common to go over the limit).

Syulin7 commented 6 months ago

Support for fault tolerance and elasticity: this is quasi-fault tolerance, since NCCL communication must always be recreated when an error occurs. It is still worth implementing, though, because recreating pods can be costly at very large scale.

Does DeepSpeed support fault tolerance and elasticity without recreating pods or reloading checkpoints?

kuizhiqing commented 6 months ago

if you wind up trying that out, can you point me to it so I can take a look? If not, I'd like to try prototyping it - it seems useful for other cases (I'm thinking of JobSet, where it's more common to go over the limit).

Thanks for those great ideas. I've adopted the IP-based solution, which works well in my case.

kuizhiqing commented 6 months ago

@Syulin7 @tenzen-y Let me try to explain it like this: consider a GPU training job with thousands of pods. GPU device failures happen frequently during the training process, they cannot be recovered by restarting the process, and the physical machine has to be replaced with another one. In this scenario, resubmitting the job works, but it is not efficient compared to recreating one pod and updating the host info in the hostfile, after which the DeepSpeed user can restart the job internally.

It makes sense that this may be out of the scope of cloud-native design, especially in mpi-operator, but it is indeed a strong requirement in practice.

vsoch commented 6 months ago

This would be handled well by the Flux Operator, which uses ZeroMQ to bootstrap. If a pod (follower broker) goes down, Flux sees the node as down, and we can schedule to another node (possibly a newly added one, which would join the cluster and then be seen as going from down to up). Nodes going up and down happens all the time in HPC, so our workload managers are used to handling that, and with the Flux Operator you essentially get your own scheduler within the indexed Job. We do, however, use a headless service and not the hostfile.

If you are interested and have a dummy example that shows commands for launching different stuff I would be happy to prototype something to think about.

alculquicondor commented 6 months ago

In this scenario, resubmitting the job works, but it is not efficient compared to recreating one pod and updating the host info in the hostfile

That's exactly why using DNS solves the problem. If there is an error, and we need to retry, there is no need to recreate the hostfile. mpi-operator would create a new Pod and the DNS provider would be updated to point to the new Pod IP. flux-operator is doing the same.

From @aojea:

Do you have more details on how the workloads are deployed?

Let me try to summarize what the operator does:

  • create a configMap that contains the hostfile, where each line has the hostname of the worker pods.
  • create the worker pods.
  • create a launcher Job, mounting the configmap above.

When the launcher starts, it reads the hostfile and starts SSH connections to each of them.

aojea commented 6 months ago
  • create a launcher Job, mounting the configmap above

So the only optimization I can see is to append the pod/IP pairs to the existing /etc/hosts file in the Pods, so the pods can resolve them directly. But DNS should work; is it not working?

alculquicondor commented 6 months ago

DNS works, but some DNS providers have a limit on the number of pods. Or they might have non-negligible latency at scale.

So the only optimization I can see is to append the pod/IP pairs to the existing /etc/hosts file in the Pods

But then we need to wait for the worker Pods to be created before being able to start the launcher (which is the one that needs the IPs). That's almost the same as just using IPs in the hostfile, instead of today's hostnames.

kuizhiqing commented 6 months ago

Yes, the DNS situation is clear now. I'd conclude that the IP-based hostfile is not a feature worth merging; let's leave it to those who need it in their own implementations.

@vsoch I find your flux-operator project interesting; let me come back to you once I have more details.

vsoch commented 6 months ago

Sure! And make sure to look at the refactor branch - I fixed a ton of design flaws and it's much better, but I'm waiting on a paper (and likely some more testing) before merging that in. https://github.com/flux-framework/flux-operator/issues/211

vsoch commented 6 months ago

I think I had a similar issue to this:

create a configMap that contains the hostfile, where each line has the hostname of the worker pods.

That was when I was working on the metrics operator. The issue is that some variants of MPI require the actual IP address. So here is what I did:

  1. First generate the predictable hostlist (the headless service names): https://github.com/converged-computing/metrics-operator/blob/8835f152a63daf3956193ddf717a36ae1b78d6a9/pkg/metrics/launcher.go#L277C1-L291
  2. For the metrics that used such a variant, I used this hack https://github.com/converged-computing/metrics-operator/blob/8835f152a63daf3956193ddf717a36ae1b78d6a9/pkg/metrics/app/bdas.go#L81-L93

It's not great; it's basically looking them up on the fly before running. The rationale I had (aside from only a few apps needing this) is that if a pod dies in the middle of the run, it's going to throw up and fail anyway - there is no good (easy) concept of elasticity (that I know of / have tried) for MPI itself, so if you lose a pod you are out of luck anyway (and updating the host lists doesn't matter). For the use case where you are running mpirun several times (and you have changing pods), it might make sense to have some list that is self-updating, in which case (for a quick test) I'd do the equivalent janky trick just before running something.
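
A stripped-down sketch of that trick (the hostnames are made up, and this is not the actual metrics-operator code):

```go
// Sketch of the "look them up just before running" trick: resolve each
// predictable headless-service hostname to an IP and join the results for
// something like `mpirun --host ip1,ip2,...`. Hostnames are illustrative.
package main

import (
	"fmt"
	"net"
	"strings"
)

func main() {
	hostnames := []string{"metrics-n-0.metrics", "metrics-n-1.metrics"}

	ips := make([]string, 0, len(hostnames))
	for _, h := range hostnames {
		addrs, err := net.LookupIP(h)
		if err != nil || len(addrs) == 0 {
			panic(fmt.Sprintf("could not resolve %s: %v", h, err))
		}
		ips = append(ips, addrs[0].String())
	}

	// Hand this to the launcher command, e.g. mpirun --host <list>.
	fmt.Println(strings.Join(ips, ","))
}
```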

vsoch commented 6 months ago

You could also just use a different MPI flavor that will use the DNS names and call it a day :)