perf: request some benchmarks and compare them with results in native slurm

CrackedPoly commented 3 weeks ago

I wonder how much overhead does soperator introduces in ML, compared with native slurm. This is an important concern and I want to know if you have any statistics.

Some scenarios

Single machine

1 GPU training benchmark
8 GPUs distributed training benchmark (Nvlink involved)

Distributed

16 GPUs distributed training benchmark (both Nvlink and IB involved)

rdjjke commented 3 weeks ago

Hello @CrackedPoly,

Thank you for the question.

In our experience, Soperator (Slurm on Kubernetes) introduces no noticeable performance degradation compared to Slurm-over-VMs configurations.

If by “native Slurm” you’re referring to Slurm running on bare-metal servers, we haven’t run those tests directly ourselves. However, we can compare our results with other public data from cloud providers for context.

Performance testing

In all our tests, the nodes were the same:

Virtual machine instances
Each node with 8 H100 GPUs, 128 vCPU Intel Xeon Platinum, and 1600 GiB RAM
NVLink interconnect for GPUs on the same node
Nodes are interconnected with Infiniband

2-node stable diffusion training (MLCommons benchmark v3.0)

The benchmarked performance of Soperator aligns with that of a standard Slurm-over-VMs setup, averaging ~20-20.5 seconds per 100 training steps.

64-node GPT-3 pretraining (MLCommons benchmark v4.0)

My single-run test yielded a result of 54.03 minutes, showing:

3.3% slower than this NVIDIA’s result (52.28 min)
7% slower than NVIDIA’s best result (50.45 min). But as I understand, this is just another run of the same job on the same cluster.
4.6% faster than this SMC result (56.66 min). Their other results are almost the same.

NVIDIA's results are typically optimal, so a slight difference is to be expected. Also, since I've only tried it once, this result is probably not the best that Soperator can have. Overall, these comparisons suggest that Soperator performs on par with native Slurm setups.

The training performance depends more on the hardware and perhaps even on the temperature in data centers than on the use of containers and K8s.

Our simpler checks, such as NCCL tests, also show no difference.

Overhead Considerations

In theory, Soperator doesn’t introduce specific overhead in training workloads. While there may be minor syscall overhead due to containerization, it’s inconsequential to training performance. Furthermore:

GPUs remain non-virtualized, and model training is generally CPU-light.
Soperator can be installed directly on a Kubernetes cluster running over bare metal, eliminating any VM-induced latency.

The only potential initialization slowdown we’ve observed stems from using a shared root filesystem, which can delay library loading during startup. However, this effect is minimal and not very significant in model training. You also have an opportunity to store them on local disks.

I apologize for the lack of formally designed benchmark results. But, based on theory and our observations, we believe there is no noticeable overhead.

CrackedPoly commented 3 weeks ago

Hi @rdjjke , thank you so much about the information! It seems soperator is a very promising solution. But I want to comfirm that by Slurm-over-VMs, you mean running slurmd in VMs and spawn job process there or running slurmd in bare metals but jobs are executed in VMs?

rdjjke commented 3 weeks ago

By Slurm-over-VMs I mean a typical Slurm installation (without Soperator) where Slurm daemons are installed on virtual machines (including slurmd). Jobs are child processes of slurmd, so they're also executed on virtual machines. In this case, the virtualization overhead exists (which is small, but still exists for CPU, filesystem, syscalls), but there is no Kubernetes / containerization overhead (which is so small that it can be ignored anyway).

If your Slurm deployment is in a cloud (i.e. AWS, Azure, Nebius, etc.) then the compute instances you use are probably virtual machines. But there are some bare-metal clouds, which provide bare-metal hosts to their users (not VMs). They can also provide K8s clusters over bare-metal hosts.

When I talked about NVIDIA and SMC in the previous message, I meant that they most likely used bare-metal setups.

rdjjke commented 3 weeks ago

I may not have described it quite clearly, so I'll try to rehash it.

Kubernetes takes a number of hosts (which can be virtual machines or bare-metal servers) and runs containers on them. Everything else it provides is convenient management of these containers. So the overhead of using Kubernetes is equal to the overhead of using containers.

The overhead of containers is negligibly small. BTW, Slurm jobs in the MLCommons recipes use enroot containers anyway, so they're containerized even when Soperator isn't used. In the Soperator's case, these containers run inside K8s containers, which is also OK.

The overhead of virtual machines exists, but it's small for model training because GPUs are usually not virtualized.

CrackedPoly commented 3 weeks ago

Cool. Let me summarize this.

In your benchmark v3.0, you are comparing containers in VMs and containers in containers in VMs, which are all conducted in your cloud. The 2-layer containerization is essentially the same as 1-layer containerization because of hierarchical cgroup. So no performance gap in action is expected.
In your benchmark v4.0, you are comparing containers in containers in VMs and containers in bare-metals. So the performance gap is the virtualization. And because of the uncontrolled environment, the figures compared with public results (NVIDIA, SMC) come from virtualization and some minor environment difference.

rdjjke commented 2 weeks ago

Yes. Correct.

nebius / soperator