jafioti / luminal

Deep learning at the speed of light.
https://luminalai.com
Apache License 2.0

Multi GPU support #48

Open b0xtch opened 2 months ago

b0xtch commented 2 months ago

Given there is already support for NCCL, what's the overhead of adding multi-node GPU support for training/inference?

jafioti commented 2 months ago

There are two different ways to do multi-GPU: multi-device and multi-host. We'll need both to truly reach LLM-scale training, but we should start with multi-device.

The primary task to be solved is figuring out the optimal way to distribute nodes to different devices. In theory, a compiler can go through a complete graph and, with the knowledge of the available devices, make decisions on which ops should happen where.

A few decisions need to be made before any progress happens:

  • Does this partitioning happen before or after the bulk of the CUDA compilers? If before, how do we deal with the compute characteristics being changed by downstream compilers? If after, how do we get the characteristics of each op (compute intensity, etc.) in a standardized way, and/or handle the case where we don't know about an op?
  • How can we do parallelism? It's fine to have a graph distributed among many GPUs, but currently luminal is set up to run single ops one at a time, so how can we run multiple ops at once? A separate MultiGpuOp that contains multiple ops, one on each device, and runs them all at once?
  • Different forms of parallelism have different compiler considerations here:

    • Data parallel
    • Pipeline parallel
    • Tensor parallel

I'm absolutely interested in moving forward on this, but I need to dedicate some time to thinking deeply about the best path.
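To make the partitioning question concrete, here is a rough sketch of one possible device-assignment pass: walk the ops in topological order and greedily place each one on the least-loaded device. The Node type, est_flops field, and assign_devices function are hypothetical placeholders for illustration, not luminal's compiler API.

```rust
use std::collections::HashMap;

// Hypothetical sketch of a partitioning pass: walk the op graph in topological
// order and greedily assign each node to the least-loaded device, using a
// per-op cost estimate. A real pass would also weigh transfer costs on edges.
#[derive(Debug)]
struct Node {
    name: &'static str,
    est_flops: f64, // rough compute-intensity estimate reported by the backend
}

fn assign_devices(topo_order: &[Node], n_devices: usize) -> HashMap<&'static str, usize> {
    let mut load = vec![0.0_f64; n_devices];
    let mut placement = HashMap::new();
    for node in topo_order {
        // Pick the device with the smallest accumulated load so far.
        let (device, _) = load
            .iter()
            .enumerate()
            .min_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .unwrap();
        load[device] += node.est_flops;
        placement.insert(node.name, device);
    }
    placement
}

fn main() {
    let graph = vec![
        Node { name: "embed", est_flops: 1.0 },
        Node { name: "attn", est_flops: 8.0 },
        Node { name: "mlp", est_flops: 6.0 },
        Node { name: "lm_head", est_flops: 4.0 },
    ];
    println!("{:?}", assign_devices(&graph, 2));
}
```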

b0xtch commented 2 months ago


I agree that multi-device should come first. Candle has this interesting multi-process example; you could even adapt it for multi-device using MPI or just NCCL and build a p2p communication layer. Regarding a complete graph of ops, NCCL handles inter-/intra-node communication well.
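As a rough illustration of the kind of communication layer that could sit underneath, here is an in-process all-reduce with threads standing in for ranks. AllReduce and all_reduce_sum are made-up names; a real backend would implement this with NCCL or MPI rather than a shared Mutex.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

// Toy sum all-reduce: every rank adds its buffer into a shared accumulator,
// waits at a barrier until everyone has contributed, then reads the result.
struct AllReduce {
    sum: Mutex<Vec<f32>>,
    barrier: Barrier,
}

impl AllReduce {
    fn new(len: usize, world: usize) -> Self {
        Self { sum: Mutex::new(vec![0.0; len]), barrier: Barrier::new(world) }
    }

    fn all_reduce_sum(&self, buf: &mut [f32]) {
        {
            let mut sum = self.sum.lock().unwrap();
            for (s, x) in sum.iter_mut().zip(buf.iter()) {
                *s += *x;
            }
        }
        self.barrier.wait(); // all contributions are in
        let sum = self.sum.lock().unwrap();
        buf.copy_from_slice(sum.as_slice());
    }
}

fn main() {
    let world = 4;
    let comm = Arc::new(AllReduce::new(3, world));
    let handles: Vec<_> = (0..world)
        .map(|rank| {
            let comm = Arc::clone(&comm);
            thread::spawn(move || {
                // Each "device" holds its own local gradient shard.
                let mut grads = vec![rank as f32 + 1.0; 3];
                comm.all_reduce_sum(&mut grads);
                println!("rank {rank}: {grads:?}"); // every rank sees the same sum
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```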

Notes:

Goals:

Possible approaches depending on setup:

jorgeantonio21 commented 2 months ago

This is a very relevant topic for applications. @jafioti, to be honest, I am not an expert on the approach taken by Luminal, but it makes sense to me to have tensor parallelism at the graph level, so graph ops can be distributed across multiple GPUs.

On the other hand, distributing full graph operations across distinct devices could introduce subtleties around state interdependencies between devices.

Another topic that could be relevant in a future implementation is determinism. Given the non-associative nature of floating-point reduction operations, the order in which the GPUs finish their execution can affect the final output. An implementation that achieves full determinism (at least on a single machine) will be relevant for certain applications, even at a possible cost in performance.
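For concreteness, a small standalone Rust snippet (not tied to luminal) showing how reduction order alone changes a floating-point sum:

```rust
// Summing the same values in two different orders gives two different f32
// results, which is why reduction order across GPUs matters for determinism.
fn main() {
    let xs: Vec<f32> = (1..=100_000).map(|i| 1.0 / i as f32).collect();

    // "Device order" accumulation: left to right.
    let forward: f32 = xs.iter().sum();

    // A different completion order: right to left.
    let reverse: f32 = xs.iter().rev().sum();

    println!("forward = {forward}, reverse = {reverse}, equal = {}", forward == reverse);
}
```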

b0xtch commented 2 months ago

@jorgeantonio21

Determinism is essential. Given the project I am working on, cross-platform consistency and verifiable communication (moot) across nodes are paramount.

Based on the Agatha paper, here are some conclusions we took into consideration when writing our own paper:

Problem

As you said, a computation will usually give two different results on two different machines, even when the same source of randomness is used, whether in inference or training. This is due to the accumulation of rounding errors in floating-point arithmetic.

Even when using the same randomness seed, two different machine learning computations on two separate machines may produce different results due to potential variations in hardware architecture, software implementations, or runtime conditions. Different hardware's inherent parallelism and optimization strategies can lead to subtle numerical differences in floating-point operations, impacting the model's intermediate states during training or inference. Variations in software libraries, compiler optimizations, or even the underlying operating system can also contribute to discrepancies. Moreover, differences in the order of parallelized operations or the timing of asynchronous tasks may introduce subtle divergences in the computations.

While randomness seeds aim to provide reproducibility, these inherent system-level variabilities can cause small numerical divergences, resulting in distinct outcomes across different machines despite identical seeds.

Solutions

One way to address this issue is quantization, which reduces the variability in computations by lowering the precision of numerical representations. In traditional neural networks, computations are performed with high-precision floating-point numbers, which are susceptible to subtle variations across hardware architectures. Quantization maps these floating-point values to a lower precision, typically fixed-point or integer representations. This reduction in precision accelerates inference by requiring less memory and computation, and it enhances reproducibility.
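As a toy illustration (standalone code, not luminal's quantization), quantizing to i8 and accumulating in i32 makes the reduction exact, so the result no longer depends on accumulation order:

```rust
// Quantize f32 values to i8 with a fixed scale, then accumulate in i32.
// Integer addition is associative, so the result does not depend on the order
// in which partial sums from different devices arrive.
fn quantize(x: f32, scale: f32) -> i8 {
    (x / scale).round().clamp(i8::MIN as f32, i8::MAX as f32) as i8
}

fn main() {
    let scale = 0.05_f32;
    let xs: Vec<f32> = (0..1000).map(|i| (i as f32 * 0.1).sin()).collect();
    let q: Vec<i8> = xs.iter().map(|&x| quantize(x, scale)).collect();

    // Two different accumulation orders give bit-identical integer sums.
    let forward: i32 = q.iter().map(|&v| v as i32).sum();
    let reverse: i32 = q.iter().rev().map(|&v| v as i32).sum();
    assert_eq!(forward, reverse);

    // Dequantize to recover the approximate floating-point value.
    println!("sum ≈ {}", forward as f32 * scale);
}
```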


Another critical topic is inter-node communication efficiency for training/inference. As claimed in this section of ml-engineering, the gap in communication overhead is closing. But this assumes all nodes speak the same language and run on the same hardware. What if they don't? Can we still make them good enough for multi-GPU, multi-node setups?

jafioti commented 2 months ago

I think for now let's just think about how we can make this work on a single machine. Multi-machine determinism seems quite difficult (though maybe not impossible).

For single-machine inference, I'm not too concerned with slight floating-point differences, as inference is generally more tolerant of error (see low-bit quantized inference). Training is much more sensitive, but if other frameworks are able to cope with these errors, we should be able to as well.

What I am interested in with regard to determinism is the ability to precisely hide latency. If we know deterministically how long data transfer takes, we can do things like ring attention and pipeline parallelism by computing while data is in transit, knowing exactly when it will arrive: a well-choreographed dance where hardware utilization is maximized. This might be a ways off for now, though.
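A rough sketch of the latency-hiding idea, with a thread standing in for an async device-to-device transfer (all names and timings here are made up for illustration):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Kick off a (simulated) transfer of the next block while computing on the
// current one, so the transfer latency is hidden behind useful work.
fn main() {
    let start = Instant::now();

    // Pretend this is an async copy of the next activations to another device.
    let transfer = thread::spawn(|| {
        thread::sleep(Duration::from_millis(50)); // simulated link latency
        vec![1.0_f32; 1024]                       // the "arrived" buffer
    });

    // Meanwhile, do the compute we already have the data for.
    let local: f32 = (0..1_000_000).map(|i| (i as f32).sqrt()).sum();

    // By the time we need the transferred data, it is (ideally) already there.
    let next_block = transfer.join().unwrap();
    println!(
        "local compute = {local:.1}, received {} elems, elapsed = {:?}",
        next_block.len(),
        start.elapsed()
    );
}
```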

For the immediate future, I think this can be tackled in a few steps:

Then we can start thinking about pipeline parallelism. This will be non-trivial because luminal is fundamentally single-threaded: it executes a single stream of ops at a time. The likely solution is to have wrapper ops that contain all parts of the pipeline and execute them at once on different devices. So if you have op1 -> op2 -> op3 and 3 devices, you'd compile to parallel(d1:op1, d2:op2, d3:op3) -> parallel_transfer -> parallel(d1:op1, d2:op2, d3:op3) -> parallel_transfer -> parallel(d1:op1, d2:op2, d3:op3). This can all be derived by a compiler given the original graph.
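A rough sketch of what such a wrapper op could look like, using hypothetical Op/ParallelOp types and threads as stand-in devices rather than luminal's actual op machinery:

```rust
use std::thread;

// Hypothetical stand-ins for an op and a wrapper that runs one pipeline stage
// per device simultaneously, each on that device's current micro-batch.
trait Op: Send + Sync {
    fn run(&self, input: Vec<f32>) -> Vec<f32>;
}

struct Scale(f32);
impl Op for Scale {
    fn run(&self, input: Vec<f32>) -> Vec<f32> {
        input.into_iter().map(|x| x * self.0).collect()
    }
}

/// parallel(d1:op1, d2:op2, d3:op3): execute every stage at once, each on its
/// own "device" (a thread here).
struct ParallelOp {
    stages: Vec<Box<dyn Op>>,
}

impl ParallelOp {
    fn run(&self, inputs: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
        thread::scope(|s| {
            let handles: Vec<_> = self
                .stages
                .iter()
                .zip(inputs)
                .map(|(stage, input)| s.spawn(move || stage.run(input)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        })
    }
}

fn main() {
    let pipeline = ParallelOp {
        stages: vec![Box::new(Scale(2.0)), Box::new(Scale(3.0)), Box::new(Scale(4.0))],
    };
    // Three micro-batches, one currently resident on each device.
    let outputs = pipeline.run(vec![vec![1.0; 4], vec![1.0; 4], vec![1.0; 4]]);
    println!("{:?}", outputs);
    // A parallel_transfer step would then shift each output to the next device.
}
```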

I'll have to think about this more, but the first step, tensor-parallel matmuls, seems the most straightforward. The pipeline-parallel approach can likely power data parallelism too, by using the parallel() ops.
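For reference, a minimal standalone sketch of a tensor-parallel matmul: shard B by columns, compute each shard independently (sequentially here, standing in for per-device kernels), and concatenate the column slices. None of these helpers are luminal API.

```rust
// Tensor-parallel matmul sketch: C = A @ B, with B split into column shards so
// each "device" computes its own slice of C's columns independently.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                c[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    c
}

/// Split b into `shards` equal column blocks (one per device).
fn shard_columns(b: &[Vec<f32>], shards: usize) -> Vec<Vec<Vec<f32>>> {
    let per = b[0].len() / shards;
    (0..shards)
        .map(|s| b.iter().map(|row| row[s * per..(s + 1) * per].to_vec()).collect())
        .collect()
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]]; // 2x2
    let b = vec![vec![1.0, 0.0, 2.0, 1.0], vec![0.0, 1.0, 1.0, 2.0]]; // 2x4

    // Each shard is an independent matmul; on real hardware each would run on
    // its own GPU, with A replicated.
    let partials: Vec<_> = shard_columns(&b, 2).iter().map(|bs| matmul(&a, bs)).collect();

    // Concatenate the column slices back together to recover C = A @ B.
    let c: Vec<Vec<f32>> = (0..a.len())
        .map(|i| partials.iter().flat_map(|p| p[i].clone()).collect())
        .collect();
    println!("{:?}", c);
    assert_eq!(c, matmul(&a, &b));
}
```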