jafioti / luminal

Deep learning at the speed of light.
https://luminalai.com
Apache License 2.0

Multi GPU support #48

Open b0xtch opened 2 months ago

b0xtch commented 2 months ago

Given there is already support for NCCL, what's the overhead of adding multi-node GPU support for training/inference?

jafioti commented 2 months ago

There are two different ways to do multi-GPU: multi-device and multi-host. We'll need both to truly reach LLM-scale training, but we should start with multi-device.

The primary task to be solved is figuring out the optimal way to distribute nodes to different devices. In theory, a compiler can go through a complete graph and, with the knowledge of the available devices, make decisions on which ops should happen where.

A few decisions need to be made before any progress happens:

  • Does this partitioning happen before or after the bulk of the CUDA compilers? If before, how do we deal with the compute characteristics being changed by downstream compilers? If after, how do we get the characteristics of each op (compute intensity, etc.) in a standardized way, and/or handle the case where we don't know about an op?
  • How can we do parallelism? It's fine to have a graph distributed among many GPUs, but currently luminal is set up to run single ops one at a time, so how can we run multiple ops at once? A separate MultiGpuOp that contains multiple ops, one on each device, and runs them all at once?
  • Different forms of parallelism have different compiler considerations here:

    • Data parallel
    • Pipeline parallel
    • Tensor parallel

I'm absolutely interested in moving forward on this, but I need to dedicate some time to thinking deeply about the best path.
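To make the partitioning question concrete, here is a rough sketch of one possible device-assignment pass: walk the ops in topological order and greedily place each one on the least-loaded device. The Node type, est_flops field, and assign_devices function are hypothetical placeholders for illustration, not luminal's compiler API.

```rust
use std::collections::HashMap;

// Hypothetical sketch of a partitioning pass: walk the op graph in topological
// order and greedily assign each node to the least-loaded device, using a
// per-op cost estimate. A real pass would also weigh transfer costs on edges.
#[derive(Debug)]
struct Node {
    name: &'static str,
    est_flops: f64, // rough compute-intensity estimate reported by the backend
}

fn assign_devices(topo_order: &[Node], n_devices: usize) -> HashMap<&'static str, usize> {
    let mut load = vec![0.0_f64; n_devices];
    let mut placement = HashMap::new();
    for node in topo_order {
        // Pick the device with the smallest accumulated load so far.
        let (device, _) = load
            .iter()
            .enumerate()
            .min_by(|a, b| a.1.partial_cmp(b.1).unwrap())
            .unwrap();
        load[device] += node.est_flops;
        placement.insert(node.name, device);
    }
    placement
}

fn main() {
    let graph = vec![
        Node { name: "embed", est_flops: 1.0 },
        Node { name: "attn", est_flops: 8.0 },
        Node { name: "mlp", est_flops: 6.0 },
        Node { name: "lm_head", est_flops: 4.0 },
    ];
    println!("{:?}", assign_devices(&graph, 2));
}
```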

b0xtch commented 2 months ago


I agree that multi-device should come first. Candle has this interesting multi-process example; you could even adapt it for multi-device using MPI or just NCCL and build a p2p communication layer. Regarding a complete graph of ops, NCCL handles inter-/intra-node communication well.
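As a rough illustration of the kind of communication layer that could sit underneath, here is an in-process all-reduce with threads standing in for ranks. AllReduce and all_reduce_sum are made-up names; a real backend would implement this with NCCL or MPI rather than a shared Mutex.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

// Toy sum all-reduce: every rank adds its buffer into a shared accumulator,
// waits at a barrier until everyone has contributed, then reads the result.
struct AllReduce {
    sum: Mutex<Vec<f32>>,
    barrier: Barrier,
}

impl AllReduce {
    fn new(len: usize, world: usize) -> Self {
        Self { sum: Mutex::new(vec![0.0; len]), barrier: Barrier::new(world) }
    }

    fn all_reduce_sum(&self, buf: &mut [f32]) {
        {
            let mut sum = self.sum.lock().unwrap();
            for (s, x) in sum.iter_mut().zip(buf.iter()) {
                *s += *x;
            }
        }
        self.barrier.wait(); // all contributions are in
        let sum = self.sum.lock().unwrap();
        buf.copy_from_slice(sum.as_slice());
    }
}

fn main() {
    let world = 4;
    let comm = Arc::new(AllReduce::new(3, world));
    let handles: Vec<_> = (0..world)
        .map(|rank| {
            let comm = Arc::clone(&comm);
            thread::spawn(move || {
                // Each "device" holds its own local gradient shard.
                let mut grads = vec![rank as f32 + 1.0; 3];
                comm.all_reduce_sum(&mut grads);
                println!("rank {rank}: {grads:?}"); // every rank sees the same sum
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```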

Notes:

Goals:

Possible approaches depending on setup:

jorgeantonio21 commented 2 months ago

This is a very relevant topic for applications. @jafioti, to be honest, I am not an expert on the approach taken by Luminal, but it makes sense to me to have tensor parallelism at the graph level, so graph ops can be distributed across multiple GPUs.

On the other hand, distributing full graph operations across distinct devices could introduce subtleties around state interdependencies between devices.

Another topic that could be relevant in a future implementation is determinism. Given the non-associative nature of floating-point reduction operations, the order in which the GPUs finish their execution can affect the final output. An implementation that achieves full determinism (at least on a single machine) will be relevant for certain applications, even at a possible cost in performance.
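For concreteness, a small standalone Rust snippet (not tied to luminal) showing how reduction order alone changes a floating-point sum:

```rust
// Summing the same values in two different orders gives two different f32
// results, which is why reduction order across GPUs matters for determinism.
fn main() {
    let xs: Vec<f32> = (1..=100_000).map(|i| 1.0 / i as f32).collect();

    // "Device order" accumulation: left to right.
    let forward: f32 = xs.iter().sum();

    // A different completion order: right to left.
    let reverse: f32 = xs.iter().rev().sum();

    println!("forward = {forward}, reverse = {reverse}, equal = {}", forward == reverse);
}
```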

b0xtch commented 2 months ago

@jorgeantonio21

Determinism is essential. Given the project I am working on, cross-platform consistency and verifiable communication (moot) across nodes are paramount.

Based on the Agatha paper, here are some conclusions we took into consideration when writing our own paper:

Problem

As you said, a computation will usually give two different results on two different machines, even when the same source of randomness is used, whether in inference or training. This is due to the accumulation of rounding errors in floating-point arithmetic.

Even when using the same randomness seed, two different machine learning computations on two separate machines may produce different results due to potential variations in hardware architecture, software implementations, or runtime conditions. Different hardware's inherent parallelism and optimization strategies can lead to subtle numerical differences in floating-point operations, impacting the model's intermediate states during training or inference. Variations in software libraries, compiler optimizations, or even the underlying operating system can also contribute to discrepancies. Moreover, differences in the order of parallelized operations or the timing of asynchronous tasks may introduce subtle divergences in the computations.

While randomness seeds aim to provide reproducibility, these inherent system-level variabilities can cause small numerical divergences, resulting in distinct outcomes across different machines despite identical seeds.

Solutions

One way to address this issue is quantization, which reduces the variability in computations by lowering the precision of numerical representations. In traditional neural networks, computations are performed with high-precision floating-point numbers, which are susceptible to subtle variations across hardware architectures. Quantization maps these floating-point values to a lower precision, typically fixed-point or integer representations. This reduction in precision accelerates inference by requiring less memory and computation, and it enhances reproducibility.
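As a toy illustration (standalone code, not luminal's quantization), quantizing to i8 and accumulating in i32 makes the reduction exact, so the result no longer depends on accumulation order:

```rust
// Quantize f32 values to i8 with a fixed scale, then accumulate in i32.
// Integer addition is associative, so the result does not depend on the order
// in which partial sums from different devices arrive.
fn quantize(x: f32, scale: f32) -> i8 {
    (x / scale).round().clamp(i8::MIN as f32, i8::MAX as f32) as i8
}

fn main() {
    let scale = 0.05_f32;
    let xs: Vec<f32> = (0..1000).map(|i| (i as f32 * 0.1).sin()).collect();
    let q: Vec<i8> = xs.iter().map(|&x| quantize(x, scale)).collect();

    // Two different accumulation orders give bit-identical integer sums.
    let forward: i32 = q.iter().map(|&v| v as i32).sum();
    let reverse: i32 = q.iter().rev().map(|&v| v as i32).sum();
    assert_eq!(forward, reverse);

    // Dequantize to recover the approximate floating-point value.
    println!("sum ≈ {}", forward as f32 * scale);
}
```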


Another critical topic is inter-node communication efficiency for training/inference. As claimed in this section of ml-engineering, the gap in communication overhead is closing. But this assumes all nodes speak the same language and run on the same hardware. What if they don't? Can we still make them good enough for multi-GPU, multi-node setups?

jafioti commented 2 months ago

I think for now let's just think about how we can make this work on a single machine. Multi-machine determinism seems quite difficult (though maybe not impossible).

For single-machine inference, I'm not too concerned with slight floating-point differences, as inference is generally more tolerant of error (see low-bit quantized inference). Training is much more sensitive, but if other frameworks are able to cope with these errors, we should be able to as well.

What I am interested in with regard to determinism is the ability to precisely hide latency. If we know deterministically how long data transfer takes, we can do things like ring attention and pipeline parallelism by computing while data is in transit, knowing exactly when it will arrive: a well-choreographed dance where hardware utilization is maximized. This might be a ways off for now, though.
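A rough sketch of the latency-hiding idea, with a thread standing in for an async device-to-device transfer (all names and timings here are made up for illustration):

```rust
use std::thread;
use std::time::{Duration, Instant};

// Kick off a (simulated) transfer of the next block while computing on the
// current one, so the transfer latency is hidden behind useful work.
fn main() {
    let start = Instant::now();

    // Pretend this is an async copy of the next activations to another device.
    let transfer = thread::spawn(|| {
        thread::sleep(Duration::from_millis(50)); // simulated link latency
        vec![1.0_f32; 1024]                       // the "arrived" buffer
    });

    // Meanwhile, do the compute we already have the data for.
    let local: f32 = (0..1_000_000).map(|i| (i as f32).sqrt()).sum();

    // By the time we need the transferred data, it is (ideally) already there.
    let next_block = transfer.join().unwrap();
    println!(
        "local compute = {local:.1}, received {} elems, elapsed = {:?}",
        next_block.len(),
        start.elapsed()
    );
}
```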

For the immediate future, I think this can be tackled in a few steps:

Then we can start thinking about pipeline parallelism. This will be non-trivial because luminal is fundamentally single-threaded: it executes a single stream of ops at a time. The likely solution is to have wrapper ops that contain all parts of the pipeline and execute them at once on different devices. So if you have op1 -> op2 -> op3 and 3 devices, you'd compile to parallel(d1:op1, d2:op2, d3:op3) -> parallel_transfer -> parallel(d1:op1, d2:op2, d3:op3) -> parallel_transfer -> parallel(d1:op1, d2:op2, d3:op3). This can all be derived by a compiler given the original graph.
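A rough sketch of what such a wrapper op could look like, using hypothetical Op/ParallelOp types and threads as stand-in devices rather than luminal's actual op machinery:

```rust
use std::thread;

// Hypothetical stand-ins for an op and a wrapper that runs one pipeline stage
// per device simultaneously, each on that device's current micro-batch.
trait Op: Send + Sync {
    fn run(&self, input: Vec<f32>) -> Vec<f32>;
}

struct Scale(f32);
impl Op for Scale {
    fn run(&self, input: Vec<f32>) -> Vec<f32> {
        input.into_iter().map(|x| x * self.0).collect()
    }
}

/// parallel(d1:op1, d2:op2, d3:op3): execute every stage at once, each on its
/// own "device" (a thread here).
struct ParallelOp {
    stages: Vec<Box<dyn Op>>,
}

impl ParallelOp {
    fn run(&self, inputs: Vec<Vec<f32>>) -> Vec<Vec<f32>> {
        thread::scope(|s| {
            let handles: Vec<_> = self
                .stages
                .iter()
                .zip(inputs)
                .map(|(stage, input)| s.spawn(move || stage.run(input)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        })
    }
}

fn main() {
    let pipeline = ParallelOp {
        stages: vec![Box::new(Scale(2.0)), Box::new(Scale(3.0)), Box::new(Scale(4.0))],
    };
    // Three micro-batches, one currently resident on each device.
    let outputs = pipeline.run(vec![vec![1.0; 4], vec![1.0; 4], vec![1.0; 4]]);
    println!("{:?}", outputs);
    // A parallel_transfer step would then shift each output to the next device.
}
```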

I'll have to think about this more, but the first step, tensor-parallel matmuls, seems the most straightforward. The pipeline-parallel approach can likely power data parallelism too, by using the parallel() ops.
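For reference, a minimal standalone sketch of a tensor-parallel matmul: shard B by columns, compute each shard independently (sequentially here, standing in for per-device kernels), and concatenate the column slices. None of these helpers are luminal API.

```rust
// Tensor-parallel matmul sketch: C = A @ B, with B split into column shards so
// each "device" computes its own slice of C's columns independently.
fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                c[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    c
}

/// Split b into `shards` equal column blocks (one per device).
fn shard_columns(b: &[Vec<f32>], shards: usize) -> Vec<Vec<Vec<f32>>> {
    let per = b[0].len() / shards;
    (0..shards)
        .map(|s| b.iter().map(|row| row[s * per..(s + 1) * per].to_vec()).collect())
        .collect()
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]]; // 2x2
    let b = vec![vec![1.0, 0.0, 2.0, 1.0], vec![0.0, 1.0, 1.0, 2.0]]; // 2x4

    // Each shard is an independent matmul; on real hardware each would run on
    // its own GPU, with A replicated.
    let partials: Vec<_> = shard_columns(&b, 2).iter().map(|bs| matmul(&a, bs)).collect();

    // Concatenate the column slices back together to recover C = A @ B.
    let c: Vec<Vec<f32>> = (0..a.len())
        .map(|i| partials.iter().flat_map(|p| p[i].clone()).collect())
        .collect();
    println!("{:?}", c);
    assert_eq!(c, matmul(&a, &b));
}
```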