Deep learning at the speed of light.
Luminal is a deep learning library that uses composable compilers to achieve high performance.
use luminal::prelude::*;
// Setup graph and tensors
let mut cx = Graph::new();
let a = cx.tensor((3, 1)).set([[1.0], [2.0], [3.0]]);
let b = cx.tensor((1, 4)).set([[1.0, 2.0, 3.0, 4.0]]);
// Do math...
let mut c = a.matmul(b).retrieve();
// Compile and run graph
cx.compile(<(GenericCompiler, CPUCompiler)>::default(), &mut c);
cx.execute();
// Get result
println!("Result: {:?}", c);
Llama 3 8B
cd ./examples/llama
# Download the model
bash ./setup/setup.sh
# Run the model
cargo run --release --features metal # MacOS (Recommended)
cargo run --release --features cuda # Nvidia
cargo run --release # CPU
Luminal can run Q8 Llama 3 8B on M-series MacBooks at 15-25 tokens per second. The goal is to become the fastest ML framework for any model on any device.
The core of luminal is and always will be minimal. It should be possible to understand the entire core library in an afternoon.
Everything in luminal boils down to 12 primitive ops:
Log2, Exp2, Sin, Sqrt, Recip
Add, Mul, Mod, LessThan
SumReduce, MaxReduce, Contiguous
These ops are enough to support transformers, convnets, etc.
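To make the claim concrete, here is a minimal scalar sketch of how some familiar operations can be rebuilt from a handful of the primitives. This is only an illustration under stated assumptions: the function names are hypothetical, the code works on single `f32` values rather than tensors, and it is not necessarily how luminal itself decomposes these ops.

```rust
use std::f32::consts::{FRAC_PI_2, LN_2, LOG2_E};

// Scalar stand-ins for a subset of the primitive ops (hypothetical names).
fn log2(x: f32) -> f32 { x.log2() }
fn exp2(x: f32) -> f32 { x.exp2() }
fn sin(x: f32) -> f32 { x.sin() }
fn recip(x: f32) -> f32 { x.recip() }
fn add(a: f32, b: f32) -> f32 { a + b }
fn mul(a: f32, b: f32) -> f32 { a * b }
// LessThan yields 1.0 for true and 0.0 for false, so it can be used as a mask.
fn less_than(a: f32, b: f32) -> f32 { if a < b { 1.0 } else { 0.0 } }

// Derived ops, written using only the primitives above.
fn sub(a: f32, b: f32) -> f32 { add(a, mul(b, -1.0)) }
fn div(a: f32, b: f32) -> f32 { mul(a, recip(b)) }
// e^x = 2^(x * log2(e))
fn exp(x: f32) -> f32 { exp2(mul(x, LOG2_E)) }
// ln(x) = log2(x) * ln(2)
fn ln(x: f32) -> f32 { mul(log2(x), LN_2) }
// cos(x) = sin(pi/2 - x)
fn cos(x: f32) -> f32 { sin(sub(FRAC_PI_2, x)) }
// Select b where a < b, otherwise a, using the LessThan mask.
fn maximum(a: f32, b: f32) -> f32 {
    let mask = less_than(a, b);
    add(mul(mask, b), mul(sub(1.0, mask), a))
}
fn relu(x: f32) -> f32 { maximum(x, 0.0) }
// sigmoid(x) = 1 / (1 + e^(-x))
fn sigmoid(x: f32) -> f32 { recip(add(1.0, exp(mul(x, -1.0)))) }
```

Subtraction, division, natural exponent and log, cosine, ReLU, and sigmoid all fall out of seven primitives here; the reduce ops would cover sums, means, and softmax-style normalization in the same spirit.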
The current ML ecosystem is too fragmented, and the solution isn't another layer of abstraction. Luminal is written in Rust, and interacts directly with the CUDA / Metal APIs. No indirections or abstractions, Docker containers, or virtual environments. Just a statically-linked Rust crate.
Correctness matters. So we write as many tests as possible to cover all ops and verify they work the same as an equivalent PyTorch implementation. (Improvements needed!)
Most deep learning libraries are eager-first, meaning each op call directly operates on the data. In PyTorch, when you see x + y, the addition actually happens right there. This is great for debugging because it works exactly as most developers expect.
However, this isn't great for performance. What makes sense for a developer doesn't work well for the machine, in the same way that no one writes assembly by hand. Most libraries try to fix this problem by tacking on operator fusion or JIT compilation to reshape the execution flow into something better for the machine. Turns out this is super difficult, even for PyTorch!
A core tenet of Luminal is ahead-of-time compilation. Whenever possible, push everything to compile time and leave nothing to run time. Luminal takes an approach more similar to XLA and tinygrad. Everything's static here. When you write out an expression like x + y, no actual computation happens. The operation is recorded to a directed acyclic computation graph for execution later. Only once graph.execute() is run does the computation happen. But isn't that just lazy execution? Yes it is! But in luminal everything is done this way. All neural networks are built up as one or a few static computation graphs, compiled, and executed later.
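The record-now, run-later idea can be sketched in a few lines of plain Rust. Everything below is a toy under stated assumptions: `TinyGraph`, `Op`, and their methods are hypothetical names invented for illustration, they operate on scalars instead of tensors, and they are not luminal's actual `Graph` API.

```rust
// Each node either holds an input value or refers to earlier nodes by index.
#[derive(Clone, Copy, Debug)]
enum Op {
    Input(f32),
    Add(usize, usize),
    Mul(usize, usize),
}

// A toy static graph: building it only records ops, nothing is computed.
#[derive(Default)]
struct TinyGraph {
    nodes: Vec<Op>,
    values: Vec<Option<f32>>,
}

impl TinyGraph {
    fn push(&mut self, op: Op) -> usize {
        self.nodes.push(op);
        self.values.push(None); // no result until execute() runs
        self.nodes.len() - 1
    }
    fn input(&mut self, v: f32) -> usize { self.push(Op::Input(v)) }
    fn add(&mut self, a: usize, b: usize) -> usize { self.push(Op::Add(a, b)) }
    fn mul(&mut self, a: usize, b: usize) -> usize { self.push(Op::Mul(a, b)) }

    // Nodes are appended after their operands, so index order is already
    // a valid topological order of the DAG.
    fn execute(&mut self) {
        for i in 0..self.nodes.len() {
            let v = match self.nodes[i] {
                Op::Input(v) => v,
                Op::Add(a, b) => self.values[a].unwrap() + self.values[b].unwrap(),
                Op::Mul(a, b) => self.values[a].unwrap() * self.values[b].unwrap(),
            };
            self.values[i] = Some(v);
        }
    }

    fn get(&self, id: usize) -> Option<f32> { self.values[id] }
}
```

Calling `add` or `mul` returns a node id and leaves `get(id)` as `None`; values only appear after `execute()`. Because the whole graph exists before anything runs, a compiler step could be slotted in between building and executing.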
But why?
A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, our compilers have global knowledge. This means we can push most ML complexity to the compilers. For instance, devices, datatypes, and execution schedules are all handled by compilers. Even autograd will be handled by a compiler!
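As a rough illustration of what "the computation that runs can differ from the code that was written" means, here is a toy rewrite pass over a small expression tree. It is a sketch under stated assumptions: `Expr` and `simplify` are hypothetical names, the rules are purely algebraic (so `recip(recip(x)) -> x` ignores floating-point rounding), and luminal's real compilers work on a full graph rather than a tree like this.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Input(&'static str),
    Const(f32),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Recip(Box<Expr>),
}

// Bottom-up rewrite: simplify the children first, then apply local rules.
fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (simplify(*a), simplify(*b)) {
            // Constant folding
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x + y),
            // x + 0 -> x
            (Expr::Const(z), other) if z == 0.0 => other,
            (other, Expr::Const(z)) if z == 0.0 => other,
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (simplify(*a), simplify(*b)) {
            // Constant folding
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x * y),
            // x * 1 -> x
            (Expr::Const(o), other) if o == 1.0 => other,
            (other, Expr::Const(o)) if o == 1.0 => other,
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        Expr::Recip(inner) => match simplify(*inner) {
            // recip(recip(x)) -> x (algebraic identity, not bit-exact)
            Expr::Recip(x) => *x,
            other => Expr::Recip(Box::new(other)),
        },
        leaf => leaf,
    }
}
```

For example, `(x + 0) * 1` collapses to just `x`, so the work that would run is smaller than what was written. The user code never changes; only the pass that sits between graph construction and execution does.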
Now we can do:
Once you've written all your computation code, run cx.display() to see the entire computation graph in all its glory. Pretty messy looking! Now run cx.compile(GenericCompiler::default()) and display the graph again. Much better.
examples/. See instructions above for running.
luminal_nn, including transformers.
hl_ops. We are aiming to match the most used ~80% of the PyTorch API.
Some things on the roadmap:
Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.