Closed pineking closed 1 year ago
We are seeing ca. 30% speedup at int4 vs. fp16 but nowhere near the benchmarks listed in the readme.
I notice that @turboderp is using a 12900K which has phenomenal single threaded performance. So it appears that we are still CPU bound for most people, which I find perplexing.
@pineking: The inference speed at least theoretically is 3-4x faster than FP16 once you're bandwidth-limited, since all that ends up mattering is how fast your GPU can read through every parameter of the model once per token. De-quantizing the weights on the fly is cheap compared to the memory access and should pipeline just fine, with the CUDA cores easily doing all the required computation on one batch of weights while waiting for the next batch to load from VRAM.
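The bandwidth argument above can be turned into a rough upper bound. This is a minimal sketch, not a measurement: the parameter count and the ~1 TB/s bandwidth are assumed round figures, and `max_tokens_per_second` is a hypothetical helper.

```python
def max_tokens_per_second(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on tokens/s if every weight is read from VRAM once per token."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


N_PARAMS = 7e9      # assumption: a nominal "7B" model
BANDWIDTH = 1.0e12  # assumption: ~1 TB/s, a round figure for a 4090-class card

fp16_bound = max_tokens_per_second(N_PARAMS, 16, BANDWIDTH)
int4_bound = max_tokens_per_second(N_PARAMS, 4, BANDWIDTH)

print(f"FP16 bound: {fp16_bound:.0f} t/s")
print(f"int4 bound: {int4_bound:.0f} t/s")
print(f"ratio: {int4_bound / fp16_bound:.1f}x")
```

Grouped 4-bit formats also store scales and zero points, so the effective bits per weight end up a little above 4, which is one plausible reason to expect 3-4x rather than exactly 4x.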
I should be getting a 3090-Ti tomorrow that has somewhat slower (and fewer) cores than the 4090 but the same memory bandwidth, so I should be able to confirm that it performs about as well as the 4090. Already I've confirmed that a 3070-Ti with half the bandwidth of the 4090 also gets about half the performance.
@qeternity: It's really difficult to profile a CPU bottleneck that I can't see because my CPU is too fast. :) I might try underclocking it at some point, or see if I can't force it to only use E-cores somehow.
You just can't meaningfully measure the time it takes a PyTorch operation to complete since everything actually runs in the CUDA queue. It's also kind of unpredictable the way PyTorch manages resources under the hood. All the automation is very convenient, and the overhead becomes negligible when you're just offloading large batches of computation to a GPU (the typical scenario during training, which appears to be the intended use case for PyTorch), but when going token by token many of the operations end up being quite small, so the overhead per-operation from using a complex framework in an interpreted language becomes significant.
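One common workaround for the timing problem is to drain the CUDA queue explicitly before and after the code being timed. The sketch below assumes nothing about exllama itself; `timed` is a hypothetical helper, and the import guard only exists so the snippet also runs on a machine without PyTorch or without a GPU.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def _cuda_ready():
    return torch is not None and torch.cuda.is_available()


def timed(fn, *args):
    """Return (result, seconds) for fn(*args), syncing the CUDA queue around the call."""
    if _cuda_ready():
        torch.cuda.synchronize()  # finish earlier queued work so it is not billed to fn
    start = time.perf_counter()
    result = fn(*args)
    if _cuda_ready():
        torch.cuda.synchronize()  # wait for the kernels that fn itself launched
    return result, time.perf_counter() - start


# CPU-only usage; with a GPU, fn would be something like lambda: a @ b
result, seconds = timed(sum, range(1000))
print(result, f"{seconds * 1e6:.1f} us")
```

Synchronizing this often distorts the very thing being measured, since it stops the CPU from running ahead of the GPU, so it is only useful for attributing time, not for benchmarking throughput.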
I try to work around that by bypassing PyTorch as much as I can. But this creates another problem of having to manually tune everything and that's a lot of work made much harder by the fact that I don't have 20 different hardware configurations to test on.
The approach I'm going for is to make it as fast as possible on my setup while keeping a catalog of all the methods I've experimented with. Then at some point (or gradually) I can add some of those as options that may work better on systems with slower CPUs, slower system RAM, older GPUs, power-limited GPUs, who knows. Keeping it all in the codebase makes it unmanageable, though. There are already a lot of permutations to validate, and the number doubles with every option added.
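To make the "doubles with every option" point concrete, here is a small sketch that enumerates every combination of a few on/off switches. The option names are hypothetical placeholders, not actual exllama settings.

```python
from itertools import product

# Hypothetical boolean tuning switches, for illustration only
options = ["use_smem", "fused_kernels", "cuda_graphs", "pytorch_fallback"]

# Every on/off combination that would need validating
configs = [
    dict(zip(options, flags))
    for flags in product([False, True], repeat=len(options))
]

print(len(configs))  # 2 ** 4 = 16
```

Each additional switch doubles the list, and that is before multiplying by the number of hardware setups to test on.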
I'm still curious what you mean by "nowhere near", though. Exactly what speeds are you getting, and what hardware setup is producing those poor results? What specific models?
Hi @turboderp - amazing work you've done! Didn't mean to imply otherwise. Thanks very much!
My observation re: CPU bottleneck is not directed at your work, it's something I see in most transformer implementations. And I haven't done any work to assess why this is the case, but your comments make sense to me.
With a 3090 Ti and an Epyc 7282 (very slow), we are getting ca. 34 t/s on 7b models and 26 t/s on 13b models.
There must be something else going on, then. That's a little less than half the single-core performance of the 12900K, but you're seeing much less than half the performance. 3090-Ti should be comparable to the 4090 (I'll know for sure tomorrow when it arrives) so there has to be something else slowing it down.
Is anything else using the GPU at the same time? I've noticed that it's very sensitive to that, and even a tiny little bit of animation going on in some other window can have a big impact, presumably because it relies heavily on caching.
Come to think of it, that might be a reason for going back to using SMEM... I'll experiment some more. :)
I've just tried it on a 3060 and am getting the same perf as on a 3090.
FWIW this is on Tensor Dock marketplace just for R&D purposes.
Been playing around with this a bit more, I can get 45-47 t/s with a 7b model and 36-37 t/s on a 13b model on a 4090 using latest drivers in an Ubuntu container w/ AMD 3990X.
@qeternity Could you elaborate on the hardware and software setup? 3990X is quite slow single-threaded, but not that slow. I wouldn't expect less than half the performance of a 12900K. How is it containerized? What's the host system?
Hypervisor hardware I can't be sure about, as this is a cloud GPU.
I would be more than happy to fund your access to GPU time if you're interested.
When I did some profiling last week, it seemed that most CPU time was spent shuffling between CPU and GPU in `.to()` invocations.
CPU profiling is a little tricky with this. I've run into the same thing when profiling, and it's caused by the fact that `.to("cpu")` is a synchronization point. PyTorch basically just waits in a busy loop for the CUDA stream to finish all pending operations before it can move the final GPU tensor across, and then the actual `.to()` operation takes like a microsecond or whatever. If you insert `torch.cuda.synchronize()` right before `logits = _move_tensor ...` at the end of `model.py`, the profiler should show significantly less time spent in `.to()`.
There is one other place where a bit of data is moved across, because the embedding table resides in system RAM. You can move it to VRAM as a test by replacing `"cpu"` with `"cuda:0"` on lines 615, 752 and 953, but in my testing it makes no difference at all, other than it eats up some VRAM. Even on my 3090, which is on a PCIe 3.0 x4 connection, I can't measure any difference in performance between copying just the input IDs to VRAM or copying the initial hidden state.
Then again, it could be that this virtualized environment just has really, really poor bandwidth between system RAM and VRAM. I guess you would measure it (crudely) with a script like this:
```python
import time

import torch

count = 100
tensor = torch.rand((1024, 1024), dtype=torch.float, device="cpu")
tensor_size = tensor.numel() * tensor.element_size()  # bytes per one-way copy

for i in range(10):  # Warmup
    tensor = tensor.to("cuda:0")
    tensor = tensor.to("cpu")

start = time.time()
for i in range(count):
    tensor = tensor.to("cuda:0")
    tensor = tensor.to("cpu")  # blocks until the copy back has finished
end = time.time()

duration = end - start
# Each iteration moves the tensor twice: host to device and back again
bandwidth = 2 * count * tensor_size / duration / 1024**3
print(f"Bandwidth {bandwidth:.4f} GB/s")
```
I'm getting 4-8 GB/s on the 4090 (PCIe 4.0 x16) and 2-3 GB/s for the 3090 (PCIe 3.0 x4). It oddly depends a lot on the shape of the tensor (!) and is far from the theoretical bus speed, but given that the data copied is on the order of 100 kB per token, I really doubt it's a bottleneck in any case.
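A quick back-of-the-envelope check of that claim, using the slowest bandwidth quoted above and the 100 kB per token estimate. The 100 t/s generation speed is an assumed figure for illustration, and small copies are in practice dominated by fixed latency rather than bandwidth, so treat this as a lower bound on transfer cost.

```python
BYTES_PER_TOKEN = 100_000          # "on the order of 100 kB per token"
BANDWIDTH_GBPS = 2.0               # slowest measured figure (3090 on PCIe 3.0 x4)
ASSUMED_TOKENS_PER_SECOND = 100.0  # assumption: illustrative generation speed

# Same GB definition as the benchmark script (1024**3 bytes)
transfer_s = BYTES_PER_TOKEN / (BANDWIDTH_GBPS * 1024**3)
token_budget_s = 1.0 / ASSUMED_TOKENS_PER_SECOND
fraction = transfer_s / token_budget_s

print(f"transfer per token: {transfer_s * 1e6:.1f} us")
print(f"share of a {token_budget_s * 1e3:.0f} ms token budget: {fraction:.2%}")
```

Even at the slowest bus speed, the raw transfer is well under 1% of the per-token time at this assumed speed.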
I don't really have time to get into optimizing on a cloud instance right now. But I'm rewriting the CUDA backend at the moment, with a bunch of switchable options for CUDA kernels etc. and more code moved to C++ where the performance is more predictable and it will be easier to profile. So there are improvements coming, don't worry. And I will find that CPU bottleneck if it kills me. Because really the CPU shouldn't matter at all here.
Many thanks for all your hard work. Running the above, I am actually getting 9-9.5 GB/s on a 3090 on PCIe 4.0 x16.
Here is the cProfile of the benchmark segments in `test_benchmark_inference.py`. You can see `model.py:1069(forward)` at 34ms per call, which is where our ca. 30 t/s comes from (it's about 15% slower with cProfile enabled).
ncalls tottime percall cumtime percall filename:lineno(function)
57344 0.863 0.000 0.863 0.000 {built-in method exllama_ext.q4v2_matmul}
57568 0.849 0.000 0.849 0.000 {built-in method exllama_ext.column_remap}
288354 0.837 0.000 0.837 0.000 {method 'view' of 'torch._C._TensorBase' objects}
16384 0.571 0.000 0.571 0.000 {built-in method torch.matmul}
57344 0.554 0.000 3.570 0.000 cuda_ext.py:70(_matmul_q4v2_matmul)
8224 0.512 0.000 4.612 0.001 model.py:440(forward)
8224 0.436 0.000 8.064 0.001 model.py:525(forward)
16705 0.419 0.000 0.419 0.000 {built-in method exllama_ext.rms_norm}
74273 0.381 0.000 0.381 0.000 {built-in method torch.empty_like}
8707 0.365 0.000 0.365 0.000 {method 'to' of 'torch._C._TensorBase' objects}
57792 0.365 0.000 0.365 0.000 {built-in method torch.empty}
8224 0.232 0.000 2.257 0.000 model.py:379(forward)
57568 0.209 0.000 3.834 0.000 cuda_ext.py:164(matmul_q4v2)
16448 0.203 0.000 0.203 0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
57568 0.199 0.000 0.199 0.000 model.py:135(_matmul_switch)
16448 0.186 0.000 0.186 0.000 {built-in method exllama_ext.rope}
32896 0.164 0.000 0.164 0.000 {method 'narrow' of 'torch._C._TensorBase' objects}
57568 0.152 0.000 4.257 0.000 model.py:301(forward)
108711 0.145 0.000 0.145 0.000 module.py:1617(__getattr__)
41088 0.134 0.000 0.134 0.000 {method 'transpose' of 'torch._C._TensorBase' objects}
8224 0.125 0.000 0.125 0.000 {built-in method torch._C._nn.silu}
8192 0.110 0.000 0.110 0.000 {method 'softmax' of 'torch._C._TensorBase' objects}
9252 0.065 0.000 0.419 0.000 model.py:848(_move_tensor)
257 0.059 0.000 8.617 0.034 model.py:1069(forward)
16705 0.054 0.000 0.670 0.000 cuda_ext.py:241(llama_rms_norm)
57568 0.051 0.000 0.051 0.000 model.py:246(quant_args)
16705 0.038 0.000 0.708 0.000 model.py:410(forward)
224 0.035 0.000 0.035 0.000 {built-in method exllama_ext.half_matmul_cublas}
8224 0.032 0.000 0.032 0.000 model.py:151(_attn_switch)
16448 0.030 0.000 0.217 0.000 cuda_ext.py:180(rope_)
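The `model.py:1069(forward)` row can be converted to tokens per second directly. In this sketch the only assumption is how to read "15% slower": it is taken to mean the profiled run reaches 85% of the unprofiled speed.

```python
FORWARD_CUMTIME_S = 8.617    # cumtime of model.py:1069(forward)
FORWARD_CALLS = 257          # ncalls of the same row
PROFILED_SPEED_RATIO = 0.85  # assumption: "15% slower" = 85% of normal speed

per_call_ms = FORWARD_CUMTIME_S / FORWARD_CALLS * 1e3
profiled_tps = FORWARD_CALLS / FORWARD_CUMTIME_S
estimated_tps = profiled_tps / PROFILED_SPEED_RATIO

print(f"{per_call_ms:.1f} ms per forward pass")
print(f"{profiled_tps:.1f} t/s under cProfile")
print(f"{estimated_tps:.1f} t/s estimated without the profiler")
```

The estimate lands close to the ca. 34 t/s reported earlier for 7b on the 3090 Ti and Epyc 7282.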
And with pytorch profiler:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
cudaLaunchKernel 24.80% 1.401s 24.80% 1.401s 5.634us 248583
aten::empty_strided 9.36% 528.518ms 9.36% 528.518ms 7.043us 75046
aten::view 7.24% 408.761ms 7.24% 408.761ms 1.302us 314022
aten::empty 7.04% 397.295ms 7.04% 397.295ms 6.810us 58340
aten::bmm 6.96% 393.086ms 8.89% 501.867ms 30.512us 16448
cudaMemcpyAsync 5.96% 336.628ms 5.96% 336.628ms 652.380us 516
aten::index_select 4.49% 253.716ms 4.52% 255.220ms 993.074us 257
aten::add 4.09% 230.845ms 5.71% 322.570ms 19.612us 16448
aten::copy_ 3.38% 190.694ms 11.37% 642.158ms 37.214us 17256
aten::matmul 3.27% 184.932ms 16.62% 938.609ms 56.187us 16705
aten::empty_like 2.30% 129.719ms 11.37% 642.358ms 8.645us 74306
aten::silu 2.23% 126.020ms 3.06% 172.930ms 21.027us 8224
aten::transpose 2.17% 122.690ms 2.37% 133.787ms 3.233us 41377
aten::_softmax 1.96% 110.789ms 2.63% 148.583ms 18.067us 8224
aten::div_ 1.84% 103.709ms 2.64% 148.949ms 18.182us 8192
aten::slice 1.74% 98.496ms 1.94% 109.443ms 3.226us 33930
aten::mul_ 1.62% 91.227ms 2.40% 135.656ms 16.495us 8224
cudaMemsetAsync 1.61% 90.957ms 1.61% 90.957ms 5.393us 16865
aten::expand 1.59% 89.582ms 1.68% 95.005ms 2.888us 32898
aten::reshape 1.58% 89.280ms 2.58% 145.878ms 3.482us 41890
aten::narrow 1.16% 65.336ms 2.95% 166.452ms 5.060us 32896
aten::triu 0.59% 33.415ms 0.59% 33.415ms 33.415ms 1
aten::softmax 0.54% 30.332ms 3.16% 178.646ms 21.723us 8224
aten::as_strided 0.49% 27.802ms 0.49% 27.802ms 0.253us 109999
aten::_reshape_alias 0.42% 23.750ms 0.42% 23.750ms 1.444us 16448
aten::_unsafe_view 0.37% 20.747ms 0.37% 20.747ms 1.240us 16737
aten::argmax 0.33% 18.683ms 0.34% 19.194ms 74.977us 256
aten::mm 0.15% 8.630ms 0.22% 12.618ms 49.097us 257
aten::fill_ 0.11% 6.142ms 0.16% 8.971ms 34.771us 258
aten::_to_copy 0.09% 5.160ms 6.59% 372.330ms 481.669us 773
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla... 0.06% 3.610ms 0.06% 3.610ms 1.239us 2913
aten::select 0.06% 3.326ms 0.06% 3.628ms 3.536us 1026
cudaStreamSynchronize 0.06% 3.200ms 0.06% 3.200ms 6.202us 516
aten::mul 0.05% 2.599ms 0.06% 3.297ms 51.516us 64
aten::embedding 0.04% 2.444ms 4.57% 258.178ms 1.005ms 257
aten::to 0.04% 2.005ms 6.62% 373.608ms 21.776us 17157
aten::zeros 0.03% 1.656ms 0.23% 12.807ms 49.833us 257
aten::unsqueeze 0.03% 1.504ms 0.03% 1.508ms 2.945us 512
cudaFuncSetAttribute 0.03% 1.497ms 0.03% 1.497ms 5.180us 289
aten::linear 0.03% 1.472ms 0.32% 18.246ms 70.996us 257
aten::_scaled_dot_product_attention_math 0.02% 1.333ms 0.30% 17.145ms 535.781us 32
cudaOccupancyMaxActiveBlocksPerMultiprocessor 0.02% 1.221ms 0.02% 1.221ms 6.359us 192
aten::zero_ 0.02% 1.018ms 0.12% 6.699ms 26.066us 257
aten::t 0.01% 834.000us 0.03% 1.495ms 5.817us 257
aten::clone 0.01% 760.000us 0.04% 2.407ms 75.219us 32
aten::add_ 0.01% 714.000us 0.02% 1.000ms 31.250us 32
aten::scaled_dot_product_attention 0.01% 607.000us 0.31% 17.752ms 554.750us 32
aten::full 0.00% 9.000us 0.06% 3.310ms 3.310ms 1
aten::expand_as 0.00% 5.000us 0.00% 15.000us 15.000us 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 5.647s
Another data point: Lambda Labs A100 40GB using AMD EPYC 7J13 does 40 t/s on 7b
Interestingly the GPU util is still hovering around 50% which is higher than I would have expected, and suggests a GPU-bound of 80 t/s. FWIW I have been testing with TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g
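The 80 t/s figure follows from scaling the measured speed by the reported utilization. A minimal sketch with a hypothetical helper name:

```python
def gpu_bound_tokens_per_second(measured_tps, gpu_utilization):
    """Extrapolate the speed at 100% utilization, assuming speed scales linearly with it."""
    if not 0 < gpu_utilization <= 1:
        raise ValueError("gpu_utilization must be in (0, 1]")
    return measured_tps / gpu_utilization


print(gpu_bound_tokens_per_second(40, 0.5))  # 80.0
```

The linear scaling is a strong assumption. Reported utilization generally reflects the share of time in which some kernel was running, not how efficiently the kernels use the hardware, so the real ceiling may be higher.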
As a point of reference for the benchmark proposed earlier:
3090 + 5950x on pcie 4x16: 12.3 GB/s avg
P40 + 2650Lv4 on pcie ?x16: 6.2 GB/s avg
T600 + 2650Lv4 on pcie ?x4: 2.6 GB/s avg
I am not sure about the generation of pcie on my second system because it should be 3.0, but that joke of a motherboard has the worst wiring possible for all lanes.
7b 4bit on a 4090 + 3900x is now up to 75 t/s! GPU usage hovering around 50% so this could genuinely just be down to 12900K single threaded performance.
EDIT: 13b 4bit is doing 62 t/s with GPU ca. 70%.
@qeternity : It seems likely that it's still CPU-bound somewhere. You're getting a little under half my performance on 7B, then 2/3 on 13B which would spend longer in each CUDA kernel. And with higher GPU utilization too, there's no other way to interpret that.
As for what to do about it, though... Well. So far relying less on PyTorch has been working out, so I can keep doing that. Also I think the overhead from kernel launches is starting to become a bottleneck, so I'm looking into tail launching with CUDA graphs, fusing kernels where it makes sense, and batching operations like cudaMemset (of all things). Multiple streams might also help in some places. There's a way forward at least for addressing the CPU bottleneck.
Python aside, a function like `cudaLaunchKernel` would be synchronous, so those 1.4s total would actually be spent just setting up execution contexts, I think. Anything to bring the number of kernel launches down is likely going to make a difference. And VRAM allocation looks like the second-biggest problem after that.
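The launch cost can be put in per-token terms using the `cudaLaunchKernel` row of the PyTorch profiler output and the 257 forward passes from the same run. This is plain arithmetic on the posted numbers; the resulting ceiling assumes launches are the only CPU work, which they are not.

```python
LAUNCH_CALLS = 248_583  # "# of Calls" for cudaLaunchKernel
LAUNCH_AVG_US = 5.634   # "CPU time avg" for cudaLaunchKernel
FORWARD_PASSES = 257    # forward passes in the profiled run

total_launch_s = LAUNCH_CALLS * LAUNCH_AVG_US * 1e-6
launches_per_pass = LAUNCH_CALLS / FORWARD_PASSES
launch_ms_per_pass = total_launch_s / FORWARD_PASSES * 1e3
ceiling_tps = 1e3 / launch_ms_per_pass  # if launching were the only CPU cost

print(f"total launch time: {total_launch_s:.3f} s")
print(f"launches per forward pass: {launches_per_pass:.0f}")
print(f"launch time per pass: {launch_ms_per_pass:.2f} ms")
print(f"ceiling from launches alone: {ceiling_tps:.0f} t/s")
```

Roughly a thousand launches per token puts a hard floor of several milliseconds on each forward pass on this CPU, regardless of how fast the GPU is.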
@turboderp I think these results are fantastic fwiw. Aside from binning off PyTorch entirely, or being able to JIT the Torch graph (which isn't currently performant due to the custom CUDA kernels afaict) there's probably not a whole lot of juice left to squeeze.
Edited to add: I say this because being CPU bound on a 4090 I am able to hit 50% GPU util which happens to be ca. 50% of your 4090 results. I suspect that your 4090 utilization is at or near 100% thanks to the 12900K.
I'm pretty sure there's quite a bit more to squeeze, though. Thing is, you shouldn't be CPU bound for this, because the CPU isn't doing anything during inference. It's pretty much just launching kernels one after another. At 5 us per launch it adds up, apparently, but fusing kernels or tail-launching graphs could reduce it by a significant amount. Also I'm getting bottlenecks in the stupidest places. Like, it also takes an (implicit) kernel launch to write a single float value of zero to global memory... just to initialize the accumulator for the norm of a single row. Initializing a cache of 10,000 zeroes would take the same time, so that's a straightforward optimization.
And while I'm getting close to 100% GPU utilization, that doesn't mean it's using the GPU optimally. One thing I haven't really gotten to yet is optimizing for the L1 cache and SMEM. Even L2 cache is worth looking at, since some of the matrices in 65B are about four times as big as the L2 cache on the 4090. Then of course there's the fact that generating text produces a very small hidden state (16 kB for the 65B model), which means there isn't that much to synchronize between GPUs if you attempted to use both of them at once. At least for matmuls. Self-attention would be more difficult to split, but still, something like 50-75% speedup sounds realistic for two identical GPUs.
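The 16 kB figure can be checked from the model dimensions. The hidden size of 8192 for the 65B model is not stated in the thread and is taken here as an assumption from the published LLaMA configuration; the hidden state is assumed to be stored in FP16.

```python
HIDDEN_SIZE = 8192  # assumption: hidden dimension of the 65B model
BYTES_PER_FP16 = 2  # assumption: hidden state kept in half precision

# One token's hidden state is a single vector of HIDDEN_SIZE values
hidden_state_bytes = HIDDEN_SIZE * BYTES_PER_FP16

print(hidden_state_bytes, "bytes =", hidden_state_bytes / 1024, "kB")
```

That is small enough that passing it between two GPUs once per layer should cost little in bandwidth, though per-transfer latency would still apply.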
@turboderp Do you see more or less equal speeds on 4090 and 3090 ti with a 7B? The table on the README shows up to 160t/s.
I tested a 3090 on an EPYC 7302 system and got to 80 t/s. I'm wondering if the CPU is the bottleneck or the GPU.
It's not quite equal yet. There might be some optimization to do on the kernels, but I am getting 127 t/s on the 3090. The EPYC is very slow, though, less than half the single-threaded performance of the 12900K, so that's probably what you're running into.
Despite the fact that the CPU "isn't doing anything" during inference, Python is still really slow, and then Torch's underlying C++ libraries add a little overhead as well. You can see what's happening in this trace:
The "CUDA API" row shows kernel launches. The first 8 launches (from "rmsnorm...") are launched one after another in the C++ extension. The launches after that are from PyTorch, and the difference is pretty obvious. It's not that there's any processing happening, it's just Python being slow compared to compiled C++ code.
Here is a complete forward pass for a single token:
During the big red "cudaMemcpyAsync" at the end of the pass the CPU is just waiting in a busy-loop for the CUDA queue to finish so the logits are ready to copy to system RAM. It should ideally be much longer, though. The fact that it's only some 30% of the whole forward pass means that if my CPU were 30% slower, the CPU would become the bottleneck.
That was yesterday, though. Here's where I'm at with the latest version that I pushed a few minutes ago:
The CPU finishes queuing up all the CUDA operations much faster. There's a lot more that could be done, but it involves some headache-inducing strided matmuls that I'm not keen on tackling right now. After that there's also graphs, which can cut the kernel launch overhead to about a third, but I'm hoping this is fast enough for now so I can focus on sampling or something.
Yeah, I tested your latest version and it got me from 80 t/s to 110 t/s on 7B, great work!
Which commit did you test? Is it https://github.com/turboderp/exllama/commit/7805e2bc7763a60ee560554cde64a53f3b885889 ?
What's the name of the tool in the screenshot?
Tested with https://github.com/turboderp/exllama/commit/7805e2bc7763a60ee560554cde64a53f3b885889 and llama13b-4bit-128g.safetensors, using `test_benchmark_inference.py` on an A6000 Ada:
** Time, Load model: 2.64 seconds
** Time, Load tokenizer: 0.01 seconds
-- Groupsize (inferred): 128
-- Act-order (inferred): yes
** VRAM, Model: [cuda:0] 6,873.52 MB
-- Warmup pass 1...
** Time, Warmup: 1.60 seconds
-- Warmup pass 2...
** Time, Warmup: 0.59 seconds
-- Warmup pass 3...
** Time, Warmup: 0.58 seconds
-- Inference, first pass.
** Time, Inference: 0.59 seconds
** Speed: 3263.05 tokens/second
-- Generating 128 tokens, 1920 token prompt...
** Speed: 55.43 tokens/second
-- Generating 128 tokens, 4 token prompt...
** Speed: 64.42 tokens/second
** VRAM, Inference: [cuda:0] 1,772.67 MB
** VRAM, Total: [cuda:0] 8,646.19 MB
@pineking : It's NVIDIA Nsight Systems. It's free as in beer, and the EULA is what you'd expect from NVIDIA, but it is pretty awesome for debugging and profiling CUDA code. They also have Nsight Compute, which is more of a kernel profiler.
@pineking Yes, that commit. On https://github.com/turboderp/exllama/commit/ab81db1aa342214ccf41fe5485ae6a75cf570548 I had 80 t/s on 7B on EPYC 7302 and RTX 3090, on https://github.com/turboderp/exllama/commit/7805e2bc7763a60ee560554cde64a53f3b885889 I get 110 t/s.
Has anyone compared the inference speed of the 4-bit quantized model with the original FP16 model? Is it faster than the original FP16 model?