the inference speed of GPTQ 4bit quantized model

turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

MIT License

the inference speed of GPTQ 4bit quantized model #19

Closed pineking closed 1 year ago

pineking commented 1 year ago

Has anyone compared the inference speed of a 4-bit quantized model with the original FP16 model? Is it faster than the original FP16 model?

qeternity commented 1 year ago

We are seeing ca. 30% speedup at int4 vs. fp16 but nowhere near the benchmarks listed in the readme.

I notice that @turboderp is using a 12900K which has phenomenal single threaded performance. So it appears that we are still CPU bound for most people, which I find perplexing.

turboderp commented 1 year ago

@pineking: The inference speed at least theoretically is 3-4x faster than FP16 once you're bandwidth-limited, since all that ends up mattering is how fast your GPU can read through every parameter of the model once per token. De-quantizing the weights on the fly is cheap compared to the memory access and should pipeline just fine, with the CUDA cores easily doing all the required computation on one batch of weights while waiting for the next batch to load from VRAM.
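The bandwidth argument above can be turned into a rough back-of-envelope bound. The sketch below is only an illustration: the 7B parameter count and the 1008 GB/s VRAM bandwidth (the published figure for a 4090) are assumptions, and it ignores the KV cache, activations, and the per-group scales and zero points that GPTQ stores alongside the 4-bit weights, which is one reason the real-world ratio lands at "3-4x" rather than exactly 4x.

```python
def max_tokens_per_second(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on tokens/s if every weight is read from VRAM once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


N_PARAMS = 7e9          # assumption: a "7B" model
BANDWIDTH_GB_S = 1008   # assumption: published VRAM bandwidth of a 4090

fp16 = max_tokens_per_second(N_PARAMS, 16, BANDWIDTH_GB_S)
int4 = max_tokens_per_second(N_PARAMS, 4, BANDWIDTH_GB_S)

print(f"FP16 bound:  {fp16:.0f} t/s")
print(f"4-bit bound: {int4:.0f} t/s ({int4 / fp16:.1f}x FP16)")
```

These are ceilings, not predictions. Any CPU-side overhead, as discussed in the rest of this thread, pushes the measured numbers below them.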

I should be getting a 3090-Ti tomorrow that has somewhat slower (and fewer) cores than the 4090 but the same memory bandwidth, so I should be able to confirm that it performs about as well as the 4090. Already I've confirmed that a 3070-Ti with half the bandwidth of the 4090 also gets about half the performance.

@qeternity: It's really difficult to profile a CPU bottleneck that I can't see because my CPU is too fast. :) I might try underclocking it at some point, or see if I can't force it to only use E-cores somehow.

You just can't meaningfully measure the time it takes a PyTorch operation to complete since everything actually runs in the CUDA queue. It's also kind of unpredictable the way PyTorch manages resources under the hood. All the automation is very convenient, and the overhead becomes negligible when you're just offloading large batches of computation to a GPU (the typical scenario during training, which appears to be the intended use case for PyTorch), but when going token by token many of the operations end up being quite small, so the overhead per-operation from using a complex framework in an interpreted language becomes significant.

I try to work around that by bypassing PyTorch as much as I can. But this creates another problem of having to manually tune everything and that's a lot of work made much harder by the fact that I don't have 20 different hardware configurations to test on.

The approach I'm going for is to make it as fast as possible on my setup while keeping a catalog of all the methods I've experimented with. Then at some point (or gradually) I can add some of those as options that may work better on systems with slower CPUs, slower system RAM, older GPUs, power-limited GPUs, who knows. Keeping it all in the codebase makes it unmanageable, though. There are already a lot of permutations to validate, and the number doubles with every option added.

I'm still curious what you mean by "nowhere near", though. Exactly what speeds are you getting, and what hardware setup is producing those poor results? What specific models?

qeternity commented 1 year ago

Hi @turboderp - amazing work you've done! Didn't mean to imply otherwise. Thanks very much!

My observation re: CPU bottleneck is not directed at your work, it's something I see in most transformer implementations. And I haven't done any work to assess why this is the case, but your comments make sense to me.

With a 3090 Ti and an Epyc 7282 (very slow), we are getting ca. 34 t/s on 7b models and 26 t/s on 13b models.

turboderp commented 1 year ago

There must be something else going on, then. That's a little less than half the single-core performance of the 12900K, but you're seeing much less than half the performance. 3090-Ti should be comparable to the 4090 (I'll know for sure tomorrow when it arrives) so there has to be something else slowing it down.

Is anything else using the GPU at the same time? I've noticed that it's very sensitive to that, and even a tiny little bit of animation going on in some other window can have a big impact, presumably because it relies heavily on caching.

Come to think of it, that might be a reason for going back to using SMEM... I'll experiment some more. :)

qeternity commented 1 year ago

I've just tried it on a 3060 and I'm getting the same perf as on a 3090.

FWIW this is on Tensor Dock marketplace just for R&D purposes.

qeternity commented 1 year ago

Been playing around with this a bit more. I can get 45-47 t/s with a 7b model and 36-37 t/s with a 13b model on a 4090 using the latest drivers in an Ubuntu container w/ AMD 3990X.

turboderp commented 1 year ago

@qeternity Could you elaborate on the hardware and software setup? 3990X is quite slow single-threaded, but not that slow. I wouldn't expect less than half the performance of a 12900K. How is it containerized? What's the host system?

qeternity commented 1 year ago

Hypervisor hardware I can't be sure about, as this is a cloud GPU.

I would be more than happy to fund your access to GPU time if you're interested.

When I did some profiling last week, it seems that most CPU time is spent shuffling between CPU and GPU in .to() invocations.

turboderp commented 1 year ago

CPU profiling is a little tricky with this. I've run into the same thing when profiling, and it's caused by the fact that .to("cpu") is a synchronization point. PyTorch basically just waits in a busy loop for the CUDA stream to finish all pending operations before it can move the final GPU tensor across, and then the actual .to() operation takes like a microsecond or whatever. If you insert torch.cuda.synchronize() right before logits = _move_tensor ... at the end of model.py the profiler should show significantly less time spent in .to().

There is one other place where a bit of data is moved across, because the embedding table resides in system RAM. You can move it to VRAM as a test by replacing "cpu" with "cuda:0" on lines 615, 752 and 953, but in my testing it makes no difference at all, other than it eats up some VRAM. Even on my 3090 which is on a PCIe 3.0 x4 connection I can't measure any difference in performance between copying just the input IDs to VRAM or copying the initial hidden state.

Then again, it could be that this virtualized environment just has really, really poor bandwidth between system RAM and VRAM. I guess you would measure it (crudely) with a script like this:

import torch
import time

count = 100
tensor = torch.rand((1024, 1024), dtype = torch.float, device = "cpu")
tensor_size = tensor.numel() * tensor.element_size()

for i in range(10):  # Warmup
    tensor = tensor.to("cuda:0")
    tensor = tensor.to("cpu")

start = time.time()

for i in range(count):
    tensor = tensor.to("cuda:0")
    tensor = tensor.to("cpu")

end = time.time()
duration = end - start

bandwidth = 2 * count * tensor_size / duration / 1024**3
print(f"Bandwidth {bandwidth:.4f} GB/s")

I'm getting 4-8 GB/s on the 4090 (PCIe 4.0 x16) and 2-3 GB/s for the 3090 (PCIe 3.0 x4). It oddly depends a lot on the shape of the tensor (!) and is far from the theoretical bus speed, but given that the data copied is on the order of 100 kB per token, I really doubt it's a bottleneck in any case.
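To put the "100 kB per token" figure in perspective, here is a rough sketch of how much of the per-token time budget the copy would take. The 100 kB and the 2 GB/s worst case come from the comment above; the 100 t/s target is an assumption picked purely for illustration, and the helper name is hypothetical.

```python
BYTES_PER_TOKEN = 100_000   # "on the order of 100 kB per token"
TARGET_TPS = 100            # assumption: illustrative generation speed


def transfer_fraction(bytes_per_token, bandwidth_gb_s, tokens_per_second):
    """Fraction of the per-token time budget spent on the host/device copy."""
    transfer_s = bytes_per_token / (bandwidth_gb_s * 1024**3)
    token_budget_s = 1.0 / tokens_per_second
    return transfer_s / token_budget_s


for label, bandwidth in [("3090, PCIe 3.0 x4", 2.0), ("4090, PCIe 4.0 x16", 4.0)]:
    frac = transfer_fraction(BYTES_PER_TOKEN, bandwidth, TARGET_TPS)
    print(f"{label}: {frac * 100:.2f}% of the token budget")
```

Even at the slowest measured bandwidth the copy stays well under 1% of the budget. This only accounts for throughput, though, not for fixed per-transfer latency.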

I don't really have time to get into optimizing on a cloud instance right now. But I'm rewriting the CUDA backend at the moment, with a bunch of switchable options for CUDA kernels etc. and more code moved to C++ where the performance is more predictable and it will be easier to profile. So there are improvements coming, don't worry. And I will find that CPU bottleneck if it kills me. Because really the CPU shouldn't matter at all here.

qeternity commented 1 year ago

Many thanks for all your hard work. Running the above, I am actually getting 9-9.5 GB/s on a 3090 (PCIe 4.0 x16).

Here is the cProfile output for the benchmark segments in test_benchmark_inference.py:

You can see model.py:1069(forward) at 34ms per call which is where our ca. 30 t/s comes from (it's about 15% slower with cProfile enabled).

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    57344    0.863    0.000    0.863    0.000 {built-in method exllama_ext.q4v2_matmul}
    57568    0.849    0.000    0.849    0.000 {built-in method exllama_ext.column_remap}
   288354    0.837    0.000    0.837    0.000 {method 'view' of 'torch._C._TensorBase' objects}
    16384    0.571    0.000    0.571    0.000 {built-in method torch.matmul}
    57344    0.554    0.000    3.570    0.000 cuda_ext.py:70(_matmul_q4v2_matmul)
     8224    0.512    0.000    4.612    0.001 model.py:440(forward)
     8224    0.436    0.000    8.064    0.001 model.py:525(forward)
    16705    0.419    0.000    0.419    0.000 {built-in method exllama_ext.rms_norm}
    74273    0.381    0.000    0.381    0.000 {built-in method torch.empty_like}
     8707    0.365    0.000    0.365    0.000 {method 'to' of 'torch._C._TensorBase' objects}
    57792    0.365    0.000    0.365    0.000 {built-in method torch.empty}
     8224    0.232    0.000    2.257    0.000 model.py:379(forward)
    57568    0.209    0.000    3.834    0.000 cuda_ext.py:164(matmul_q4v2)
    16448    0.203    0.000    0.203    0.000 {method 'copy_' of 'torch._C._TensorBase' objects}
    57568    0.199    0.000    0.199    0.000 model.py:135(_matmul_switch)
    16448    0.186    0.000    0.186    0.000 {built-in method exllama_ext.rope}
    32896    0.164    0.000    0.164    0.000 {method 'narrow' of 'torch._C._TensorBase' objects}
    57568    0.152    0.000    4.257    0.000 model.py:301(forward)
   108711    0.145    0.000    0.145    0.000 module.py:1617(__getattr__)
    41088    0.134    0.000    0.134    0.000 {method 'transpose' of 'torch._C._TensorBase' objects}
     8224    0.125    0.000    0.125    0.000 {built-in method torch._C._nn.silu}
     8192    0.110    0.000    0.110    0.000 {method 'softmax' of 'torch._C._TensorBase' objects}
     9252    0.065    0.000    0.419    0.000 model.py:848(_move_tensor)
      257    0.059    0.000    8.617    0.034 model.py:1069(forward)
    16705    0.054    0.000    0.670    0.000 cuda_ext.py:241(llama_rms_norm)
    57568    0.051    0.000    0.051    0.000 model.py:246(quant_args)
    16705    0.038    0.000    0.708    0.000 model.py:410(forward)
      224    0.035    0.000    0.035    0.000 {built-in method exllama_ext.half_matmul_cublas}
     8224    0.032    0.000    0.032    0.000 model.py:151(_attn_switch)
    16448    0.030    0.000    0.217    0.000 cuda_ext.py:180(rope_)
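The "34ms per call" and "ca. 30 t/s" figures follow directly from the model.py:1069(forward) row. A small sketch of the arithmetic, treating "about 15% slower with cProfile enabled" as a 1.15x factor on wall time (an interpretation, not something measured here), and assuming every one of the 257 calls is a single-token pass:

```python
cumtime_s = 8.617          # cumtime of model.py:1069(forward) in the table
ncalls = 257               # ncalls of the same row
profiler_slowdown = 1.15   # assumption: "about 15% slower" means 1.15x wall time

per_call_ms = cumtime_s / ncalls * 1000
profiled_tps = 1000 / per_call_ms
estimated_tps = profiled_tps * profiler_slowdown

print(f"{per_call_ms:.1f} ms per forward call")
print(f"{profiled_tps:.1f} t/s under cProfile")
print(f"{estimated_tps:.1f} t/s estimated without the profiler")
```

The estimate without the profiler comes out close to the 34 t/s reported earlier in the thread for a 7b model, which suggests the profile is representative of the normal run.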

qeternity commented 1 year ago

And with pytorch profiler:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                       cudaLaunchKernel        24.80%        1.401s        24.80%        1.401s       5.634us        248583
                                    aten::empty_strided         9.36%     528.518ms         9.36%     528.518ms       7.043us         75046
                                             aten::view         7.24%     408.761ms         7.24%     408.761ms       1.302us        314022
                                            aten::empty         7.04%     397.295ms         7.04%     397.295ms       6.810us         58340
                                              aten::bmm         6.96%     393.086ms         8.89%     501.867ms      30.512us         16448
                                        cudaMemcpyAsync         5.96%     336.628ms         5.96%     336.628ms     652.380us           516
                                     aten::index_select         4.49%     253.716ms         4.52%     255.220ms     993.074us           257
                                              aten::add         4.09%     230.845ms         5.71%     322.570ms      19.612us         16448
                                            aten::copy_         3.38%     190.694ms        11.37%     642.158ms      37.214us         17256
                                           aten::matmul         3.27%     184.932ms        16.62%     938.609ms      56.187us         16705
                                       aten::empty_like         2.30%     129.719ms        11.37%     642.358ms       8.645us         74306
                                             aten::silu         2.23%     126.020ms         3.06%     172.930ms      21.027us          8224
                                        aten::transpose         2.17%     122.690ms         2.37%     133.787ms       3.233us         41377
                                         aten::_softmax         1.96%     110.789ms         2.63%     148.583ms      18.067us          8224
                                             aten::div_         1.84%     103.709ms         2.64%     148.949ms      18.182us          8192
                                            aten::slice         1.74%      98.496ms         1.94%     109.443ms       3.226us         33930
                                             aten::mul_         1.62%      91.227ms         2.40%     135.656ms      16.495us          8224
                                        cudaMemsetAsync         1.61%      90.957ms         1.61%      90.957ms       5.393us         16865
                                           aten::expand         1.59%      89.582ms         1.68%      95.005ms       2.888us         32898
                                          aten::reshape         1.58%      89.280ms         2.58%     145.878ms       3.482us         41890
                                           aten::narrow         1.16%      65.336ms         2.95%     166.452ms       5.060us         32896
                                             aten::triu         0.59%      33.415ms         0.59%      33.415ms      33.415ms             1
                                          aten::softmax         0.54%      30.332ms         3.16%     178.646ms      21.723us          8224
                                       aten::as_strided         0.49%      27.802ms         0.49%      27.802ms       0.253us        109999
                                   aten::_reshape_alias         0.42%      23.750ms         0.42%      23.750ms       1.444us         16448
                                     aten::_unsafe_view         0.37%      20.747ms         0.37%      20.747ms       1.240us         16737
                                           aten::argmax         0.33%      18.683ms         0.34%      19.194ms      74.977us           256
                                               aten::mm         0.15%       8.630ms         0.22%      12.618ms      49.097us           257
                                            aten::fill_         0.11%       6.142ms         0.16%       8.971ms      34.771us           258
                                         aten::_to_copy         0.09%       5.160ms         6.59%     372.330ms     481.669us           773
cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFla...         0.06%       3.610ms         0.06%       3.610ms       1.239us          2913
                                           aten::select         0.06%       3.326ms         0.06%       3.628ms       3.536us          1026
                                  cudaStreamSynchronize         0.06%       3.200ms         0.06%       3.200ms       6.202us           516
                                              aten::mul         0.05%       2.599ms         0.06%       3.297ms      51.516us            64
                                        aten::embedding         0.04%       2.444ms         4.57%     258.178ms       1.005ms           257
                                               aten::to         0.04%       2.005ms         6.62%     373.608ms      21.776us         17157
                                            aten::zeros         0.03%       1.656ms         0.23%      12.807ms      49.833us           257
                                        aten::unsqueeze         0.03%       1.504ms         0.03%       1.508ms       2.945us           512
                                   cudaFuncSetAttribute         0.03%       1.497ms         0.03%       1.497ms       5.180us           289
                                           aten::linear         0.03%       1.472ms         0.32%      18.246ms      70.996us           257
               aten::_scaled_dot_product_attention_math         0.02%       1.333ms         0.30%      17.145ms     535.781us            32
          cudaOccupancyMaxActiveBlocksPerMultiprocessor         0.02%       1.221ms         0.02%       1.221ms       6.359us           192
                                            aten::zero_         0.02%       1.018ms         0.12%       6.699ms      26.066us           257
                                                aten::t         0.01%     834.000us         0.03%       1.495ms       5.817us           257
                                            aten::clone         0.01%     760.000us         0.04%       2.407ms      75.219us            32
                                             aten::add_         0.01%     714.000us         0.02%       1.000ms      31.250us            32
                     aten::scaled_dot_product_attention         0.01%     607.000us         0.31%      17.752ms     554.750us            32
                                             aten::full         0.00%       9.000us         0.06%       3.310ms       3.310ms             1
                                        aten::expand_as         0.00%       5.000us         0.00%      15.000us      15.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 5.647s

qeternity commented 1 year ago

Another data point: Lambda Labs A100 40GB using AMD EPYC 7J13 does 40 t/s on 7b

Interestingly, the GPU util is still hovering around 50%, which is higher than I would have expected, and suggests a GPU-bound ceiling of around 80 t/s. FWIW I have been testing with TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g

KaruroChori commented 1 year ago

As a point of reference for the benchmark proposed earlier:

3090 + 5950x on pcie 4x16: 12.3 GB/s avg
P40 + 2650Lv4 on pcie ?x16: 6.2 GB/s avg
T600 + 2650Lv4 on pcie ?x4: 2.6 GB/s avg

I am not sure about the generation of pcie on my second system because it should be 3.0, but that joke of a motherboard has the worst wiring possible for all lanes.

qeternity commented 1 year ago

7b 4bit on a 4090 + 3900x is now up to 75 t/s! GPU usage hovering around 50% so this could genuinely just be down to 12900K single threaded performance.

EDIT: 13b 4bit is doing 62 t/s with GPU ca. 70%.
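The utilization figures quoted in this thread can be extrapolated linearly to a rough GPU-bound ceiling. This is only a sketch: it assumes throughput scales in proportion to reported utilization, which is crude, since a utilization counter of the kind nvidia-smi reports only says that some kernel was running, not how efficiently the GPU was used. The helper name is hypothetical.

```python
def extrapolated_ceiling(tokens_per_second, gpu_utilization):
    """Naive GPU-bound estimate: measured speed scaled up to 100% utilization."""
    return tokens_per_second / gpu_utilization


# (setup, measured t/s, reported GPU utilization), all taken from this thread
data_points = [
    ("A100 40GB, 7b", 40, 0.50),
    ("4090 + 3900x, 7b", 75, 0.50),
    ("4090 + 3900x, 13b", 62, 0.70),
]

for setup, tps, util in data_points:
    print(f"{setup}: ~{extrapolated_ceiling(tps, util):.0f} t/s if GPU-bound")
```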

turboderp commented 1 year ago

@qeternity : It seems likely that it's still CPU-bound somewhere. You're getting a little under half my performance on 7B, then 2/3 on 13B which would spend longer in each CUDA kernel. And with higher GPU utilization too, there's no other way to interpret that.

As for what to do about it, though... Well. So far relying less on PyTorch has been working out, so I can keep doing that. Also I think the overhead from kernel launches is starting to become a bottleneck, so I'm looking into tail launching with CUDA graphs, fusing kernels where it makes sense, and batching operations like cudaMemset (of all things). Multiple streams might also help in some places. There's a way forward at least for addressing the CPU bottleneck.

Python aside, a function like cudaLaunchKernel would be synchronous, so those 1.4s total would actually be spent just setting up execution contexts, I think. Anything to bring the number of kernel launches down is likely going to make a difference. And VRAM allocation looks like the second-biggest problem after that.
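To get a feel for how much the launches alone cost per token, the cudaLaunchKernel row of the PyTorch profiler table can be divided by the number of forward passes. The sketch assumes one aten::embedding call per forward pass (257 in the table). Timings taken under a profiler are inflated, so treat the result as an order-of-magnitude estimate.

```python
launches = 248_583       # "# of Calls" for cudaLaunchKernel
launch_cpu_s = 1.401     # "Self CPU" for cudaLaunchKernel
forward_passes = 257     # assumption: one aten::embedding call per forward pass

launches_per_pass = launches / forward_passes
launch_ms_per_pass = launch_cpu_s / forward_passes * 1000
launch_bound_tps = 1000 / launch_ms_per_pass

print(f"{launches_per_pass:.0f} kernel launches per forward pass")
print(f"{launch_ms_per_pass:.2f} ms of launch overhead per pass")
print(f"~{launch_bound_tps:.0f} t/s ceiling from launch overhead alone")
```

With nearly a thousand launches per token, halving the launch count (by fusing kernels, for example) would roughly double that ceiling, which is why reducing the number of launches looks worthwhile.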

qeternity commented 1 year ago

@turboderp I think these results are fantastic fwiw. Aside from binning off PyTorch entirely, or being able to JIT the Torch graph (which isn't currently performant due to the custom CUDA kernels afaict) there's probably not a whole lot of juice left to squeeze.

Edited to add: I say this because being CPU bound on a 4090 I am able to hit 50% GPU util which happens to be ca. 50% of your 4090 results. I suspect that your 4090 utilization is at or near 100% thanks to the 12900K.

turboderp commented 1 year ago

I'm pretty sure there's quite a bit more to squeeze, though. Thing is, you shouldn't be CPU bound for this, because the CPU isn't doing anything during inference. It's pretty much just launching kernels one after another. At 5 us per launch it adds up, apparently, but fusing kernels or tail-launching graphs could reduce it by a significant amount. Also I'm getting bottlenecks in the stupidest places. Like, it also takes a (implicit) kernel launch to write a single float value of zero to global memory... just to initialize the accumulator for the norm of a single row. Initializing a cache of 10,000 zeroes would take the same time, so that's a straightforward optimization.

And while I'm getting close to 100% GPU utilization, that doesn't mean it's using the GPU optimally. One thing I haven't really gotten to yet is optimizing for the L1 cache and SMEM. Even L2 cache is worth looking at, since some of the matrices in 65B are about four times as big as the L2 cache on the 4090. Then of course there's the fact that generating text produces a very small hidden state (16 kB for the 65B model), which means there isn't that much to synchronize between GPUs if you attempted to use both of them at once. At least for matmuls. Self-attention would be more difficult to split, but still, something like 50-75% speedup sounds realistic for two identical GPUs.
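The 16 kB hidden-state figure is easy to check. The sketch assumes a hidden dimension of 8192 for the 65B model and FP16 activations (2 bytes per element), with a single token in flight. The 2 GB/s link speed is only an illustrative assumption, borrowed from the slowest host-to-device figure measured earlier in the thread.

```python
HIDDEN_DIM_65B = 8192     # assumption: hidden size of the 65B model
BYTES_PER_FP16 = 2        # assumption: activations kept in FP16
LINK_GB_S = 2.0           # assumption: illustrative slow link between devices

hidden_state_bytes = HIDDEN_DIM_65B * BYTES_PER_FP16
transfer_us = hidden_state_bytes / (LINK_GB_S * 1024**3) * 1e6

print(f"hidden state: {hidden_state_bytes / 1024:.0f} kB per token")
print(f"copy time at {LINK_GB_S} GB/s: {transfer_us:.1f} us")
```

At this size, fixed transfer latency rather than bandwidth is likely to dominate the cost of any synchronization between GPUs.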

0cc4m commented 1 year ago

@turboderp Do you see more or less equal speeds on 4090 and 3090 ti with a 7B? The table on the README shows up to 160t/s.

I tested a 3090 on an EPYC 7302 system and got to 80 t/s. I'm wondering if the CPU is the bottleneck or the GPU.

turboderp commented 1 year ago

It's not quite equal yet. There might be some optimization to do on the kernels, but I am getting 127 t/s on the 3090. The EPYC is very slow, though, less than half the single-threaded performance of the 12900K, so that's probably what you're running into.

Despite the fact that the CPU "isn't doing anything" during inference, Python is still really slow, and then Torch's underlying C++ libraries add a little overhead as well. You can see what's happening in this trace:

[Screenshot: Screenshot_20230609_213846]

The "CUDA API" row shows kernel launches. The first 8 launches (from "rmsnorm...") launched one after another in the C++ extension. The launches after that are from PyTorch, and the difference is pretty obvious. It's not that there's any processing happening, it's just Python being slow compared to compiled C++ code.

Here is a complete forward pass for a single token:

[Screenshot: Screenshot_20230609_214723]

During the big red "cudaMemcpyAsync" at the end of the pass the CPU is just waiting in a busy-loop for the CUDA queue to finish so the logits are ready to copy to system RAM. It should ideally be much longer, though. The fact that it's only some 30% of the whole forward pass means that if my CPU were 30% slower, the CPU would become the bottleneck.

That was yesterday, though. Here's where I'm at with the latest version that I pushed a few minutes ago:

[Screenshot: Screenshot_20230609_215225]

The CPU finishes queuing up all the CUDA operations much faster. There's a lot more that could be done, but it involves some headache-inducing strided matmuls that I'm not keen on tackling right now. After that there's also graphs, which can cut the kernel launch overhead to about a third, but I'm hoping this is fast enough for now so I can focus on sampling or something.
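The headroom argument in the comment above (the CPU waiting for only about 30% of the forward pass) can be written out as a small calculation. This is a simplified model: it assumes the GPU is busy for the whole pass and that CPU-side queuing time scales inversely with single-threaded CPU speed. The function name is hypothetical.

```python
def cpu_slowdown_headroom(wait_fraction):
    """Factor by which CPU-side work can grow before it fills the whole pass."""
    busy_fraction = 1.0 - wait_fraction
    return 1.0 / busy_fraction


headroom = cpu_slowdown_headroom(0.30)   # CPU idle for ~30% of the pass
min_relative_cpu_speed = 1.0 / headroom

print(f"CPU-side work can take up to {headroom:.2f}x longer")
print(f"i.e. a CPU at {min_relative_cpu_speed:.0%} of this speed becomes the bottleneck")
```

Under this model, a CPU with half the single-threaded performance would already be well past the break-even point, which fits the slower results reported on the EPYC systems.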

0cc4m commented 1 year ago

Yeah, I tested your latest version and it got me from 80 t/s to 110 t/s on 7B, great work!

pineking commented 1 year ago

> Yeah, I tested your latest version and it got me from 80 t/s to 110 t/s on 7B, great work!

Which commit did you test? Is it this commit: https://github.com/turboderp/exllama/commit/7805e2bc7763a60ee560554cde64a53f3b885889 ?

pineking commented 1 year ago

What's the name of the tool in the screenshot?

pineking commented 1 year ago

Tested with https://github.com/turboderp/exllama/commit/7805e2bc7763a60ee560554cde64a53f3b885889 and llama13b-4bit-128g.safetensors using test_benchmark_inference.py on an A6000 Ada:

 ** Time, Load model: 2.64 seconds
 ** Time, Load tokenizer: 0.01 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): yes
 ** VRAM, Model: [cuda:0] 6,873.52 MB
 -- Warmup pass 1...
 ** Time, Warmup: 1.60 seconds
 -- Warmup pass 2...
 ** Time, Warmup: 0.59 seconds
 -- Warmup pass 3...
 ** Time, Warmup: 0.58 seconds
 -- Inference, first pass.
 ** Time, Inference: 0.59 seconds
 ** Speed: 3263.05 tokens/second
 -- Generating 128 tokens, 1920 token prompt...
 ** Speed: 55.43 tokens/second
 -- Generating 128 tokens, 4 token prompt...
 ** Speed: 64.42 tokens/second
 ** VRAM, Inference: [cuda:0] 1,772.67 MB
 ** VRAM, Total: [cuda:0] 8,646.19 MB

turboderp commented 1 year ago

@pineking : It's NVIDIA Nsight Systems. It's free as in beer, and the EULA is what you'd expect from NVIDIA, but it is pretty awesome for debugging and profiling CUDA code. They also have Nsight Compute, which is more of a kernel profiler.

0cc4m commented 1 year ago

@pineking Yes, that commit. On https://github.com/turboderp/exllama/commit/ab81db1aa342214ccf41fe5485ae6a75cf570548 I had 80 t/s on 7B on EPYC 7302 and RTX 3090, on https://github.com/turboderp/exllama/commit/7805e2bc7763a60ee560554cde64a53f3b885889 I get 110 t/s.