turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Very poor performance when VRAM is nearly full (inconsistent) #153

Closed: QM60 closed this issue 5 months ago

QM60 commented 10 months ago

Just recording something I'm noticing in practical use of exllamav2. It seems like the performance sometimes becomes extremely bad when VRAM is almost at capacity. Tokens/sec dips from 25-30 down to anywhere from 1 to 6. It's not consistent; sometimes it runs at full speed, sometimes not. I can't find any way to predict whether it'll run well. Performance doesn't appear to vary based on the amount of VRAM in use by the DE, browsers, etc. It also never OOMs, only slows down to a crawl.

Environment: Windows 10, 4090, driver version 31.0.15.4601 (with sysmem fallback disabled). GPU also used for display. The same thing happens with the older drivers from April that didn't have sysmem fallback, so I don't think offloading is the cause, but it could be a driver issue.

Interestingly, the original exllama never showed this behavior; if I slowly increase the context, it switches from running perfectly to OOM with nothing in between. Transformers, on the other hand, seems to be even more sensitive to low-memory conditions than EXL2. This makes me think this could be caused by exllamav2 doing a lot more dynamic VRAM allocation and running into fragmentation/GC issues.

While I'm sure whatever exl2 is doing can't be nearly as bad as how past_key_values is handled in transformers, there could be something similar happening. I figured this is worth noting, since the success of 2.4-2.55bpw quants of 70B means there will be a lot of people pushing their VRAM close to the limit.
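To illustrate the kind of pattern I mean (a simplified sketch, not exllamav2's actual code): a cache that grows by concatenation asks the allocator for a new, slightly larger block every step, while a preallocated cache only ever writes into one fixed buffer.

import torch

# Simplified sketch (not exllamav2's actual code): two ways of extending a KV cache.
device = "cuda" if torch.cuda.is_available() else "cpu"
batch, heads, head_dim, max_len = 1, 32, 128, 4096
new_k = torch.randn(batch, heads, 1, head_dim, device=device, dtype=torch.half)

# Pattern A: grow by concatenation (transformers-style past_key_values).
# Every step allocates a new, slightly larger tensor and frees the old one,
# which churns the allocator when VRAM is nearly full.
cache_a = torch.empty(batch, heads, 0, head_dim, device=device, dtype=torch.half)
for _ in range(16):
    cache_a = torch.cat([cache_a, new_k], dim=2)

# Pattern B: preallocate once and write in place (static cache).
# No per-step allocations, so fragmentation can't build up here.
cache_b = torch.empty(batch, heads, max_len, head_dim, device=device, dtype=torch.half)
for pos in range(16):
    cache_b[:, :, pos:pos + 1, :] = new_k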

turboderp commented 10 months ago

Are you sure the sysmem fallback is fully disabled? It's a new option they added and I'm not sure exactly how it's implemented. There's no reason why it wouldn't OoM if it's actually running out of memory, so it sounds like it's still trying to do something.

For reference I'm currently running a 120B model on 2x24GB GPUs, which doesn't leave a lot of free VRAM:

[screenshot: VRAM usage across both GPUs, with the driver version visible]

I can even push it a little further than that with no slowdown, and as soon as it stops working it will OoM. Note the driver version, though. Is there a chance you could try 530.x to see if you still get the slowdown?
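If you want to rule out silent spillover, it might help to log what the Torch allocator thinks it's holding versus what the driver reports as free, right before a slow generation. A minimal sketch (assuming a single GPU at index 0):

import torch

def vram_report(device=0):
    # Driver-level view of the card
    free, total = torch.cuda.mem_get_info(device)
    # Allocator-level view: tensors currently held vs. blocks cached by Torch
    allocated = torch.cuda.memory_allocated(device)
    reserved = torch.cuda.memory_reserved(device)
    gib = 2 ** 30
    print(f"driver free:     {free / gib:6.2f} / {total / gib:6.2f} GiB")
    print(f"torch allocated: {allocated / gib:6.2f} GiB")
    print(f"torch reserved:  {reserved / gib:6.2f} GiB")

vram_report()

If the driver keeps reporting a little free memory while generation crawls along without ever OoMing, that would point at something outside the Torch allocator absorbing the overflow.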

QM60 commented 10 months ago

I'm having trouble finding 530.x now (nvidia only goes back to 532) but I've definitely seen this happen on the older 530 drivers; I was stuck on them for a long time until they finally added the option to disable fallback. Back then, this was especially noticeable for llama1-33b models; I'd experimented with making exl2 quants with them in the hope of unlocking more context, but this weird slowdown got in the way.

In the hope of nailing down the issue, I just tried experimenting with some old GPTQ quants of llama1 33b. EXL1 tops out at 3.7K context and OOMs at 3.8K. EXL2 tops out at 3900 and OOMs at 3950. But I never see any slowdown on either one, no matter how closely I approach the limit: either it runs perfectly or it OOMs. So I don't think it's a spillover issue with the driver.

Somehow, exl2 quants (such as 2.4-2.55bpw) suffer slowdown and "soft failure". But as I said, it's not consistent - sometimes I can run 2.55bpw with 6k context and get fast speeds, other times I get chugging at 1k context with 2.4bpw. I wonder if the EXL2 code path is different enough from the GPTQ one to explain this, or if I'm just getting very lucky with GPTQ so far.

Update: I tried to get an apples-to-apples comparison with a 33B model: one GPTQ quant (128g) and one in EXL2 format (4bpw), testing various context sizes to try to induce the slowdown.

EXL2 model (4bpw): 4400 tokens is okay, 4500 hits slowdown (sometimes), 4600 OOMs.

GPTQ model: 3938 tokens is okay, 3939 tokens OOMs. Literally, I had to bisect to find the magic number. It runs like an absolute champ at 1 token below the OOM threshold. I couldn't get it to chug at all.
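(The bisection is easy to script if anyone wants to repeat this; a rough sketch, where try_context() is a hypothetical helper that loads the model, runs one generation at the given context length, and returns False on a CUDA OOM:)

def find_max_context(try_context, lo=1024, hi=8192):
    # Binary search for the largest context length that doesn't OOM.
    # try_context(n) is a hypothetical helper, not a real exllamav2 API.
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if try_context(mid):
            best = mid      # mid fits, try larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid OOMs, try smaller
    return best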

Conclusion: There is definitely something about the exl2 code path which is different. Maybe something that's dynamic in EXL2 but a static buffer for GPTQ? I don't know. But GPTQ is impressively rock solid; it needs no headroom whatsoever to work. EXL2 gets weird.
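If anyone wants to poke at the "something dynamic" theory, Torch's allocator counters can be sampled around a generation step; a sketch of what I'd compare between the EXL2 and GPTQ runs (the generation call itself is a placeholder):

import torch

def allocator_snapshot(device=0):
    # Counters from Torch's caching allocator; compare before/after a step.
    stats = torch.cuda.memory_stats(device)
    return {
        "allocations": stats["allocation.all.allocated"],
        "alloc_retries": stats["num_alloc_retries"],
        "reserved_bytes": stats["reserved_bytes.all.current"],
    }

before = allocator_snapshot()
# ... run one generation step here (placeholder) ...
after = allocator_snapshot()
print({key: after[key] - before[key] for key in before})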

BarfingLemurs commented 10 months ago

@QM60 Have you fully uninstalled and reinstalled the drivers? nvidia recommends this. I've seen the problem you're describing on Windows: I switched from 535 to 528 and the swapping was still there, but even worse. With a 2.3bpw model on a 3090, the context I could reach went from 1.5k down to 1k before speeds dropped to 5 t/s.

Maybe globally disabling the memory fallback doesn't account for everything, or the memory fallback was present pre-531 on certain cards. But none of this exists with the Linux drivers; there it still works at full speed up until the last token, still 15 t/s.

QM60 commented 10 months ago

@BarfingLemurs Did switching to 528.x ever actually work for you? Curious if you can confirm a fix on Windows or not. One thing that makes me skeptical about the sysmem hypothesis is that I can't trigger the slowdown with GPTQ models. GPTQ models on Windows via exllamav2 work perfectly as you described: "full speeds up until the last token, still 15 t/s." Only exl2 quants have the slowdown. And even with exl2 quants, I never see visible offloading - they still OOM immediately once space runs out.

It's possible some nvidia driver bug gets triggered only by the exl2 code path. But it's also possible that it's a memory fragmentation issue, and Linux is simply better at handling that. One point in favor of that is that transformers had the same slowdown when running low on vram, long before nvidia even thought up that awful offloading idea.
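If it is fragmentation, one cheap experiment (not a known fix) would be to change the Torch allocator's behavior before the model loads and see whether the slowdown moves:

import os

# Experiment, not a known fix: tweak Torch's caching allocator and see whether
# the near-capacity slowdown changes. The env var must be set before CUDA is
# initialized, so set it before importing torch / loading the model.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# On Torch 2.1+ an alternative is:
# os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the env var so the setting takes effect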

BarfingLemurs commented 10 months ago

Did switching to 528.x ever actually work for you?

@QM60 It got worse, but I didn't test fully uninstalling and reinstalling drivers.

I would be curious to know how far back the memory offloading goes; has it been in driver versions from years ago?

BTW, maybe add "Windows" to the title, since these shenanigans haven't been reported on the Linux desktop and server drivers.

Maybe different GPUs got this update at different driver versions? It seems like not even disabling sysmem fallback fixes the issue, if it behaves the way you describe.

turboderp commented 10 months ago

Is it possible there's something else running in the background which changes its behavior when VRAM starts running low? Context switching is very expensive in CUDA, so inference is sensitive to other processes stealing the attention of the GPU. No idea what it would be, but conceivably some other process relies on repeatedly allocating a temporary VRAM buffer and, failing that, falls back on some other code path that launches more, smaller CUDA operations?
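One way to check would be to log per-process VRAM while a slow generation is running; a rough sketch with pynvml (assuming the affected card is GPU 0; on Windows/WDDM the per-process memory numbers may come back as None, but the set of active PIDs is still informative):

import time
import pynvml

# Rough sketch: poll per-process VRAM usage during a slow generation to see
# whether some other process is allocating at the same time.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(30):  # ~30 seconds of one-second samples
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print({p.pid: p.usedGpuMemory for p in procs})
    time.sleep(1)

pynvml.nvmlShutdown()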

As for this not happening with GPTQ models, it's hard to imagine what would be so different. The kernels are different, but there are no tensor allocations in the EXL2 code path that aren't also in the GPTQ path. The hidden state is the same shape, attention works the same, and as far as inference is concerned, the branch point is all the way down at the kernel launch:

void gemm_half_q_half_cuda_part
(
    ....
)
{
    if (!b->is_gptq)
    {
        // EXL2 path: same launch geometry as the GPTQ branch, EXL2 kernel variant
        dim3 blockDim, gridDim;
        blockDim.x = BLOCK_KN_SIZE;
        blockDim.y = 1;
        blockDim.z = 1;
        gridDim.x = DIVIDE(size_n, BLOCK_KN_SIZE * 4);
        gridDim.y = DIVIDE(size_m, m_count);
        gridDim.z = DIVIDE(size_k, BLOCK_KN_SIZE);

        fp_gemm_half_q_half_kernel kernel = pick_gemm_half_q_half_kernel(true, m_count);

        kernel<<<gridDim, blockDim>>>
        (
            ....
        );
    }
    else
    {
        // GPTQ path: identical launch geometry, GPTQ kernel variant
        dim3 blockDim, gridDim;
        blockDim.x = BLOCK_KN_SIZE;
        blockDim.y = 1;
        blockDim.z = 1;
        gridDim.x = DIVIDE(size_n, BLOCK_KN_SIZE * 4);
        gridDim.y = DIVIDE(size_m, m_count);
        gridDim.z = DIVIDE(size_k, BLOCK_KN_SIZE);

        fp_gemm_half_q_half_gptq_kernel kernel = pick_gemm_half_q_half_gptq_kernel(true, m_count);

        kernel<<<gridDim, blockDim>>>
        (
            ....
        );
    }
}

Of course EXL2 has a few extra tensors per linear layer, but they're static and neither Torch nor CUDA does any kind of defragmentation or anything. So if there is some kind of memory shenanigans going on, it must be at the driver level, one way or another.

Would it be possible for someone experiencing this to profile it in Nsight Systems perhaps? That would give some idea what part of the pipeline is stalling, at least.
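Even something as simple as wrapping each token in an NVTX range before running under nsys would help, since the timeline would then show exactly which steps stall. A sketch, where model_forward() is a placeholder for one decoding step:

import torch

# Sketch: NVTX ranges make per-token work easy to find in the Nsight Systems
# timeline. Run with:  nsys profile -o exl2_trace python this_script.py
def profiled_generation(model_forward, num_tokens=64):
    for i in range(num_tokens):
        torch.cuda.nvtx.range_push(f"token_{i}")
        model_forward()  # placeholder for one decoding step
        torch.cuda.nvtx.range_pop()
    torch.cuda.synchronize()  # flush outstanding work onto the timeline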

QM60 commented 10 months ago

I tried profiling the issue. I ran 3 generations for each, trying to use the max context I can get away with before OOM. GPTQ is on top and takes ~10 seconds per gen (Airoboros 2.1, 3938 ctx). EXL2 4bpw (same model, 4550 ctx) is on the bottom and takes ~80 seconds per gen.

If I'm reading things right, the runtime is dominated by gemm_half_q_half_kernel for exl2, while the _gptq version seems to be much faster.

[screenshot: Nsight Systems timeline comparing the GPTQ and EXL2 runs]

turboderp commented 9 months ago

That definitely narrows it down to the driver or the CUDA runtime somehow behaving differently. gemm_half_q_half_kernel would be where all the VRAM loads happen, so this would be consistent with the driver or runtime having to swap memory around, assuming the time spent is attributed to whatever kernel triggers the swapping. At least, if the problem were with Torch or VRAM fragmentation or some such, no extra time should be spent in the kernel itself, since it doesn't deal with any of that.

I guess the next step would be to zoom in and see if some launches of gemm_half_q_half_kernel are taking substantially longer than others or if it's an overall slowdown. Then some repeated tests to see if the same launches are affected each time, and/or kernel profiling in Nsight Compute to dig down deeper.

It's hard, though, when I can't replicate it here.
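If someone who can reproduce it wants to try, even coarse per-step timing with CUDA events would show whether the slowdown is uniform or concentrated in a few steps. A sketch, where step() is a placeholder for one token of inference:

import torch

def time_steps(step, num_tokens=64):
    # Sketch: time each decoding step with CUDA events.
    times_ms = []
    for _ in range(num_tokens):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        step()  # placeholder for one token of inference
        end.record()
        end.synchronize()  # wait for the step so elapsed_time() is valid
        times_ms.append(start.elapsed_time(end))
    return times_ms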

QM60 commented 9 months ago

Oh, you can zoom in! Yeah it does seem spiky, hence things like the 4ms kernel execution below. But all the runs are much slower than the GPTQ kernel, which also seems more steady.

[screenshot: Nsight Systems events view, GPTQ vs EXL2 kernel launch times]

I tried profiling it, but the overhead is so bad I could only grab the first few calls to that kernel; it would take forever to start emitting tokens. That means I have no idea whether the profiled calls exhibit the issue or not. It doesn't seem that bad to me, but what do I know. (Can you guess I know nothing about CUDA?)

[screenshot: Nsight Compute profile of the EXL2 gemm_half_q_half_kernel]

On a more practical note, for those who encounter this issue: it will likely happen again if you just kill the app and restart. BUT, if you force an OOM (say by starting it with too large a context) and then relaunch, it has a much better chance of working. No idea what this means, but it offers a practical workaround for now.

I wonder if something in the exl2 kernel requires contiguous memory but is randomly not getting it? OOM seems to clean house in GPU RAM, so maybe that's why it helps.
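A cheaper way to test the "clean house" theory than relaunching might be to trigger and catch an OOM inside the same process and then flush Torch's cache; a sketch (Torch 2.x):

import torch

def force_oom_and_flush():
    # Sketch: imitate the "force an OOM to clean house" workaround without
    # relaunching. Whether this actually helps is exactly the open question.
    try:
        # Deliberately over-allocate (1 TiB) so the allocation always fails.
        _ = torch.empty(1 << 40, dtype=torch.uint8, device="cuda")
    except torch.cuda.OutOfMemoryError:
        pass
    torch.cuda.empty_cache()   # return cached blocks to the driver
    torch.cuda.synchronize()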

turboderp commented 9 months ago

Well, the kernels definitely see a contiguous address space, because that's the only way they work. And CUDA shouldn't automatically use unified memory. However, now I'm thinking that since all global memory buffers are actually allocated as Torch tensors and not explicitly by the extension, perhaps some changes have been made to Torch in recent versions to accommodate the changes introduced by later NVIDIA drivers? What version of PyTorch are you on?

QM60 commented 9 months ago

Torch is 2.1.0+cu121. It makes sense that kernels would see contiguous memory. I just don't know if it's physically real. Given that nvidia drivers supposedly can offload some things to system ram, the question is what memory it's really using. However, the sampling data from nsight doesn't look different even for the long kernel calls (4.0ms). And time spent on "Memory" is not increasing. So if this is happening, it must be so well abstracted that nsight can't see it. Now I'm tempted to turn offloading back on and see if it looks different in nsight.

But for all these ideas, the question is: why does it never happen for the GPTQ kernel?? What is exl2 doing differently? The key has to be there...

turboderp commented 9 months ago

One difference would be in the tensor shapes. EXL2's tensors are a variety of sizes while the GPTQ tensors are much more regular. Basically only two shapes: one for QKVO projections and one for MLP up/gate/down layers. That shouldn't give Torch, the driver or the runtime any special reason to treat them differently, since they're still just static in global memory, but it might trigger different behavior regardless.
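That difference is easy to see by tallying the weight shapes in the two files; a sketch with safetensors (the path is a placeholder):

from collections import Counter
from safetensors import safe_open

def shape_histogram(path):
    # Sketch: count the distinct tensor shapes in a model file, to compare
    # how regular the GPTQ layout is versus an EXL2 quant.
    counts = Counter()
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            counts[tuple(f.get_slice(name).get_shape())] += 1
    return counts

print(shape_histogram("model.safetensors"))  # placeholder path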

turboderp commented 5 months ago

Closing some stale issues.