turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Quantizing Llama 3.1 405B #565

Closed grimulkan closed 1 month ago

grimulkan commented 1 month ago

Tried to make an EXL2 of it.

I added the fix to the inv_freq scaling that is apparently expected in these models, making the following change in model.py (see https://huggingface.co/v2ray/Llama-3.1-405B for instance):

inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device = device).float() / head_dim))

# Llama 3.1 fix (uses math.pi, so model.py needs `import math` if it isn't already imported)
def apply_scaling(freqs: torch.Tensor):
    # Values obtained from grid search
    scale_factor = 8
    low_freq_factor = 1
    high_freq_factor = 4
    old_context_len = 8192  # original llama3 length

    low_freq_wavelen = old_context_len / low_freq_factor
    high_freq_wavelen = old_context_len / high_freq_factor
    new_freqs = []
    for freq in freqs:
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            new_freqs.append(freq)
        elif wavelen > low_freq_wavelen:
            new_freqs.append(freq / scale_factor)
        else:
            assert low_freq_wavelen != high_freq_wavelen
            smooth = (old_context_len / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor
            )
            new_freqs.append((1 - smooth) * freq / scale_factor + smooth * freq)
    return torch.tensor(new_freqs, dtype=freqs.dtype, device=freqs.device)     
print("Applying Llama 3.1 fix to positional embeddings")       
inv_freq = apply_scaling(inv_freq)

Got the old Hessian is not invertible error:

--------------------------------------------
| Measured: model.layers.0 (Attention)     |
| Duration: 65.75 seconds                  |
| Completed step: 1/255                    |
| Avg time / step (rolling): 65.75 seconds |
| Estimated remaining time: 278min 20sec   |
| Last checkpoint layer: None              |
--------------------------------------------
 -- Layer: model.layers.0 (MLP)
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
 !! Warning: Applied additional damping
Traceback (most recent call last):
  File "Z:\Code\exllamav2\exllamav2\conversion\adaptivegptq.py", line 292, in prepare
    hessian_inv = torch.linalg.cholesky(hessian)
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 40962 is not positive-definite).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Z:\Code\exllamav2\convert.py", line 1, in <module>
    import exllamav2.conversion.convert_exl2
  File "Z:\Code\exllamav2\exllamav2\conversion\convert_exl2.py", line 256, in <module>
    status = measure_quant(job, save_job, model, args.hidden_state_offload_layers)  # capturing the graceful exits
  File "Z:\Code\exllamav2\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "Z:\Code\exllamav2\exllamav2\conversion\measure.py", line 610, in measure_quant
    m = measure_mlp(module, hidden_states, target_states, quantizers, cache, attn_params)
  File "Z:\Code\exllamav2\exllamav2\conversion\measure.py", line 216, in measure_mlp
    quantizers["down_proj"].prepare()
  File "Z:\Code\exllamav2\exllamav2\conversion\adaptivegptq.py", line 330, in prepare
    raise ValueError("Hessian is not invertible")
ValueError: Hessian is not invertible

Obviously, I don't actually expect to fix this before the model is even officially released, but leaving it here for whenever Turbo can get to it :)

turboderp commented 1 month ago

It's possible the matrix is simply so large it has to be inverted in FP64 precision. Or the model needs more calibration data to work with during measurement. You could try increasing rows_random on line 82 of exllamav2/conversion/tokenize.py. Give it a value like 20.
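If anyone wants to try the FP64 route first, here is a minimal sketch of what the retry could look like (this is not the actual adaptivegptq.py patch, just an illustration; the upcast roughly doubles the VRAM needed for the factorization):

import torch

def cholesky_with_fp64_fallback(hessian: torch.Tensor) -> torch.Tensor:
    # Try the factorization at the original precision first
    try:
        return torch.linalg.cholesky(hessian)
    except torch.linalg.LinAlgError:
        # Retry in FP64; the extra copy costs roughly 2x the matrix size in VRAM
        return torch.linalg.cholesky(hessian.to(torch.float64)).to(hessian.dtype)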

But yes, it's going to take some experimenting, and each experiment in this case could take days to run. It'll be a lot easier once the smaller models are out so I can first of all verify that the RoPE scaling is working as intended with an unquantized 8B model.

I'm not sure how much time I want to dedicate to it, though, given how few people will be able to even run the model in ExLlama. Moreover it's not really a good fit for ExLlama to begin with. Even with an array of 4090s you're only going to get a few tokens per second since there's no tensor parallelism.

grimulkan commented 1 month ago

FWIW I can help test/experiment if you or anyone else is interested. Setting rows_random to 20 did not help (same error as before; EDIT: also tried a value of 100 with the same error). If you have any other specific suggestions, I'll be happy to try. Or we could just wait for 8B/70B and see if those fixes translate to 405B.

It'll be a lot easier once the smaller models are out so I can first of all verify that the RoPE scaling is working as intended with an unquantized 8B model.

I also cannot run the unquantized 405B model on a single node to verify that the rope scaling is working as intended.

Moreover it's not really a good fit for ExLlama to begin with. Even with an array of 4090s you're only going to get a few tokens per second since there's no tensor parallelism.

Good point. If it comes down to the same tok/s as GGUF on an 8- or 12-channel DDR4/5 machine, there's no reason to waste GPUs. My guess is it's still slightly faster, though; worth trying to see whether it's possible/comparable.

405B will be good for generating synthetic data (SFT, distillation log probs), if nothing else, and EXL2 will make it possible to run it in a single node (e.g., with 4-8x48GB, or even many 3090s with bifurcation/PLX). Even if it is still only naive MP, that's fine for single-batch inference. It is a niche use case, as you say. Most people who want to actually run it probably use data center GPUs and custom tensor-parallel code. I'm sort of stuck in between with my setup.

EDIT: If we do get it to work, does the Exllama inference server support queuing up requests to fill a pipeline? That is, schedule the first X layers on GPU 0 on the 2nd request, when the 1st request clears GPU 0 and enters GPU1. I've never felt the need before because the model parallel chains never got too long with the smaller models and Exllama. That would probably smoke GGUF effective throughput.

EDIT 2: Tried float64 cholesky, but OOM on a 48GB GPU :( Will keep experimenting. Maybe there's a way to reduce VRAM consumption for the 2nd attempt (but you've probably already done what you can). It looks like it's just the giant MLP with the 53K dimension (vs 28K on 70B). The attention tensors seem to quantize without the error.

EDIT 3: Just noticed you automatically move the tensor to a fresh GPU if available on an OOM. With this, I was able to test float64 cholesky, and it still fails (matrix not positive definite). Will try more diagonal damping.

turboderp commented 1 month ago

It's for sure going to be faster than CPU inference, since it will be memory-bound in any case. I think you can push a high-end CPU server to about 500 GB/s which is still only half the bandwidth of a 4090. But ideally tensor parallelism gives you effectively some multiple of the bandwidth of each individual GPU. And for prompt ingestion and batching you'd still end up compute bound, and the CPU server would fall much farther behind.

With eight 4090s you'd have up to 8 TB/s in total, and at 4 bpw (200 GB of weights), that means your theoretical upper limit is like 40 t/s. Even with 25% overhead (?) for attention and synchronization, that's still very usable.
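Rough napkin math behind that ceiling (assuming ~1 TB/s of memory bandwidth per 4090 and purely bandwidth-bound decoding):

weights_gb = 405e9 * 4 / 8 / 1e9    # ~203 GB of weights at 4 bpw
per_gpu_bw_gbs = 1008               # approximate memory bandwidth of one 4090, GB/s
total_bw_gbs = 8 * per_gpu_bw_gbs   # ~8 TB/s aggregate over eight GPUs
print(total_bw_gbs / weights_gb)    # ~40 tokens/s theoretical upper limit, before overhead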

There isn't currently any support for pipeline parallelism in the engine. It's not significantly faster to run two staggered forward passes than to batch them into one, it just generates twice the heat. I'm sure there's a batch size at which it starts to pay off, but the added complexity would be comparable to implementing tensor parallelism, so I don't think it's worth pursuing.

It's conceivable that the inversion fails because the hidden state is overflowing. I had to deal with this for Gemma2 by adding an option to keep the residual stream in FP32, but currently it's only implemented for models with post norms (i.e. Gemma2). But it could be worth checking for inf and NaN values in the hidden state, e.g. before and after the MLP layernorm.
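A check along these lines would do (hedged sketch; call it on the hidden state before and after the MLP layernorm, or wherever is convenient):

import torch

def report_bad_values(x: torch.Tensor, label: str):
    # Count non-finite elements in the hidden state and print a warning if any are found
    n_nan = torch.isnan(x).sum().item()
    n_inf = torch.isinf(x).sum().item()
    if n_nan or n_inf:
        print(f" !! {label}: {n_nan} NaN, {n_inf} inf values in hidden state")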

grimulkan commented 1 month ago

For PP and staggered batching, you're right. I forgot you still need to keep KV caches for all batches on each GPU. For some reason I thought staggering would be less VRAM-heavy than normal batching.

For TP, there'd be quite a bit of P2P chatter. Unless you have NVLink/NVSwitch, you'd be bottlenecked by P2P PCIe bandwidth on non-datacenter GPUs. Don't know if that slows it down to the same level as naive MP in ExLlama.

Will look for nans.

turboderp commented 1 month ago

Also make sure to install flash-attn if you haven't already. There are Windows wheels here. If you're already using it, maybe try disabling it. :shrug:

grimulkan commented 1 month ago

Was using flash-attn. Didn't know there were Windows wheels now! I've been building my own.

So apparently when torch.linalg.cholesky() fails in this case (The factorization could not be completed because the input is not positive-definite (the leading minor of order 40961 is not positive-definite)), it wrote NaNs into the input tensor (hessian). I don't know if that's a problem in my setup or what. So any further operations like damping or moving it to another GPU and casting it to float64 carried the NaNs over. The original self.hessian doesn't have any NaNs though.
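For what it's worth, torch.linalg.cholesky is not supposed to modify its input, so a quick diagnostic for the clobbering might be something like this (illustrative sketch, attribute name as in adaptivegptq.py):

h = self.hessian.clone()
assert not torch.isnan(h).any()
try:
    torch.linalg.cholesky(h)
except torch.linalg.LinAlgError:
    pass
print(torch.isnan(h).any())  # comes back True on this setup, i.e. the input tensor was overwritten with NaNs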

Will post here if I figure it out.

turboderp commented 1 month ago

It could be that it needs the FP32 residual stream. I can't look into it until tomorrow though.

For now I've added the changes to support the new Llama3.1 RoPE scaling method, which could also be relevant. The 8B version works at short contexts without it, but who knows if maybe 405B is more sensitive? You definitely want to pull those changes in any case.

grimulkan commented 1 month ago

Eyeballing the commit it looks to be the same as the "fix" I had in the first post, so I probably already had it. But will pull and try again with the latest.

Take your time, thanks for the support. I'll check quants on the smaller models in the mean time.

EDIT: 70B quantizes just fine, so it's just the down_proj on the 405B that gets stuck.

EDIT2: GPTQ quantization of the 405B works fine with Exllama (6x48GB for 48K of context). I get about 3 tok/s at ~22K context, though prompt ingestion took FOREVER (the first time). That's totally usable, especially with not having to drop the first bits of the conversation with the long context, preserving the KV cache.

Would love to get more control with EXL2 quant options. In contrast, the AWQ implementation seems to be quite inefficient when I tried it, both in speed and VRAM usage (or maybe it's just me).

EDIT3: I also tried a bunch of different ways of computing the Cholesky factorization & inverse in 64-bit (blockwise Cholesky, pseudo-inverse), but it seems the Hessian is already very poorly conditioned by construction. So you are probably correct: the residual stream itself may need to be in 32-bit. Of course, if I add enough of the identity matrix it will invert, but I'm not sure that's accurate anymore.

turboderp commented 1 month ago

Adding damping regularizes the matrix and reduces the impact of the GPTQ error compensation, essentially making it more and more like RTN quantization. So it's not that it's incorrect, and in fact it will be more accurate with respect to the weights, just less accurate with respect to activations (which would be the point of GPTQ). You'll still have the benefits of act-order quantization and variable bitrate and all that.
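For reference, the damping is essentially just a scaled identity added to the Hessian's diagonal, roughly like this (illustrative sketch, not the exact adaptivegptq.py code):

import torch

def apply_damping(hessian: torch.Tensor, damp_frac: float = 0.01) -> torch.Tensor:
    # Add a fraction of the mean diagonal value to every diagonal element
    damp = damp_frac * torch.mean(torch.diag(hessian))
    idx = torch.arange(hessian.shape[0], device = hessian.device)
    hessian[idx, idx] += damp
    return hessian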

You could try setting clamp_hidden_states = True in the architecture definition for Llama. That will at least get rid of any inf values in the FP16 hidden states going out of the attn and MLP blocks.
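The clamp itself just keeps the FP16 stream inside the finite half-precision range, i.e. something to the effect of (illustration, not the exact code path):

import torch

FP16_MAX = 65504.0

def clamp_hidden_states(hidden_states: torch.Tensor) -> torch.Tensor:
    # Replace +/-inf with the largest finite FP16 magnitude so downstream matmuls stay finite
    return hidden_states.clamp(-FP16_MAX, FP16_MAX)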

I'll add FP32 residual as an option probably over the weekend.

grimulkan commented 1 month ago

More info: Trying without flash-attn and with clamp_hidden_states = True didn't change the situation. Also, I misspoke: adding more damping only changes the error message and it still fails. It also kept bothering me why the input tensor was being populated by NaNs upon cholesky failure, which should not normally happen.

Then I tried this:

inv = torch.linalg.cholesky(torch.eye(53248, 53248, dtype=torch.float32, device='cuda:0'))

It fails. The error message says it is not a positive definite matrix, but I think that's a red herring. I think we're dealing with some internal limitation of linalg.cholesky, and not necessarily a precision problem. Maybe wrong error handling internally.

Would be nice to check if any one else sees this.

This version:

inv = torch.linalg.cholesky(torch.eye(47500, 47500, dtype=torch.float32, device='cuda:0'))

inverts fine. A dim of 48000 does not. This is true for both float32 and float64.

I tried this on both a 24GB and 48GB GPU, and the limit is the same, so not an internal OOM.

Maybe this is not about the residual stream at all. I'm on torch==2.3.1, CUDA 12.1.

EDIT: Some more results:

torch.linalg.pinv also fails at this size.

This code:

q,r = torch.linalg.qr(torch.eye(53248, 53248, dtype=torch.float32, device='cuda:0'))
q = torch.linalg.inv(r) @ q.T

works at a size of 48000 while cholesky does not, but it does OOM at 53248 on a 48GB GPU. Similarly torch.linalg.solve(hessian, identity) also works, but OOMs.

However this code based on LU factorization seems to invert using less VRAM than above, in that I at least get the identity matrix back:

lu, pivots = torch.linalg.lu_factor(torch.eye(53248, 53248, dtype=torch.float32, device='cuda:0'))
identity = torch.eye(53248, device='cuda:0')
identity = torch.linalg.lu_solve(lu, pivots, identity)

When tested in adaptivegptq.py I had to do some gymnastics to avoid an OOM, but this code worked:

while not done:
    try:
        hessian = hessian.to(torch.device(current_device))

        if hessian.shape[0] > 47500: # Fix for very large matrices
            if current_device == 0 and max_devices > 1:
                current_device += 1 # Pre-emptive move to next GPU to save some time
                self.hessian_device = current_device
                hessian = hessian.to(torch.device(current_device))
            hessian, pivots = torch.linalg.lu_factor(hessian) # Overwrite hessian, otherwise not enough VRAM
            hessian_inv = torch.eye(hessian.shape[0], device=hessian.device)
            hessian_inv = torch.linalg.lu_solve(hessian, pivots, hessian_inv)
            # Free up VRAM (note that we will error out if the damping wasn't sufficient, and we need to reprocess hessian)
            hessian, pivots = None, None
        else:
            hessian_inv = torch.linalg.cholesky(hessian)
            hessian_inv = torch.cholesky_inverse(hessian_inv)

This inverts. No change to precision needed. But it barely fits in a 48GB GPU, with nothing else allocated.
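The tight fit is consistent with the raw sizes: each FP32 copy of the 53248 x 53248 matrix is over 10 GiB, and LU needs the factored matrix plus the identity/solution buffer resident at the same time:

n = 53248
bytes_per_copy = n * n * 4            # one FP32 copy of the Hessian
print(bytes_per_copy / 1024**3)       # ~10.6 GiB per copy
print(3 * bytes_per_copy / 1024**3)   # ~31.7 GiB with hessian + identity + workspace in flight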

Will keep testing, maybe for more memory efficient inversion methods. If this is real and not something wrong with my setup, I wonder how HuggingFace was able to make a GPTQ quant of this model. Maybe on an 80GB GPU with a different inversion method.

edk208 commented 1 month ago

@grimulkan this command works for me.

Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> inv = torch.linalg.cholesky(torch.eye(53248, 53248, dtype=torch.float32, device='cuda:0'))
>>> 

nvidia-smi shows the VRAM usage on the RTX 4090:

|   0  NVIDIA GeForce RTX 4090        Off | 00000000:2D:00.0 Off |                  Off |
|  0%   37C    P8              3W / 450W |  22162MiB / 24564MiB |      0%      Default |

torch=2.3.0.dev20240120+cu121 cuda=12.1

grimulkan commented 1 month ago

Thanks @edk208 that's really strange. Can you also confirm your numpy version ~and OS~ (I saw you're on linux)?

Maybe this is not an exllama problem at all. Not sure what it could even be.

I tried on torch==2.3.0 and got the same result. Cholesky does work on the CPU, so that's an option for me as well.

edk208 commented 1 month ago

numpy 1.26.3... can confirm it worked on a different machine (HPC cluster of A40s and A100s). I think I passed your roadblock, using the main branch.

The A40 ran out of memory but could continue on using the next device. Unfortunately, I have limited time on this cluster so the measurement job will be killed before it finishes. But at least we know it works. I will try to launch more jobs and finish the quants.

--------------------------------------------
| Measured: model.layers.0 (Attention)     |
| Duration: 64.29 seconds                  |
| Completed step: 1/255                    |
| Avg time / step (rolling): 64.29 seconds |
| Estimated remaining time: 272min 10sec   |
| Last checkpoint layer: None              |
--------------------------------------------
 -- Layer: model.layers.0 (MLP)
 !! Out of memory (H), moving to device 1
 -- model.layers.0.mlp.gate_proj                       0.05:3b_64g/0.95:2b_64g s4                         2.11 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:3b_64g/0.9:2b_64g s4                           2.16 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:4b_128g/0.9:3b_128g s4                         3.13 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:4b_32g/0.9:3b_32g s4                           3.23 bpw
 -- model.layers.0.mlp.gate_proj                       1:4b_128g s4                                       4.03 bpw
 -- model.layers.0.mlp.gate_proj                       1:4b_32g s4                                        4.13 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:5b_128g/0.9:4b_128g s4                         4.13 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:5b_32g/0.9:4b_32g s4                           4.23 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:6b_128g/0.9:5b_128g s4                         5.13 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:6b_32g/0.9:5b_32g s4                           5.23 bpw
 -- model.layers.0.mlp.gate_proj                       1:6b_128g s4                                       6.03 bpw
 -- model.layers.0.mlp.gate_proj                       0.1:8b_128g/0.9:6b_128g s4                         6.23 bpw
 -- model.layers.0.mlp.gate_proj                       1:8b_128g s4                                       8.03 bpw
 -- model.layers.0.mlp.up_proj                         0.05:3b_64g/0.95:2b_64g s4                         2.11 bpw
 -- model.layers.0.mlp.up_proj                         0.25:3b_64g/0.75:2b_64g s4                         2.31 bpw

grimulkan commented 1 month ago

Thank you very much for checking.

I can do the Cholesky fine in WSL, on the exact same machine. But the native Windows version fails. No idea what that is. Maybe some kind of driver bug.

Unfortunately WSL does not support TCC-mode GPUs or P2P.

I'll close this issue once the quant finishes. If anyone else on Windows has a similar problem, do post.

turboderp commented 1 month ago

It sounds suspiciously like an integer overflow bug. 47000**2 is close to 2^31 (though on the wrong side so, hm!) and I have had issues with tensors larger than 2^31 elements that manifest specifically on Windows but not WSL.

Safetensors issue here went stale, and I don't really have time to go chasing the bug upstream, but presumably it originates with numpy, and possibly it's related to what you're seeing.

Also it's worth confirming that this line actually works:

                hessian = hessian.to(torch.device(current_device))

I've had to add the exllamav2.compat.safe_move_tensor function to work around P2P issues during inference, but I haven't used it anywhere in the quantization script. Perhaps also worth trying, because it might be that the tensor simply isn't getting copied to the second GPU.
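i.e. swapping that line for something like the following (untested in the conversion path, but the helper is the same one the inference code uses):

from exllamav2.compat import safe_move_tensor

hessian = safe_move_tensor(hessian, torch.device(current_device))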

grimulkan commented 1 month ago

Oh yes, I've definitely encountered the P2P copy issue, and checked that it was transferring correctly. I also moved it explicitly via the CPU to confirm.

Thanks for the tip on 2**31. https://github.com/numpy/numpy/issues/8433 might be relevant.

I can indeed replicate the issue mentioned there for Numpy 1.26.4, but it is gone in WSL, and in Numpy 2.0 in Windows. Let me see if I can find a Numpy 2.0 compatible version of Pytorch to test (which is a problem because of https://github.com/pytorch/pytorch/issues/131668).

EDIT: Asking upstream https://github.com/pytorch/pytorch/issues/131774

edk208 commented 1 month ago

Update: was able to compute the measurements.json after about 13 hrs on a dual A40. I think I have to redo it though, because I didn't set the rope scale in the convert process. I think it should have been rope_scale 8.

besides that, moving forward with the quants failed on the sanity check.

try:
    if quant_w.numel() <= 1e9:
        ident = torch.eye(recons_linear.in_features, dtype = torch.half, device = r_device)
        recons_w2 = recons_linear.forward(ident, force_cuda = True)
        recons_w2.sub_(quant_w)
        if recons_linear.has_bias: recons_w2.sub_(recons_dict["bias"])
        recons_w2.abs_()
        diff2 = torch.max(recons_w2)
    else:
        diff2 = 0

it craps out with this error only on the down_proj,

  File "/root/exllamav2/exllamav2/conversion/quantize.py", line 105, in quant_linear
    recons_w2.sub_(quant_w)
RuntimeError: CUDA error: an illegal memory access was encountered

I confirmed that recons_w2 and quant_w are the same size and on the same device, so this error is confusing. The down_proj is right below that 1e9 size limit. I get this same error on a dual A40 (48GB VRAM each) and also on a dual A100 80GB. I can skip the check by setting the size limit to 5e8, and it seems to pass through fine. I don't like the skip, but I'm out of ideas.

grimulkan commented 1 month ago

I'll post my results too in a few hours, and whether or not I run into the same issue. If you used Meta's official release, the config.json has the scale factor in it, if that's what you meant. The latest exllama reads it and should compute RoPE correctly automatically. It's not a simple scaling like previous RoPE or theta scaling methods (like the code up top).

While testing GPTQ quants, the 405B does seem to want the correct RoPE, whereas the smaller models seem more forgiving (though they were quantized less, or not at all, in my test). The 405B seems absolutely great at retrieving details over long context compared to 70B when RoPE is set correctly, so far with limited testing. It confuses similar details in the context far less, and can usually spot its own error upon self-reflection, in instances where the 70B never does. It's also "only" ~2.5x slower than 70B in exllama (in GPTQ). For shorter context and many other tasks, the 70B seems pretty close.

turboderp commented 1 month ago

The skip is only there because the sanity check goes OoM on very large matrices on 24 GB GPUs and I couldn't make it fit otherwise. For instance the output matrix of Command-R+ is 12288*256000 elements (3.15e9), but the matmul kernel still works. I've also gone over it at length and I can't find anywhere the kernel would have a problem with the dimensions.

If you do skip the sanity check and it manages to start quantizing layers, the reported RFN error after each module should also give an indication of whether or not the quantization is working. After the last layer you'll also get the "calibration perplexity" which is calculated using the calibration dataset and the quantized model. If that's a reasonable number (I would expect less than 10 for this model), then the quantization probably worked.

If it is the case that the matmul kernel can't handle the matrix, the linear, attn and mlp modules all have codepaths that avoid using the matmul kernel, so the quantized model could be tested that way.

edk208 commented 1 month ago

thanks for the reply. in terms of the rope scale, this is what is mentioned in the convert documentation,

-rs / --rope_scale float: RoPE scaling factor to apply to base model for calibration. This settings is not automatically read from the model's config, so it's strongly recommended that you check what setting the model was trained/finetuned with. E.g.: deepseek-coder uses a scaling factor of 4, so will be incorrectly calibrated if you convert it without -rs 4.

Is this not the case with llama 3.1? Thanks!

turboderp commented 1 month ago

That is a different type of arbitrary scaling factor that scales the position IDs for any model.

Llama3.1 uses its own scheme with a different scale for each frequency. The parameters for that are given by the model's config and automatically applied when that key is present. So you don't need to use the rope scale parameter when quantizing (or inferencing.)
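For reference, the relevant block in the Llama 3.1 config.json looks like this (values as shipped by Meta; double-check the key names against your copy):

"rope_scaling": {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
}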

grimulkan commented 1 month ago

I encountered an error in the same code block on quantizing MLP downproj, but my error was different:

 -- Layer: model.layers.7 (Attention)
 -- Linear: model.layers.7.self_attn.q_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.17 bpw
 -- Linear: model.layers.7.self_attn.k_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.18 bpw
 -- Linear: model.layers.7.self_attn.v_proj -> 0.25:3b_64g/0.75:2b_64g s4, 2.33 bpw
 -- Linear: model.layers.7.self_attn.o_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.17 bpw
 -- Module quantized, rfn_error: 0.000780
 -- Layer: model.layers.7 (MLP)
 -- Linear: model.layers.7.mlp.gate_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.11 bpw
 -- Linear: model.layers.7.mlp.up_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.11 bpw
 -- Linear: model.layers.7.mlp.down_proj -> 0.05:6b_32g/0.2:3b_64g/0.75:2b_64g s4, 2.47 bpw
 ## Quantization error (2)

Originating from this line:

    if diff1 > 0.05 or diff2 > 0.075:
        print(" ## Quantization error (2)")
        os._exit(1)
    elif diff1 > 0.01 or diff2 > 0.01:
        print(f" !! Warning, difference of ({diff1:.6f}, {diff2:.6f}) between unpacked and dequantized matrices")

In my case diff1 was 0.0 and diff2 was 0.7133789. Strangely, the illegal memory access error was what I initially got when the 53k x 53k Cholesky failed, but the sub_() operation here doesn't fail for me in that way on Windows (maybe bugs out in other ways).

RFN error for all previous layers was below 0.003. If I skip the check as @edk208 mentioned, I get:

 -- Layer: model.layers.7 (MLP)
 -- Linear: model.layers.7.mlp.gate_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.11 bpw
 -- Linear: model.layers.7.mlp.up_proj -> 0.05:3b_64g/0.95:2b_64g s4, 2.11 bpw
 -- Linear: model.layers.7.mlp.down_proj -> 0.05:6b_32g/0.2:3b_64g/0.75:2b_64g s4, 2.47 bpw
 -- Module quantized, rfn_error: 0.000898
 -- Saving checkpoint...
 -- Layer: model.layers.8 (Attention)
 -- Linear: model.layers.8.self_attn.q_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.17 bpw
 -- Linear: model.layers.8.self_attn.k_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.18 bpw
 -- Linear: model.layers.8.self_attn.v_proj -> 0.25:3b_64g/0.75:2b_64g s4, 2.33 bpw
 -- Linear: model.layers.8.self_attn.o_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.17 bpw
 -- Module quantized, rfn_error: 0.000713
 -- Layer: model.layers.8 (MLP)
 -- Linear: model.layers.8.mlp.gate_proj -> 0.1:3b_64g/0.9:2b_64g s4, 2.16 bpw
 -- Linear: model.layers.8.mlp.up_proj -> 0.3:3b_64g/0.7:2b_64g s4, 2.36 bpw
 -- Linear: model.layers.8.mlp.down_proj -> 0.05:5b_32g/0.95:3b_32g s4, 3.23 bpw
 -- Module quantized, rfn_error: 0.000866

and it seems to continue onward. RFN seems reasonable.

Calibration finished though, here is my measurement.json for anyone interested.

EDIT: Also I am using a pair of Ada 6000s for this. I think same VRAM as A40 but Ada vs Ampere gen. The data-center equivalent would be an L40. Don't know if that makes a difference. I think only the fp8 engine is architecturally different.

Also interesting you finished calibration in 13 hours. Mine took more like 18, and the Ada 6000 is supposed to have more compute and mem bandwidth. Maybe Windows is slower for some large matrix ops.

edk208 commented 1 month ago

I looked at the calibration at the 13hr mark, then did some other stuff and when i came back it was done. It could have taken several more hours, not sure exactly on the timing.

4.0bpw quant seems to have been successful, final perplexity.

Module quantized, calibration perplexity (quant): 4.9647

uploading to huggingface with the measurements too, https://huggingface.co/ek826/Meta-Llama-3.1-405B-Instruct-4.0bpw-exl2

I'll probably run some inference tests over the weekend.

grimulkan commented 1 month ago

Done here too! At 6 bits (and 6 head_bits):

 -- Module quantized, calibration perplexity (quant): 4.6322

Will quantize more versions and compare PPL.

The 6-bit 405B model @ full 128K context loads in 8x48GB (Ada 6000) using 4-bit KV cache with some room to spare (total 344.6GB, and the 1st GPU needs a bit of extra room, for the giant unquantized embed layer I imagine).

For some reason, while loading the 405B HF GPTQ version in exllama, I would get random OOMs, even when there is plenty of VRAM left. Usually this happens when rolling over from one GPU to the next while loading weights. Not sure if it's a fragmentation issue, some memory leak/spike in safetensors, or something else. I had to fiddle around with the gpu_split sizes, but even that wasn't reliable. The EXL2 version does not seem to have this problem and loads smoothly, even up against the VRAM limit.

I get ~2-3 tokens/sec with the 6-bit EXL2 model @ 48K of context (same as the 4-bit GPTQ from huggingface). To compare, 6-bit 70B model @ 48K gives me about 9-10 tokens/sec on two of those GPUs. All single batch in exllama_hf streaming mode (kind of a worst case).

Ingesting 32K of context took a whopping ~900sec with the 6-bit model (with PCIe Gen4 x8), but that overhead goes away if you can re-use the cache. I imagine the 4-bit version will do better in ingestion (edit: actually, it's the same). nvidia-smi reports 4-6 of the 8 GPUs at 100% utilization at the same time during prompt ingestion. I thought only one GPU could be used at a time! exllama magic?

grimulkan commented 1 month ago

@turboderp is there a way to measure the PPL in layer streaming mode for the fp16 model, using the same text as the calibration data? Or maybe some way to extract the exact same calibration data so I can measure PPL on it externally? That way I could see how close the 6-bit gets to the ideal, without needing to re-measure each quant.

turboderp commented 1 month ago

I'm not sure why you're getting 100% GPU utilization. I would guess nvidia-smi is misreporting it.

It would be a bit of work to extract the calibration data and ensure that it's formatted precisely correctly, and that the perplexity measurement is exactly the same in the test script as in the quantization script. You can just do a short test, though (the calibration perplexity is a short test as well):

python test_inference.py -m <fp16_model> -ed wikitext.parquet -er 5 -el 1024 -l 1024 -sl
python test_inference.py -m <6bpw_model> -ed wikitext.parquet -er 5 -el 1024 -l 1024 -sl
python test_inference.py -m <6bpw_model> -ed wikitext.parquet -er 5 -el 1024 -l 1024 -gs auto

That should only take a couple of minutes, I think, and still give a good indication of how successful the quantization was.

grimulkan commented 1 month ago

I'm not sure why you're getting 100% GPU utilization. I would guess nvidia-smi is misreporting it.

This is only during prompt ingestion. I see a lot of shuttling back and forth between the GPUs. Is that normal?

Looking at the utilization vs time, it looks like it does only use 1 GPU at a time, and possibly nvidia-smi just happens to sample over a longer period of time for its display and merges the numbers. But I'd have thought we would process all layers in one GPU, then move on to the next, and we don't really have to return multiple times. Maybe I misunderstood how it works. I didn't notice this in 70B, but then I never ran that on more than 2 GPUs.

Quantization looks to be working, here are the PPL results:

(Attached: PPL result charts for the 405B and 70B quants.)

I know perplexity isn't everything, but this suggests 405B is still pretty size optimal, i.e., quantizing a larger model down could be better than using a smaller model.

@edk208 you had a few different bpw settings than mine; I'll test your models too. It's worth mapping out the 2-6 bit space (or maybe below 2?).

I think EXL2 405B quants are working as expected, in that they are beating earlier EXL2 quants of 70B and pretty close to the fp16.

grimulkan commented 1 month ago

This is only during prompt ingestion. I see a lot of shuttling back and forth between the GPUs. Is that normal? Looking at the utilization vs time, it looks like it does only use 1 GPU at a time, and possibly nvidia-smi just happens to sample over a longer period of time for its display and merges the numbers. But I'd have thought we would process all layers in one GPU, then move on to the next, and we don't really have to return multiple times. Maybe I misunderstood how it works. I didn't notice this in 70B, but then I never ran that on more than 2 GPUs.

More details: it looks like this for 405B and takes ~15 minutes to ingest a long prompt (blue lines indicate GPU utilization). Plenty of back and forth between GPUs before outputting any tokens: (GPU utilization chart attached)

Whereas it is ~~23x faster and only takes 40 seconds~~ (upon more careful measurement, it is 6x faster and takes 150 seconds, which is what one would expect) to ingest the same prompt with 70B, and it looks like this with no shuttling: (GPU utilization chart attached)

Is this expected? Probably getting into something other than quantization. If this is not expected, I can close this and open a separate issue.

Update: After re-measuring the time, it seems about what I'd expect for the size ratios, so maybe all totally expected.

edk208 commented 1 month ago

Tested the quants and they look good. I think the quant issue is solved.

As for the context ingestion, I'm also seeing very slow speeds. I suppose it could just be linear to scale, but it feels off.

grimulkan commented 1 month ago

Okay, let's see if turbo has any views on ingestion and GPU shuttling. Closing the issue, as the quant is resolved (with fixes in the thread above).

turboderp commented 1 month ago

Regarding the prompt ingestion, I'm thinking it comes down to a feature I added to deal with very large vocabularies in smaller models.

For quantized layers the matmul switches between either using a custom kernel or dequantizing the entire matrix to global memory and doing a regular FP16 @ FP16 matmul using cuBLAS. For a large enough number of input rows, i.e. when the operation is compute-bound enough, there is always a point at which it works out to be more efficient to only dequantize every weight once even if it then has to be temporarily written to GMEM as opposed to much faster SMEM or even registers.

This requires reserving a buffer, though, large enough for the FP16 version of the largest matrix in the model. For a lot of recent models with 128k+ vocabularies, this ends up being the output linear layer, which can require several gigabytes of VRAM. And this is reserved for, essentially, perplexity calculation, since regular prompt ingestion wouldn't need the output at all, and generation only computes logits for the output tokens, with a number of rows equal to the batch size (or some small multiple of it in speculative decoding).

To avoid this VRAM overhead in the typical case where you're just doing inference and don't need logits for input tokens, there's a max_dq_size threshold set in the model's config, which is the largest number of weights that can be dequantized to GMEM at once. The idea is to split these very large matrices into slices along k to preserve VRAM while still getting most of the benefits of cuBLAS, hopefully, but that hasn't been fully implemented yet. In the meantime it just falls back on the quantized matmul, and that's likely why you're seeing the speeds you are, since the default is 512 * 1024**2, and all the MLP matrices in this model would be 832 * 1024**2.

You should be able to re-enable cuBLAS for all layers (including the output layer) by just setting config.max_dq_size >= hidden_size * vocab_size (e.g. 2 * 1024**3) and, unless there's some other complication I can't think of, both prompt ingestion and the perplexity test should run faster, but you will reserve around 4 GB of VRAM for dequantization.

A half measure would be to set config.max_dq_size >= hidden_size * intermediate_size (e.g. 1024**3) which should allow all layers except the output layer to use cuBLAS during prompt ingestion. You could then test the raw prompt ingestion speed with the -ps argument to the test script.
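Plugging in the 405B dimensions shows why the default threshold excludes the MLP matrices (napkin math, assuming hidden_size 16384, intermediate_size 53248, vocab_size 128256):

hidden_size, intermediate_size, vocab_size = 16384, 53248, 128256

print(hidden_size * intermediate_size / 1024**2)  # ~832 * 1024**2 weights per MLP matrix, above the 512 * 1024**2 default
print(hidden_size * vocab_size / 1024**3)         # ~1.96 * 1024**3 weights in the output layer, hence the 2 * 1024**3 suggestion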

edk208 commented 1 month ago

Amazing, thank you. Some quick napkin math: 2048-token ingestion is 136 seconds with max_dq_size = 512 * 1024**2. With max_dq_size = 2 * 1024**3, 2048-token ingestion is 18 seconds.

grimulkan commented 1 month ago

Hmm... I see no difference, and the GPU shuttling still happens. I tested by measuring the time before the output of the first token in the inference script (as opposed to pure perplexity measurement).

Am I setting it correctly? Modify config.json to include:

{
    "architectures": [
        "LlamaForCausalLM"
    ],
    ...
    "max_dq_size": 2147483648,
    "quantization_config": {
        ...
    }
}

There was no max_dq_size entry and I created it. Or should I set it directly in the code?

turboderp commented 1 month ago

Sorry, I wasn't too clear. It's in the ExLlamaV2Config. You would set it right before loading the model. You could also change the default in exllamav2/config.py
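Something like this, assuming the usual loading pattern (the path and split below are placeholders):

from exllamav2 import ExLlamaV2, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/Llama-3.1-405B-exl2"  # placeholder path
config.prepare()
config.max_dq_size = 2 * 1024**3                   # allow cuBLAS dequant for every layer (~4 GB extra VRAM)

model = ExLlamaV2(config)
model.load(gpu_split = [46, 46, 46, 46, 46, 46, 46, 46])  # placeholder GB-per-GPU split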

grimulkan commented 1 month ago

Cool. With max_dq_size = 2 * 1024 ** 3, ingesting nearly the full 128K of context drops down from 30+ minutes to <5 minutes (only about 2x slower than the 70B). Generation speed is the same (about 3x slower than the 70B of the same bpw). Leaving the max_dq_size at the higher value doesn't seem to change any of my GPU breakpoints for smaller models, but maybe it does for some people.

@turboderp GPU utilization still shuttles rapidly between all GPUs during ingestion, at an even tighter timescale than before. Guess I don't really understand why this happens! Wouldn't ingestion just be one long forward pass, layer by layer?

(GPU utilization chart attached.)

Also, are you taking advantage of any P2P if you need to pass info between the layers, or playing it safe? If there was just one transfer each for the residuals, it probably doesn't matter, but if it's shuttling back and forth 100s of times then maybe it does.

If you can think of any other tips to improve prompt ingestion here, do let us know! Even if you don't offer it as a user-friendly option, at least the community knows about it.

For anyone else reading this, a summary of the user fixes so far for 405B:

- Pull the latest exllamav2; it reads the Llama 3.1 rope_scaling parameters from config.json and applies them automatically (no -rs needed for quantizing or inference).
- On native Windows, torch.linalg.cholesky fails on the 53248-dim down_proj Hessian (apparently a large-tensor issue; the same call works in WSL/Linux). Workarounds: quantize under WSL/Linux, or invert via LU factorization (lu_factor/lu_solve) instead of Cholesky.
- If the sanity check in conversion/quantize.py errors out on down_proj, lowering the 1e9 element limit so the check is skipped for that matrix lets the quant proceed; RFN errors and calibration perplexity still come out reasonable.
- For fast prompt ingestion at inference time, raise config.max_dq_size (e.g. to 2 * 1024**3) before loading the model.

edk208 commented 1 month ago

@grimulkan are you using dynamic gen? I am seeing layer-by-layer ingestion; however, if you have small chunk sizes (e.g. max_chunk_size), it will do that chunk first, then go back and do the next chunk; maybe that explains the multiple passes.

grimulkan commented 1 month ago

Not dynamic, but I think you're right in that it is chunking. Could be max_input_len which defaults to 2048 in ExllamaV2Config. Let me try increasing that.

EDIT: Hmm... increasing it to 16384 caused the flash attention call to crash:

  File "z:\code\text-generation-webui\repositories\exllamav2\exllamav2\model.py", line 955, in forward_chunk
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
  File "z:\code\text-generation-webui\repositories\exllamav2\exllamav2\attn.py", line 984, in forward
    attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
  File "z:\code\text-generation-webui\repositories\exllamav2\exllamav2\attn.py", line 815, in _attn_flash
    attn_output = flash_attn_func(
  File "Z:\Code\text-generation-webui\venv\lib\site-packages\flash_attn\flash_attn_interface.py", line 881, in flash_attn_func
    return FlashAttnFunc.apply(
  File "Z:\Code\text-generation-webui\venv\lib\site-packages\torch\autograd\function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "Z:\Code\text-generation-webui\venv\lib\site-packages\flash_attn\flash_attn_interface.py", line 547, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "Z:\Code\text-generation-webui\venv\lib\site-packages\flash_attn\flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Maybe turbo set it to 2048 for a reason.

@edk208 When you saw layer by layer ingestion, was it over a smaller context length?

edk208 commented 1 month ago

@grimulkan yes, I haven't tried anything over 12k context length. I am currently using an 11-GPU (3090s) setup for inference, so I'm VRAM starved.

turboderp commented 1 month ago

Yes, you're definitely seeing the effects of chunking. You can increase the chunk size by increasing max_input_len, but it's also a VRAM tradeoff, and there are diminishing returns from increasing it past a certain point.
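If you want to experiment with larger chunks anyway, the knobs are on the config as well; a hedged sketch, keeping in mind the attention workspace scales with the square of the chunk size:

config.max_input_len = 4096            # chunk size for prompt ingestion (default 2048)
config.max_attention_size = 4096 ** 2  # keep the attention scratch space in step with the chunk size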