qwopqwop200 / GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ
Apache License 2.0

[Question] What is the expected discrepancy between simulated and actually computed values? #261

Open set-soft opened 1 year ago

set-soft commented 1 year ago

First: thanks for this implementation. I'm using it to load 7B models on my 8 GiB GPU using Ooba Gooba (which fails to report how much memory it used, I had to patch the code to see it, and also fails to mention that you need more than 1 GiB of extra VRAM for the default 2k-token context).

Now more context:

  1. I'm asking about the difference between the values printed by test_kernel.py as Simu and the ones printed as Kern
  2. I'm talking about the old-cuda branch.
  3. I'm using a HIPified version found here

So I'm not using CUDA, and from what I see test_kernel.py isn't really "Verifiying kernel correctness" (as its output claims), just checking that it doesn't crash. Am I missing something?

My board is a Radeon RX 5500 XT and it isn't officially supported by ROCm. I know the FP16 implementation has some issues, so it wasn't a surprise that the faster versions weren't actually faster and that their error is much bigger.

But what I want to know is whether the error of the regular versions is in the expected range. Note that I'm 100% new to PyTorch. I failed to force deterministic values, and I only computed absolute errors. I found a discrepancy that is usually under 1e-5 and sometimes a little bit over. The faster version was much worse.

Is this normal? How much error should I expect? Can test_kernel.py really verify the results?
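
For context, my check boils down to something like this (just a sketch; `simu` and `kern` are my own names for the two tensors, captured before the script prints them):

import torch

# Sketch of the comparison; `simu` is what I take to be the simulated result
# and `kern` the kernel result -- the names are mine, not from test_kernel.py.
def report_abs_error(simu: torch.Tensor, kern: torch.Tensor) -> None:
    diff = (kern.float() - simu.float()).abs()
    print('max abs error :', diff.max().item())
    print('mean abs error:', diff.mean().item())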

BTW: I uploaded pre-compiled wheels for Python 3.7, 3.8, 3.9 and 3.10 that are usable with PyTorch 1.13.1 (2.x isn't working), compiled with ROCm 5.2 (the official PyTorch release for ROCm). They can be found here.

A note for the authors: consider enabling GitHub Discussions; this post isn't a real issue and should have been posted as a discussion.

aljungberg commented 1 year ago

So I'm not using CUDA, and from what I see test_kernel.py isn't really "Verifiying kernel correctness" (as its output claims), just checking that it doesn't crash. Am I missing something?

I think you're right. I suspect @qwopqwop200 focuses mostly on perplexity evaluation to confirm correctness. test_kernel.py is to make sure nothing crashes and to compare performance with or without cuda.

I failed to force deterministic values, and I only computed absolute errors.

test_kernel uses random noise as weights and input so the output shouldn't be deterministic, AFAIK.

Either way, I'm not an expert but I believe, generally, fully deterministic results (to the decimal) are impossible on the GPU because the order of operations affects the floating point error. The GPU launches hundreds or thousands of jobs (threads, warps etc) and chooses how to schedule them. Combine that with dynamic throttling and there's no way to tell what order they'll complete and accumulate to the output elements. (You could add code to synchronise it at a performance cost, which might be worthwhile if you're really in the weeds examining precision problems.)
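
You can see the order-of-accumulation effect even on the CPU with a quick sketch like this (plain PyTorch, nothing GPU specific):

import torch

# Summing the same FP32 values in different orders gives slightly different
# results because floating point addition isn't associative; GPU scheduling
# simply picks a different order every run.
torch.manual_seed(0)
x = torch.randn(1_000_000, dtype=torch.float32)

forward = x.sum()
reverse = x.flip(0).sum()
chunked = torch.stack([c.sum() for c in x.chunk(64)]).sum()

print(forward.item(), reverse.item(), chunked.item())
# The three values typically differ in the last few digits.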

I found a discrepancy that is usually under 1e-5 and sometimes a little bit over. The faster version was much worse.

I don't know on the old-cuda branch because test_kernel doesn't seem to work there for me. I also don't have a Radeon card. But on my 4090 on the cuda branch, these are the errors I get for the 4bit kernel, comparing the output of layer vs qlayer.

# Run 1
Mean Absolute Error (MAE): 1.190901e-04
Root Mean Square Error (RMSE): 2.441406e-04
# Run 2
Mean Absolute Error (MAE): 1.186728e-04
Root Mean Square Error (RMSE): 2.441406e-04

So it looks like if your error is 1e-5 you're actually doing quite well? Maybe I misunderstood what you're measuring.

I used this code:

import torch
import torch.nn.functional as F

with torch.no_grad():
    full_model_output = layer(vec)
    quant_model_output = qlayer(vec)

    print('Full model output:', full_model_output)
    print('Quantized model output:', quant_model_output)

    # Compute Mean Absolute Error (MAE)
    mae = F.l1_loss(quant_model_output, full_model_output)
    print('Mean Absolute Error (MAE):', mae)

    # Compute Root Mean Square Error (RMSE)
    mse = F.mse_loss(quant_model_output, full_model_output)
    rmse = torch.sqrt(mse)
    print('Root Mean Square Error (RMSE):', rmse)

Incidentally, I believe RMSE is a better way to characterise the error here. Since neural networks have activation functions and normalisation, small amounts of absolute error have little impact (in theory).
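
A toy example of the difference (nothing from the repo, just to show the effect): two error vectors with the same MAE, where only the RMSE flags the outlier.

import torch

# Same mean absolute error, but the single big miss in `b` only shows up
# in the RMSE, which squares the errors before averaging.
a = torch.tensor([0.1, 0.1, 0.1, 0.1])
b = torch.tensor([0.0, 0.0, 0.0, 0.4])

for name, err in (('a', a), ('b', b)):
    mae = err.abs().mean().item()
    rmse = err.pow(2).mean().sqrt().item()
    print(f'{name}: MAE={mae:.3f} RMSE={rmse:.3f}')
# a: MAE=0.100 RMSE=0.100
# b: MAE=0.100 RMSE=0.200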

set-soft commented 1 year ago

So I'm not using CUDA, and from what I see test_kernel.py isn't really "Verifiying kernel correctness" (as its output claims), just checking that it doesn't crash. Am I missing something?

I think you're right. I suspect @qwopqwop200 focuses mostly on perplexity evaluation to confirm correctness.

I understand that the perplexity check is the ultimate way to test it. The problem is that I can't even load the 7B models without quantization: they need 7*4 = 28 GB when using FP32, and I just have 16 GB of main memory and 8 GB of VRAM. It could also take days to finish on a modest GPU (I think my GPU is about 1/40 of an RTX 4090's speed). It is also a good idea to check the individual components: I found that some of the last commits in the old-cuda branch introduce a lot of error even when the kernels remain the same, so those inaccuracies are introduced in the Python code. Knowing the source of the inaccuracies helps to optimize the code. My goal is to know if the ROCm port is working as expected; I leave the rest to the maintainers ;-)
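
Just to spell out the back-of-the-envelope numbers for the weights alone:

# Rough weight-only footprint of a 7B-parameter model (activations, context
# cache and the GPTQ scales/zero-points come on top of this).
params = 7e9
for name, bits in (('FP32', 32), ('FP16', 16), ('4-bit', 4)):
    gb = params * bits / 8 / 1e9
    print(f'{name}: ~{gb:.0f} GB')
# FP32: ~28 GB, FP16: ~14 GB, 4-bit: ~4 GB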

test_kernel.py is to make sure nothing crashes and to compare performance with or without cuda.

Thanks for confirming it.

I failed to force deterministic values, and I only computed absolute errors.

test_kernel uses random noise as weights and input so the output shouldn't be deterministic, AFAIK.

Isn't it generated using random? (i.e. a pseudo-random generator)
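
Seeding the generators should at least make the random weights/inputs repeatable from run to run, something like this (a sketch; the GPU accumulation order can of course still differ):

import torch

# Seed the CPU and GPU generators so the random data is the same every run
# (ROCm builds expose the GPU through torch.cuda as well).
torch.manual_seed(0)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(0)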

Either way, I'm not an expert but I believe, generally, fully deterministic results (to the decimal) are impossible on the GPU because the order of operations affects the floating point error. The GPU launches hundreds or thousands of jobs (threads, warps etc) and chooses how to schedule them. Combine that with dynamic throttling and there's no way to tell what order they'll complete and accumulate to the output elements. (You could add code to synchronise it at a performance cost, which might be worthwhile if you're really in the weeds examining precision problems.)

Ugh! I didn't know that. But given the small size of the kernels, I guess these errors should be much smaller than the ones introduced by the approximation.

I found a discrepancy that is usually under 1e-5 and sometimes a little bit over. The faster version was much worse.

I don't know on the old-cuda branch because test_kernel doesn't seem to work there for me. I also don't have a Radeon card. But on my 4090 on the cuda branch, these are the errors I get for the 4bit kernel, comparing the output of layer vs qlayer.

# Run 1
Mean Absolute Error (MAE): 1.190901e-04
Root Mean Square Error (RMSE): 2.441406e-04
# Run 2
Mean Absolute Error (MAE): 1.186728e-04
Root Mean Square Error (RMSE): 2.441406e-04

So it looks like if your error is 1e-5 you're actually doing quite well? Maybe I misunderstood what you're measuring.

I'll check using the same metrics you used. Thanks for the information! It gives me a known reference.

I used this code:

import torch
import torch.nn.functional as F

with torch.no_grad():
    full_model_output = layer(vec)
    quant_model_output = qlayer(vec)

    print('Full model output:', full_model_output)
    print('Quantized model output:', quant_model_output)

    # Compute Mean Absolute Error (MAE)
    mae = F.l1_loss(quant_model_output, full_model_output)
    print('Mean Absolute Error (MAE):', mae)

    # Compute Root Mean Square Error (RMSE)
    mse = F.mse_loss(quant_model_output, full_model_output)
    rmse = torch.sqrt(mse)
    print('Root Mean Square Error (RMSE):', rmse)

Incidentally, I believe RMSE is a better way to characterise the error here. Since neural networks have activation functions and normalisation, small amounts of absolute error have little impact (in theory).

Most probably; I think I saw some comments about using RMSE.

set-soft commented 1 year ago

I added MAE and RMSE metrics, here is what I get for the different cases:

$ python test_kernel.py 
Benchmarking LLaMa-7B FC2 matvec ...
FP32: 0.0013859221935272217
2bit FP32:  7.39% 0.001283557891845703
2bit FP16: 24.53% 0.0010459866523742675
3bit FP32: 32.25% 0.0009390249252319336
3bit FP16: 32.67% 0.0009330761432647705
4bit FP32: 34.36% 0.0009096791744232178
4bit FP16: 40.41% 0.0008259179592132568
8bit FP32: 48.60% 0.0007123064994812012
Verifiying kernel correctness ...
2 bits:
- Normal version: max abs 6.91e-06, Mean Absolute Error (MAE): 3.74e-07 Root Mean Square Error (RMSE): 5.27e-07
- Faster version: max abs 2.05e-03, Mean Absolute Error (MAE): 2.21e-04 Root Mean Square Error (RMSE): 2.81e-04
3 bits:
- Normal version: max abs 7.99e-06, Mean Absolute Error (MAE): 4.46e-07 Root Mean Square Error (RMSE): 6.29e-07
- Faster version: max abs 3.52e-03, Mean Absolute Error (MAE): 3.26e-04 Root Mean Square Error (RMSE): 4.17e-04
4 bits:
- Normal version: max abs 8.11e-06, Mean Absolute Error (MAE): 4.69e-07 Root Mean Square Error (RMSE): 6.61e-07
- Faster version: max abs 2.90e-03, Mean Absolute Error (MAE): 2.57e-04 Root Mean Square Error (RMSE): 3.43e-04
8 bits:
- Normal version: max abs 9.42e-06, Mean Absolute Error (MAE): 4.86e-07 Root Mean Square Error (RMSE): 6.84e-07

I'm using the old-cuda code, but not the latest commit: I'm using this repo, which is basically a fork of this repo. When I tried the latest commits from old-cuda, something didn't work in Ooba Gooba, which is why I kept the changes WapaMario used.

Now the values: the first part is the timing; I added the speed gain relative to FP32. In the listing, "Nbit FP32" is the "normal" implementation and "Nbit FP16" is the "faster" one. Then come the runs measuring the errors. As you can see, the "normal" kernels are much more accurate than the "faster" kernels. The RMSE for 2 bits is similar to what you got; 3 and 4 bits are a little bit worse.

I uploaded binary wheels here; they just contain the dynamic lib with the bindings.

I'll see if I can get a traceback of what is failing with the last commits from old-cuda.

P.S. PyTorch 1.13.1 is the last version I can get working on my Radeon board. All the 2.x versions I tried generate memory faults. I think WapaMario also had this problem.

aljungberg commented 1 year ago

Right, to clarify, I was testing in FP16. So since "faster" is your FP16 run, your MAE is about 200% of mine and your RMSE about 150%.
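
(Rough math, comparing your 4-bit "faster" numbers against my 4-bit run:)

# set-soft's 4-bit "faster" kernel vs my 4-bit run on the 4090
mae_ratio = 2.57e-04 / 1.19e-04    # ~2.2x, i.e. roughly 200%
rmse_ratio = 3.43e-04 / 2.44e-04   # ~1.4x, i.e. roughly 150%
print(f'MAE ratio:  {mae_ratio:.2f}x')
print(f'RMSE ratio: {rmse_ratio:.2f}x')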

I understand that the perplexity check is the ultimate way to test it. The problem is that I can't even load the 7B models without quantization: they need 7*4 = 28 GB when using FP32, and I just have 16 GB of main memory and 8 GB of VRAM. It could also take days to finish on a modest GPU (I think my GPU is about 1/40 of an RTX 4090's speed). It is also a good idea to check the individual components

I don't disagree but it sounds like you have successfully tested this component at least? Your error is high but at least in the right ballpark. Maybe it's the best you can get if FP16 support is poor on your HW.

You might be able to use mixed precision (e.g. when doing the matmuls, upcast everything into FP32 before multiplying, accumulate into FP32, and only downcast back to FP16 when saving the output). This should have no impact on VRAM usage since you're working in on-chip memory, yet it should vastly improve precision. It may not hurt performance either, because we tend to be memory-bandwidth limited, reading weights from global memory into shared memory. Although your card may not be able to compute and transfer data at the same time, in which case you'll see a performance hit.
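
At the PyTorch level the idea looks roughly like this (just a sketch of the concept, not the kernel code; on hardware whose FP16 matmul already accumulates in FP32 internally the two errors may be close):

import torch

def matmul_fp32_accum(a_fp16, b_fp16):
    # Upcast, multiply/accumulate in FP32, downcast only the final result.
    # Inside a kernel you'd do the same per tile, so VRAM usage is unchanged.
    return (a_fp16.float() @ b_fp16.float()).half()

if torch.cuda.is_available():  # ROCm builds also show up as torch.cuda
    torch.manual_seed(0)
    a = torch.randn(128, 4096, dtype=torch.float16, device='cuda')
    b = torch.randn(4096, 4096, dtype=torch.float16, device='cuda')
    ref = a.double() @ b.double()  # FP64 reference
    err_plain = ((a @ b).double() - ref).abs().max().item()
    err_mixed = (matmul_fp32_accum(a, b).double() - ref).abs().max().item()
    print('plain FP16 matmul, max abs error :', err_plain)
    print('FP32 accumulation, max abs error :', err_mixed)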

Anyhow, check your PPL! You should be able to get the quantised model running on 8 GB of VRAM, and you can do the reference run for your perplexity numbers on the CPU (although you don't have enough main memory either, so it's going to be slow as hell, but hey, you only need to do it once). Or just look up the PPL numbers from the README.
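
For reference, perplexity is just the exponential of the mean next-token cross-entropy, so even a standalone check along these lines (a generic sketch for an HF-style causal LM, not this repo's own evaluation script) will tell you whether the quantised model is sane:

import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, input_ids):
    # `model` is a causal LM returning logits of shape (batch, seq, vocab),
    # `input_ids` a (1, seq) LongTensor of token ids -- both placeholders here.
    logits = model(input_ids).logits
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = input_ids[:, 1:].reshape(-1)
    nll = F.cross_entropy(shift_logits.float(), shift_labels)
    return torch.exp(nll).item()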