johnsmith0031 / alpaca_lora_4bit


Scrambled output on latest kernel/commits. #94

Open Ph0rk0z opened 1 year ago

Ph0rk0z commented 1 year ago

I had to change the atomicAdd overload for it to compile:

__device__ __forceinline__ void atomicAdd(__half* address, c10::Half val)
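For context, a minimal sketch of what that overload might look like when it just forwards to the built-in half-precision atomicAdd (this is my reading of the change, not a verbatim diff from the repo; the built-in overload only exists on compute capability 7.0 and up, hence the guard):

#include <cuda_fp16.h>
#include <c10/util/Half.h>

// Sketch only: accept Torch's c10::Half and forward to the native
// atomicAdd(__half*, __half), which exists for sm_70 and newer.
// Without the guard, older architectures would have no native overload
// to dispatch to, so the call could silently resolve back to this
// function through the c10::Half conversion and recurse.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
__device__ __forceinline__ void atomicAdd(__half* address, c10::Half val)
{
    atomicAdd(address, static_cast<__half>(val));
}
#endif

Below sm_70 a software fallback is needed instead (see the sketch further down).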

I updated quant_cuda_kernel.cu and matmul_utils_4bit.py... now I get scrambled output. I think the problem is there; updating everything has the same result.

(screenshot: scrambled generation output)

Ph0rk0z commented 1 year ago

Also, the previous kernel fix breaks compatibility with CUDA GPTQ:

 File "/home/mint/gptq2/text-generation-webui-testing/repositories/GPTQ-Merged/src/gptq_llama/quant.py", line 428, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: torch.Tensor) -> None

On a positive note, that kernel fix gave a slight speed bump to mlp attention.

johnsmith0031 commented 1 year ago

I think it still does not support compute capability < 7.0 (cuda arch < 700), which has no native support for half atomicAdd; even if it compiles, it still does not work. Also, I removed groupsize support for the v2 model from quant_cuda.vecquant4matmul, since it can be completely replaced by g_idx. I think you'd better use the older version of the kernel...
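For anyone hitting this on Pascal-class cards (sm_60/61, e.g. the P40 mentioned below), the usual workaround on architectures below sm_70 is a compare-and-swap loop on the 32-bit word containing the half. A rough sketch of that standard pattern (the function name and exact form are illustrative, not code from this repo):

#include <cuda_fp16.h>

// Generic CAS-based fallback for half-precision atomicAdd below sm_70,
// where atomicAdd(__half*, __half) is not available. The half value
// lives inside a 4-byte-aligned word; we read-modify-write that word.
__device__ __forceinline__ void atomic_add_half_fallback(__half* address, __half val)
{
    unsigned int* base = (unsigned int*)((char*)address - ((size_t)address & 2));
    bool high_half = ((size_t)address & 2) != 0;
    unsigned int old = *base;
    unsigned int assumed;

    do {
        assumed = old;
        __half_raw raw;
        raw.x = high_half ? (unsigned short)(old >> 16)
                          : (unsigned short)(old & 0xffffu);
        // Do the addition in float so this also works below sm_53,
        // where half arithmetic intrinsics are unavailable.
        float sum = __half2float(__half(raw)) + __half2float(val);
        raw = __half_raw(__float2half(sum));
        unsigned int updated = high_half
            ? ((old & 0x0000ffffu) | ((unsigned int)raw.x << 16))
            : ((old & 0xffff0000u) | raw.x);
        old = atomicCAS(base, assumed, updated);
    } while (assumed != old);
}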

Ph0rk0z commented 1 year ago

I am, but it used to work this way for the V1. Next week I get this + 3090 + P40. I can try to train something (finally use the repo for its actual purpose) and run 65b across the two cards.

The compatibility break might cause issues with textgen main when they replace their GPTQ and use the monkeypatch.

Ph0rk0z commented 1 year ago

Hey... you didn't update the GPTQ format to the latest along with this, did you? Because loading incompatible models also has this effect. I would have to test a v1 model to be sure.

johnsmith0031 commented 1 year ago

Yes, for using act-order you should set it as follows:

import matmul_utils_4bit
matmul_utils_4bit.faster = False

Ph0rk0z commented 1 year ago

Guess I should give this a try again... maybe Pascal is compatible and it was just my model. It's getting pretty confusing: V1, V2, V3, etc.

Ph0rk0z commented 1 year ago

I tried the kernel again without the changes to this repo.

Now I get a decent bump in generation with faster disabled. 13b: Output generated in 17.27 seconds (11.52 tokens/s, 199 tokens, context 116, seed 1404086455). So the changes to the matmul or autograd .py files in this repo are what caused scrambled output for me. The kernel is fine except that it breaks compatibility with GPTQ.

TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:

I guess for me I will have to add a vecquant4matmul_slower or something.
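If it helps, a rough sketch of what keeping both entry points in the extension's binding file could look like (the names vecquant4matmul_old and the *_cuda declarations are illustrative, not the repo's actual symbols):

#include <torch/extension.h>

// Illustrative declarations; the real kernel launchers live in the .cu file.
void vecquant4matmul_g_idx_cuda(torch::Tensor vec, torch::Tensor mat, torch::Tensor mul,
                                torch::Tensor scales, torch::Tensor zeros, torch::Tensor g_idx);
void vecquant4matmul_groupsize_cuda(torch::Tensor vec, torch::Tensor mat, torch::Tensor mul,
                                    torch::Tensor scales, torch::Tensor zeros, int groupsize);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
  // New-style entry point: act-order / g_idx variant (six tensors).
  m.def("vecquant4matmul", &vecquant4matmul_g_idx_cuda,
        "4-bit quantized matvec (g_idx variant)");
  // Old-style entry point kept so existing GPTQ call sites that pass an
  // integer groupsize as the last argument keep working.
  m.def("vecquant4matmul_old", &vecquant4matmul_groupsize_cuda,
        "4-bit quantized matvec (groupsize variant, compatibility)");
}

The old GPTQ call site would then only need to switch from quant_cuda.vecquant4matmul(...) to the compatibility name instead of changing its argument list.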