pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.

Fixing quantize in int4 mode #159

Open Artyom17 opened 2 months ago

Artyom17 commented 2 months ago

Int4 quantization requires a CUDA device; however, in the current implementation the --device param is unconditionally overridden with 'cpu'.
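
A minimal sketch of the kind of fix being described, assuming quantize.py exposes --device through argparse; the flag handling below is illustrative, not the repo's actual code:

```python
# Hypothetical sketch: respect the user-supplied --device instead of
# unconditionally overriding it with 'cpu'. Names are illustrative.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--device", type=str, default="cpu",
                    help="device to run quantization on")
parser.add_argument("--mode", type=str, default="int8")
args = parser.parse_args()

# Before (the bug): device = 'cpu'  # clobbered --device for every mode
# After: keep the CPU default only where it is safe, and fail loudly for
# int4, which needs the CUDA packing kernel.
device = args.device
if args.mode == "int4" and not device.startswith("cuda"):
    raise ValueError("int4 quantization requires a CUDA device; pass --device cuda")
```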

Artyom17 commented 2 months ago

@HDCharles ?

Chillee commented 2 months ago

Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
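
For context, the CPU-first pattern described here looks roughly like the sketch below; `quantize_int8_per_channel` and the checkpoint path are hypothetical, not gpt-fast's actual code:

```python
# Sketch: quantize on CPU so the full bf16/fp32 model never has to fit in
# GPU memory; only the ~4x smaller int8 weights are moved to the GPU.
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    # Symmetric per-output-channel int8 quantization (illustrative only).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

state_dict = torch.load("checkpoint.pth", map_location="cpu")  # host RAM only
quantized = {}
for name, w in state_dict.items():
    if w.dim() == 2:  # linear layer weights
        q, scale = quantize_int8_per_channel(w)
        quantized[name] = (q.to("cuda"), scale.to("cuda"))
```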

Artyom17 commented 2 months ago

> Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

The issue is that if I quantize the CPU version, it doesn't really work on GPU later. Not sure why, but that's what I got on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default of the --device parameter to CPU.

jerryzh168 commented 2 months ago

> Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.

> The issue is that if I quantize the CPU version, it doesn't really work on GPU later. Not sure why, but that's what I got on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default of the --device parameter to CPU.

This is probably related to packing; there is a silent numerical error right now if we use the packed weight on CPU vs. CUDA:

```
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3,:3]
tensor([[-0.0048, -0.0957, -0.0757],
        [ 0.0243, -0.0211, -0.0081],
        [ 0.0194, -0.0398, -0.0081]], device='cuda:0', dtype=torch.bfloat16)
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cpu"), weight_int4pack.to("cpu"), scales_and_zeros.to("cpu"), out_features, self.groupsize)[:3,:3]
tensor([[-4.8218e-03,  1.6235e-02,  1.9043e-02],
        [-1.4526e-02, -2.1118e-02, -8.0566e-03],
        [ 3.0518e-05, -2.4414e-03,  5.4932e-03]], dtype=torch.bfloat16)
```
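
Multiplying by torch.eye makes linear_forward_int4 return the effective (dequantized) weight rows, which is what exposes the mismatch above: the same packed tensor decodes to different weights on each device. If the CPU and CUDA int4 kernels indeed expect different packed layouts, a safe pattern would be to repack from the unpacked quantized weight on the device that will consume it, rather than calling .to() on the packed tensor. A minimal sketch, assuming the `_convert_weight_to_int4pack` op as gpt-fast calls it (int32 input, inner_k_tiles argument, 8 being a common default):

```python
# Sketch: moving a packed int4 weight across devices silently corrupts it
# if packing is layout-specific. Repack on the target device instead.
import torch

def pack_int4_for(device: str, weight_int32: torch.Tensor, inner_k_tiles: int = 8):
    # Pack on the device whose kernel will consume the weight.
    return torch.ops.aten._convert_weight_to_int4pack(
        weight_int32.to(device), inner_k_tiles
    )

# Unsafe: weight_int4pack_cpu.to("cuda")   # wrong layout for the CUDA kernel
# Safe:   weight_int4pack_cuda = pack_int4_for("cuda", weight_int32)
#         weight_int4pack_cpu  = pack_int4_for("cpu",  weight_int32)
```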

cc @HDCharles