Artyom17 opened 2 months ago
@HDCharles ?
Actually, I think I was the one who added this haha. For things like int8 quantization, you often don't want to materialize your entire model onto GPU before doing the quantization.
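To make the point concrete, here is a minimal pure-Python sketch of symmetric per-channel int8 quantization: the math runs entirely on CPU, so there is no need to materialize the full-precision model on GPU first. The function names are illustrative, not gpt-fast's actual API.

```python
# Minimal sketch of symmetric per-channel int8 quantization (illustrative,
# not the gpt-fast implementation). Each output channel (row) gets its own
# scale so that the row's max magnitude maps to 127.

def quantize_per_channel_int8(weight):
    """weight: list of float rows. Returns (int rows in [-127, 127], scales)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row) or 1.0
        scale = max_abs / 127.0          # maps [-max_abs, max_abs] -> [-127, 127]
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_per_channel_int8(q_rows, scales):
    """Inverse mapping: reconstruct approximate floats from ints and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Each dequantized value lands within one quantization step (one scale unit) of the original, which is why the CPU-computed scales and quantized weights are, in principle, device-independent data.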
The issue is that if I quantize the CPU version, it doesn't work on GPU later. Not sure why, but that's what I got on an H100: only the GPU-quantized version works. Either way, it is a bug: if you want to quantize on CPU by default, I think it would be better to set the default of the --device parameter to 'cpu'.
This is probably related to packing; there is currently a silent numerical error if we use the packed weight on CPU vs. CUDA:
```
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cuda"), weight_int4pack.to("cuda"), scales_and_zeros.to("cuda"), out_features, self.groupsize)[:3, :3]
tensor([[-0.0048, -0.0957, -0.0757],
        [ 0.0243, -0.0211, -0.0081],
        [ 0.0194, -0.0398, -0.0081]], device='cuda:0', dtype=torch.bfloat16)
(Pdb) linear_forward_int4(torch.eye(4096, 4096, dtype=torch.bfloat16, device="cpu"), weight_int4pack.to("cpu"), scales_and_zeros.to("cpu"), out_features, self.groupsize)[:3, :3]
tensor([[-4.8218e-03,  1.6235e-02,  1.9043e-02],
        [-1.4526e-02, -2.1118e-02, -8.0566e-03],
        [ 3.0518e-05, -2.4414e-03,  5.4932e-03]], dtype=torch.bfloat16)
```
cc @HDCharles
Int4 quantization requires a CUDA device; however, in the current implementation the --device parameter is unconditionally overridden with 'cpu'.
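A sketch of the fix suggested above: respect the --device flag instead of hard-coding 'cpu', and fail fast when int4 is requested on a non-CUDA device. The flag and function names here are illustrative, not gpt-fast's actual CLI.

```python
# Hypothetical sketch only: honor --device rather than overriding it,
# and reject int4 off-CUDA early, since the int4 kernels are CUDA-only.
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", default="int8", choices=["int8", "int4"])
    parser.add_argument("--device", default="cpu")  # CPU default, as proposed
    return parser

def check_device(mode, device):
    # Guard: int4 packing/kernels currently require CUDA, so fail fast
    # instead of silently producing wrong numbers on CPU.
    if mode == "int4" and not device.startswith("cuda"):
        raise ValueError("int4 quantization requires a CUDA device")
```

With this shape, `--device cpu` stays the default for int8, while `--mode int4 --device cpu` errors out immediately rather than producing a silently broken checkpoint.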