`quant_model` is not defined, so I'm not sure what you mean there, but this might be a known issue: https://github.com/pytorch/ao/issues/1117, which we are fixing in https://github.com/pytorch/ao/pull/1278 and will land soon.
@jerryzh168 Sorry, I changed a few things during testing and left out the line where the model was quantized on the CPU, but basically the model's output when quantized on the CPU vs. on the GPU is significantly different. I don't think it's related to #1117, since the difference is the same when executing the CPU-quantized model on the CPU itself, rather than quantizing on the CPU and executing the model on the GPU.
How do you get the `cpu_quant_model`? `int4_weight_only` only works on CUDA IIRC.
`quantize_(cpu_quant_model, int4_weight_only())` runs fine without any errors or warnings on the CPU. You can run the code snippet I provided above and it should show the difference. `cpu_quant_model` was always on the CPU and was never moved to the GPU.
Quantization on GPU works as expected with very small errors, but on CPU there seems to be a problem with the quantized model's output. Here is the code to replicate the problem.
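(The original snippet isn't quoted in this thread; a minimal sketch of the kind of CPU-vs-GPU comparison being described might look like the following. The toy model, shapes, dtype, and printed error metric are illustrative assumptions, not taken from the report.)

```python
import copy
import torch
from torchao.quantization import quantize_, int4_weight_only

# Toy stand-in for the model in the report; int4_weight_only expects bf16 weights.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(torch.bfloat16)

x = torch.randn(8, 1024, dtype=torch.bfloat16)

# Quantize one deep copy on the CPU and another on the GPU.
cpu_quant_model = copy.deepcopy(model)
quantize_(cpu_quant_model, int4_weight_only())  # runs without errors on CPU

gpu_quant_model = copy.deepcopy(model).cuda()
quantize_(gpu_quant_model, int4_weight_only())

with torch.no_grad():
    ref = model.cuda()(x.cuda())        # unquantized reference output
    out_gpu = gpu_quant_model(x.cuda()) # GPU-quantized output
    out_cpu = cpu_quant_model(x)        # CPU-quantized output, executed on CPU

# Per the report: the GPU path stays close to the reference,
# while the CPU path diverges significantly.
print("gpu quant max err:", (ref - out_gpu).abs().max().item())
print("cpu quant max err:", (ref.cpu() - out_cpu).abs().max().item())
```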