I am trying to quantize llama-2-7b-chat-hf with gpt-fast using:
python quantize.py --mode int4 --groupsize 32
on Kaggle, using the T4 x2 GPU accelerator.
I installed a PyTorch nightly build using:
pip install torch==2.3.0.dev20240117+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
I also tried changing the dtype from torch.bfloat16 to torch.float32, but got the same error.
This is the output and error message I get:
Loading model ...
Quantizing model weights for int4 weight-only affine per-channel groupwise quantization
linear: layers.0.attention.wqkv, in=4096, out=12288
linear: layers.0.attention.wo, in=4096, out=4096
Traceback (most recent call last):
  File "/kaggle/working/quantize.py", line 605, in
    quantize(args.checkpoint_path, args.mode, args.groupsize, args.calibration_tasks, args.calibration_limit, args.calibration_seq_length, args.pad_calibration_inputs, args.percdamp, args.blocksize, args.label)
  File "/kaggle/working/quantize.py", line 552, in quantize
    quantized_state_dict = quant_handler.create_quantized_state_dict()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/kaggle/working/quantize.py", line 416, in create_quantized_state_dict
    weight_int4pack, scales_and_zeros = prepare_int4_weight_and_scales_and_zeros(
  File "/kaggle/working/quantize.py", line 348, in prepare_int4_weight_and_scales_and_zeros
    weight_int32, scales_and_zeros = group_quantize_tensor(
  File "/kaggle/working/quantize.py", line 131, in group_quantize_tensor
    scales, zeros = get_group_qparams(w, n_bit, groupsize)
  File "/kaggle/working/quantize.py", line 66, in get_group_qparams
    assert torch.isnan(to_quant).sum() == 0
RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
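Following the hint in the message, one way to rerun the same command with CUDA_LAUNCH_BLOCKING=1 is sketched below, so that the traceback should point at the kernel launch that actually failed. This is only a minimal sketch: the helper names (build_debug_env, build_quantize_command, run_quantize_synchronously) are made up for illustration and are not part of gpt-fast, and it assumes quantize.py is in the current working directory.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # With this set, each CUDA kernel launch blocks until it finishes, so an
    # error is reported at the call that caused it rather than at a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_quantize_command(mode="int4", groupsize=32):
    """Build the same quantize.py invocation used in this post."""
    return [sys.executable, "quantize.py", "--mode", mode, "--groupsize", str(groupsize)]


def run_quantize_synchronously(mode="int4", groupsize=32):
    """Run quantize.py with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    result = subprocess.run(
        build_quantize_command(mode, groupsize),
        env=build_debug_env(),
        check=False,  # the run is expected to fail; we only want the traceback
    )
    return result.returncode
```

Calling run_quantize_synchronously() from the directory that contains quantize.py is equivalent to running CUDA_LAUNCH_BLOCKING=1 python quantize.py --mode int4 --groupsize 32 in a shell. It does not fix the error; it only makes the reported stack trace more trustworthy.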