turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Qwen-110B quantize failed, RuntimeError: CUDA error: an illegal memory access was encountered #433

Closed buliaoyin closed 2 months ago

buliaoyin commented 2 months ago

I tried to quantize Qwen-110B-Chat on a server with 100 GB RAM and one L20 (48 GB) GPU. After five hours I got the error below:

```
-- model.layers.79.self_attn 5.1982 bpw - exp. error: 0.00168373
-- model.layers.79.mlp 2.9039 bpw - exp. error: 0.99999900
-- sum(log(err)): -537.800005
-- max(err): 0.999999
-- Tokenizing samples...
-- Token embeddings again...
-- Quantizing...
-- Layer: model.layers.0 (Attention)
-- Linear: model.layers.0.self_attn.q_proj -> 0.1:6b_32g/0.9:5b_32g s4, 5.23 bpw
-- Linear: model.layers.0.self_attn.k_proj -> 0.1:6b_32g/0.9:5b_32g s4, 5.26 bpw
-- Linear: model.layers.0.self_attn.v_proj -> 1:6b_32g s4, 6.16 bpw
-- Linear: model.layers.0.self_attn.o_proj -> 0.1:6b_32g/0.9:5b_32g s4, 5.23 bpw
-- Module quantized, rfn_error: 0.001032
-- Layer: model.layers.0 (MLP)
-- Linear: model.layers.0.mlp.gate_proj -> 1:4b_128g s4, 4.03 bpw
-- Linear: model.layers.0.mlp.up_proj -> 1:4b_32g s4, 4.13 bpw
-- Linear: model.layers.0.mlp.down_proj -> 0.05:8b_32g/0.95:4b_128g s4, 4.24 bpw
Traceback (most recent call last):
  File "/root/autodl-tmp/exllamav2-0.0.19/convert.py", line 268, in <module>
    quant(job, save_job, model)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/autodl-tmp/exllamav2-0.0.19/conversion/quantize.py", line 406, in quant
    quant_mlp(job, module, hidden_states, target_states, quantizers, attn_params, strat)
  File "/root/autodl-tmp/exllamav2-0.0.19/conversion/quantize.py", line 177, in quant_mlp
    quant_linear(job, module.down_proj, quantizers["down_proj"], strat["down_proj"])
  File "/root/autodl-tmp/exllamav2-0.0.19/conversion/quantize.py", line 63, in quant_linear
    lq.quantize(keep_qweight = True, apply = True)
  File "/root/autodl-tmp/exllamav2-0.0.19/conversion/adaptivegptq.py", line 493, in quantize
    quantizer.find_params(weights[a : b, :])
  File "/root/autodl-tmp/exllamav2-0.0.19/conversion/adaptivegptq.py", line 73, in find_params
    prescale = torch.tensor([1 / 256], dtype = torch.half, device = self.scale.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

I tried 0.0.20 and 0.0.19. Command line:

```
python convert.py -i /root/autodl-tmp/models/Qwen1.5-110B-Chat -o /root/autodl-tmp/models/tmp/ -cf /root/autodl-tmp/models/Qwen1.5-110B-Chat-3.3bpw-exl2/ -b 3.3
```
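The note at the end of the traceback suggests one concrete debugging step: rerunning with `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the Python traceback points at the call that actually faulted rather than a later, unrelated API call. A sketch reusing the same command and paths from the report (expect the run to be slower in this mode):

```shell
# Force synchronous CUDA kernel launches so errors are reported at the
# call that triggered them, not at some later allocation.
CUDA_LAUNCH_BLOCKING=1 python convert.py \
    -i /root/autodl-tmp/models/Qwen1.5-110B-Chat \
    -o /root/autodl-tmp/models/tmp/ \
    -cf /root/autodl-tmp/models/Qwen1.5-110B-Chat-3.3bpw-exl2/ \
    -b 3.3
```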

Was there something I missed? I saw you had uploaded some quantized Qwen-110B weights to HF.

buliaoyin commented 2 months ago

I got the same error when trying to use measurement.json as input; command and output below:

```
python convert.py -i /root/autodl-tmp/models/Qwen1.5-110B-Chat -o /root/autodl-tmp/models/tmp/ -nr -m /root/autodl-tmp/models/measurement.json -cf /root/autodl-tmp/models/Qwen1.5-110B-Chat-3.3bpw-exl2/ -b 3.3
```

```
/root/miniconda3/lib/python3.10/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
-- Beginning new job
-- Input: /root/autodl-tmp/models/Qwen1.5-110B-Chat
-- Output: /root/autodl-tmp/models/tmp/
-- Using default calibration dataset
-- Target bits per weight: 3.3 (decoder), 6 (head)
-- Max shard size: 8192 MB
-- Full model will be compiled to: /root/autodl-tmp/models/Qwen1.5-110B-Chat-3.3bpw-exl2/
-- Reusing measurement: /root/autodl-tmp/models/measurement.json
-- Optimizing...
-- Optimizing: 1/ 240
-- Optimizing: 10/ 240
-- Optimizing: 19/ 240
-- Optimizing: 28/ 240
-- Optimizing: 37/ 240
-- Optimizing: 46/ 240
-- Optimizing: 56/ 240
-- Optimizing: 65/ 240
-- Optimizing: 74/ 240
-- Optimizing: 80/ 240
-- Optimizing: 90/ 240
-- Optimizing: 99/ 240
-- Optimizing: 108/ 240
-- Optimizing: 117/ 240
-- Optimizing: 126/ 240
-- Optimizing: 136/ 240
-- Optimizing: 145/ 240
-- Optimizing: 154/ 240
-- Optimizing: 163/ 240
-- Optimizing: 173/ 240
-- Optimizing: 183/ 240
-- Optimizing: 192/ 240
-- Optimizing: 201/ 240
-- Optimizing: 210/ 240
-- Optimizing: 219/ 240
-- Optimizing: 229/ 240
-- Optimizing: 238/ 240
-- max(err): 0.021599
-- error_norm: 1.584494
-- Quantization strategy:
-- model.layers.0.self_attn 4.1334 bpw - exp. error: 0.00337239
-- model.layers.0.mlp 3.2718 bpw - exp. error: 0.01110011
-- model.layers.1.self_attn 3.1501 bpw - exp. error: 0.00209716
-- model.layers.1.mlp 3.3610 bpw - exp. error: 0.00472591
-- model.layers.2.self_attn 4.2848 bpw - exp. error: 0.00239126
-- model.layers.2.mlp 2.2355 bpw - exp. error: 0.00700947
-- model.layers.3.self_attn 4.2848 bpw - exp. error: 0.00209395
-- model.layers.3.mlp 2.2355 bpw - exp. error: 0.00872202
-- model.layers.4.self_attn 4.1501 bpw - exp. error: 0.00244899
-- model.layers.4.mlp 2.2355 bpw - exp. error: 0.01138319
-- model.layers.5.self_attn 4.0411 bpw - exp. error: 0.00317790
-- model.layers.5.mlp 2.3162 bpw - exp. error: 0.01387867
-- model.layers.6.self_attn 2.6605 bpw - exp. error: 0.00464182
-- model.layers.6.mlp 2.2355 bpw - exp. error: 0.01649428
-- model.layers.7.self_attn 4.1334 bpw - exp. error: 0.00293124
-- model.layers.7.mlp 2.5873 bpw - exp. error: 0.01421586
-- model.layers.8.self_attn 4.1501 bpw - exp. error: 0.00381284
-- model.layers.8.mlp 2.2355 bpw - exp. error: 0.01590392
-- model.layers.9.self_attn 4.0742 bpw - exp. error: 0.00415307
-- model.layers.9.mlp 2.5873 bpw - exp. error: 0.01194641
-- model.layers.10.self_attn 4.1334 bpw - exp. error: 0.00229423
-- model.layers.10.mlp 2.2355 bpw - exp. error: 0.00950497
-- model.layers.11.self_attn 2.6605 bpw - exp. error: 0.00537428
-- model.layers.11.mlp 2.3162 bpw - exp. error: 0.00957640
-- model.layers.12.self_attn 3.1501 bpw - exp. error: 0.00478647
-- model.layers.12.mlp 2.2355 bpw - exp. error: 0.01094840
-- model.layers.13.self_attn 4.1758 bpw - exp. error: 0.00235124
-- model.layers.13.mlp 2.2355 bpw - exp. error: 0.01219057
-- model.layers.14.self_attn 4.0742 bpw - exp. error: 0.00294156
-- model.layers.14.mlp 2.3162 bpw - exp. error: 0.01275339
-- model.layers.15.self_attn 4.1758 bpw - exp. error: 0.00252280
-- model.layers.15.mlp 2.2355 bpw - exp. error: 0.01434760
-- model.layers.16.self_attn 4.1758 bpw - exp. error: 0.00256146
-- model.layers.16.mlp 2.2355 bpw - exp. error: 0.01540449
-- model.layers.17.self_attn 4.1501 bpw - exp. error: 0.00311614
-- model.layers.17.mlp 2.3162 bpw - exp. error: 0.01692068
-- model.layers.18.self_attn 4.1758 bpw - exp. error: 0.00276434
-- model.layers.18.mlp 3.2718 bpw - exp. error: 0.00863046
-- model.layers.19.self_attn 4.2222 bpw - exp. error: 0.00251042
-- model.layers.19.mlp 2.2355 bpw - exp. error: 0.01903601
-- model.layers.20.self_attn 4.0394 bpw - exp. error: 0.00333480
-- model.layers.20.mlp 2.9039 bpw - exp. error: 0.01608464
-- model.layers.21.self_attn 4.2222 bpw - exp. error: 0.00251680
-- model.layers.21.mlp 3.2718 bpw - exp. error: 0.00932526
-- model.layers.22.self_attn 3.1501 bpw - exp. error: 0.00468102
-- model.layers.22.mlp 2.5873 bpw - exp. error: 0.01684832
-- model.layers.23.self_attn 4.1758 bpw - exp. error: 0.00201255
-- model.layers.23.mlp 2.2355 bpw - exp. error: 0.01969018
-- model.layers.24.self_attn 3.1487 bpw - exp. error: 0.00377518
-- model.layers.24.mlp 3.2718 bpw - exp. error: 0.00932186
-- model.layers.25.self_attn 3.1487 bpw - exp. error: 0.00376698
-- model.layers.25.mlp 2.5873 bpw - exp. error: 0.01633376
-- model.layers.26.self_attn 3.1487 bpw - exp. error: 0.00367504
-- model.layers.26.mlp 2.2355 bpw - exp. error: 0.01884940
-- model.layers.27.self_attn 4.2848 bpw - exp. error: 0.00116267
-- model.layers.27.mlp 2.2355 bpw - exp. error: 0.01877331
-- model.layers.28.self_attn 2.2265 bpw - exp. error: 0.00484873
-- model.layers.28.mlp 2.3162 bpw - exp. error: 0.01839219
-- model.layers.29.self_attn 4.1501 bpw - exp. error: 0.00157201
-- model.layers.29.mlp 2.3162 bpw - exp. error: 0.01838333
-- model.layers.30.self_attn 2.6605 bpw - exp. error: 0.00338455
-- model.layers.30.mlp 2.2355 bpw - exp. error: 0.01905772
-- model.layers.31.self_attn 2.6605 bpw - exp. error: 0.00377935
-- model.layers.31.mlp 2.5873 bpw - exp. error: 0.01689662
-- model.layers.32.self_attn 2.2265 bpw - exp. error: 0.00458009
-- model.layers.32.mlp 2.3162 bpw - exp. error: 0.01952240
-- model.layers.33.self_attn 4.2848 bpw - exp. error: 0.00116447
-- model.layers.33.mlp 2.3162 bpw - exp. error: 0.01990825
-- model.layers.34.self_attn 5.1982 bpw - exp. error: 0.00068019
-- model.layers.34.mlp 2.5873 bpw - exp. error: 0.01827445
-- model.layers.35.self_attn 2.1805 bpw - exp. error: 0.00599727
-- model.layers.35.mlp 2.2355 bpw - exp. error: 0.02159854
-- model.layers.36.self_attn 2.1254 bpw - exp. error: 0.00602214
-- model.layers.36.mlp 3.2718 bpw - exp. error: 0.01095048
-- model.layers.37.self_attn 2.6605 bpw - exp. error: 0.00404275
-- model.layers.37.mlp 3.3610 bpw - exp. error: 0.01040183
-- model.layers.38.self_attn 2.2265 bpw - exp. error: 0.00520107
-- model.layers.38.mlp 3.3610 bpw - exp. error: 0.01074564
-- model.layers.39.self_attn 3.1501 bpw - exp. error: 0.00361500
-- model.layers.39.mlp 3.2718 bpw - exp. error: 0.01205296
-- model.layers.40.self_attn 2.2265 bpw - exp. error: 0.00589020
-- model.layers.40.mlp 3.3610 bpw - exp. error: 0.01153003
-- model.layers.41.self_attn 3.1501 bpw - exp. error: 0.00387500
-- model.layers.41.mlp 3.2718 bpw - exp. error: 0.01298099
-- model.layers.42.self_attn 4.0394 bpw - exp. error: 0.00288964
-- model.layers.42.mlp 3.3610 bpw - exp. error: 0.01244317
-- model.layers.43.self_attn 2.1805 bpw - exp. error: 0.00854515
-- model.layers.43.mlp 3.3610 bpw - exp. error: 0.01284893
-- model.layers.44.self_attn 4.1758 bpw - exp. error: 0.00233257
-- model.layers.44.mlp 3.2718 bpw - exp. error: 0.01447445
-- model.layers.45.self_attn 4.1758 bpw - exp. error: 0.00270180
-- model.layers.45.mlp 3.3610 bpw - exp. error: 0.01381875
-- model.layers.46.self_attn 4.1501 bpw - exp. error: 0.00334884
-- model.layers.46.mlp 3.3610 bpw - exp. error: 0.01435570
-- model.layers.47.self_attn 4.1334 bpw - exp. error: 0.00374171
-- model.layers.47.mlp 3.2718 bpw - exp. error: 0.01620004
-- model.layers.48.self_attn 4.0742 bpw - exp. error: 0.00386939
-- model.layers.48.mlp 3.3610 bpw - exp. error: 0.01530019
-- model.layers.49.self_attn 4.0394 bpw - exp. error: 0.00458960
-- model.layers.49.mlp 3.3610 bpw - exp. error: 0.01567900
-- model.layers.50.self_attn 4.1758 bpw - exp. error: 0.00415799
-- model.layers.50.mlp 3.3610 bpw - exp. error: 0.01613761
-- model.layers.51.self_attn 4.1334 bpw - exp. error: 0.00496369
-- model.layers.51.mlp 3.2718 bpw - exp. error: 0.01785790
-- model.layers.52.self_attn 4.2848 bpw - exp. error: 0.00443452
-- model.layers.52.mlp 3.2718 bpw - exp. error: 0.01852214
-- model.layers.53.self_attn 4.0394 bpw - exp. error: 0.00579150
-- model.layers.53.mlp 3.3610 bpw - exp. error: 0.01730197
-- model.layers.54.self_attn 4.1334 bpw - exp. error: 0.00537363
-- model.layers.54.mlp 4.1327 bpw - exp. error: 0.00977645
-- model.layers.55.self_attn 4.1334 bpw - exp. error: 0.00569934
-- model.layers.55.mlp 3.3610 bpw - exp. error: 0.01808253
-- model.layers.56.self_attn 5.2848 bpw - exp. error: 0.00241577
-- model.layers.56.mlp 4.1937 bpw - exp. error: 0.00922209
-- model.layers.57.self_attn 5.2848 bpw - exp. error: 0.00241619
-- model.layers.57.mlp 3.3610 bpw - exp. error: 0.01864174
-- model.layers.58.self_attn 5.1982 bpw - exp. error: 0.00311745
-- model.layers.58.mlp 3.6145 bpw - exp. error: 0.01699709
-- model.layers.59.self_attn 5.2848 bpw - exp. error: 0.00255405
-- model.layers.59.mlp 4.1327 bpw - exp. error: 0.01043201
-- model.layers.60.self_attn 4.1501 bpw - exp. error: 0.00664028
-- model.layers.60.mlp 3.6145 bpw - exp. error: 0.01773141
-- model.layers.61.self_attn 5.1982 bpw - exp. error: 0.00336748
-- model.layers.61.mlp 4.1937 bpw - exp. error: 0.01028728
-- model.layers.62.self_attn 5.1982 bpw - exp. error: 0.00333582
-- model.layers.62.mlp 4.1327 bpw - exp. error: 0.01164173
-- model.layers.63.self_attn 6.0394 bpw - exp. error: 0.00222327
-- model.layers.63.mlp 4.1327 bpw - exp. error: 0.01228494
-- model.layers.64.self_attn 5.2848 bpw - exp. error: 0.00291456
-- model.layers.64.mlp 4.1327 bpw - exp. error: 0.01274496
-- model.layers.65.self_attn 6.2445 bpw - exp. error: 0.00153585
-- model.layers.65.mlp 4.1327 bpw - exp. error: 0.01361438
-- model.layers.66.self_attn 4.2848 bpw - exp. error: 0.00540327
-- model.layers.66.mlp 4.1937 bpw - exp. error: 0.01316815
-- model.layers.67.self_attn 4.2222 bpw - exp. error: 0.00641106
-- model.layers.67.mlp 4.1327 bpw - exp. error: 0.01542844
-- model.layers.68.self_attn 6.0394 bpw - exp. error: 0.00223345
-- model.layers.68.mlp 4.1937 bpw - exp. error: 0.01485634
-- model.layers.69.self_attn 5.2848 bpw - exp. error: 0.00302963
-- model.layers.69.mlp 4.1937 bpw - exp. error: 0.01543275
-- model.layers.70.self_attn 4.2848 bpw - exp. error: 0.00548195
-- model.layers.70.mlp 4.1937 bpw - exp. error: 0.01594568
-- model.layers.71.self_attn 4.2848 bpw - exp. error: 0.00561911
-- model.layers.71.mlp 4.1937 bpw - exp. error: 0.01631338
-- model.layers.72.self_attn 5.1982 bpw - exp. error: 0.00333445
-- model.layers.72.mlp 4.1937 bpw - exp. error: 0.01655168
-- model.layers.73.self_attn 4.2222 bpw - exp. error: 0.00560899
-- model.layers.73.mlp 4.3443 bpw - exp. error: 0.01630432
-- model.layers.74.self_attn 6.0394 bpw - exp. error: 0.00217065
-- model.layers.74.mlp 4.1937 bpw - exp. error: 0.01746052
-- model.layers.75.self_attn 4.2222 bpw - exp. error: 0.00629104
-- model.layers.75.mlp 4.3443 bpw - exp. error: 0.01736878
-- model.layers.76.self_attn 5.1982 bpw - exp. error: 0.00335729
-- model.layers.76.mlp 4.1937 bpw - exp. error: 0.01890172
-- model.layers.77.self_attn 4.1501 bpw - exp. error: 0.00705161
-- model.layers.77.mlp 5.2384 bpw - exp. error: 0.01174899
-- model.layers.78.self_attn 4.2222 bpw - exp. error: 0.00561177
-- model.layers.78.mlp 4.3443 bpw - exp. error: 0.02060733
-- model.layers.79.self_attn 4.0411 bpw - exp. error: 0.00386650
-- model.layers.79.mlp 4.3443 bpw - exp. error: 0.01537269
-- sum(log(err)): -795.321439
-- max(err): 0.021599
-- Tokenizing samples...
-- Token embeddings again...
-- Quantizing...
-- Layer: model.layers.0 (Attention)
-- Linear: model.layers.0.self_attn.q_proj -> 1:4b_32g s4, 4.13 bpw
-- Linear: model.layers.0.self_attn.k_proj -> 1:4b_32g s4, 4.16 bpw
-- Linear: model.layers.0.self_attn.v_proj -> 1:4b_32g s4, 4.16 bpw
-- Linear: model.layers.0.self_attn.o_proj -> 1:4b_32g s4, 4.13 bpw
-- Module quantized, rfn_error: 0.003063
-- Layer: model.layers.0 (MLP)
-- Linear: model.layers.0.mlp.gate_proj -> 0.1:4b_128g/0.9:3b_128g s4, 3.14 bpw
-- Linear: model.layers.0.mlp.up_proj -> 0.25:4b_128g/0.75:3b_128g s4, 3.28 bpw
-- Linear: model.layers.0.mlp.down_proj -> 0.05:8b_32g/0.1:4b_128g/0.85:3b_128g s4, 3.39 bpw
Traceback (most recent call last):
  File "/root/autodl-tmp/exllamav2-0.0.20/convert.py", line 268, in <module>
    quant(job, save_job, model)
  File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/autodl-tmp/exllamav2-0.0.20/conversion/quantize.py", line 406, in quant
    quant_mlp(job, module, hidden_states, target_states, quantizers, attn_params, strat)
  File "/root/autodl-tmp/exllamav2-0.0.20/conversion/quantize.py", line 177, in quant_mlp
    quant_linear(job, module.down_proj, quantizers["down_proj"], strat["down_proj"])
  File "/root/autodl-tmp/exllamav2-0.0.20/conversion/quantize.py", line 63, in quant_linear
    lq.quantize(keep_qweight = True, apply = True)
  File "/root/autodl-tmp/exllamav2-0.0.20/conversion/adaptivegptq.py", line 509, in quantize
    quantizer.find_params(weights[a : b, :])
  File "/root/autodl-tmp/exllamav2-0.0.20/conversion/adaptivegptq.py", line 73, in find_params
    prescale = torch.tensor([1 / 256], dtype = torch.half, device = self.scale.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Yiximail commented 2 months ago

Same situation as mine yesterday, and turboderp gave me a solution. https://github.com/turboderp/exllamav2/discussions/430#discussioncomment-9246834

buliaoyin commented 2 months ago

> Same situation as mine yesterday, and turboderp gave me a solution. #430 (comment)

Thanks for your comment, I'll try it. I hadn't noticed there was also a Discussions area.