mit-han-lab / llm-awq

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MIT License

Illegal memory access for Llama-3-70B #200

Open pprp opened 3 months ago

pprp commented 3 months ago


I find that TinyChat cannot run Llama-3-70B; the demo crashes with the traceback below.

USER: test                                                                                       
ASSISTANT: Traceback (most recent call last):                                                    
  File "/home/user/workspace/llm-awq/tinychat/demo.py", line 231, in <module>                 
    outputs = stream_output(output_stream)
  File "/home/user/workspace/llm-awq/tinychat/demo.py", line 53, in stream_output
    for outputs in output_stream:
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
  File "/home/user/workspace/llm-awq/tinychat/stream_generators/stream_gen.py", line 91, in StreamGenerator
    out = model(inputs, start_pos=start_pos)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/workspace/llm-awq/tinychat/models/llama.py", line 332, in forward
    h = self.model(tokens, start_pos, inputs_embeds)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py
", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py
", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/workspace/llm-awq/tinychat/models/llama.py", line 316, in forward
    h = layer(h, start_pos, freqs_cis, mask)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py
", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py
", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/workspace/llm-awq/tinychat/models/llama.py", line 263, in forward
    h = x + self.self_attn.forward(
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
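
Note that CUDA errors are reported asynchronously, so the line flagged above may not be the real culprit. Forcing synchronous kernel launches usually makes the failing point more precise; this is only a generic debugging step, and the demo arguments are placeholders for whatever is normally passed:

# Force synchronous kernel launches so the Python traceback points at the
# kernel that actually faulted (debug only; noticeably slower).
CUDA_LAUNCH_BLOCKING=1 python tinychat/demo.py <same arguments as the failing run>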

However, I can evaluate the quantized model (wikitext perplexity) using the following script:

MODEL=llama-3-70b-instruct
# load and evaluate the real quantized model (smaller gpu memory usage)
python -m awq.entry --model_path /aidata/hf_download/$MODEL \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/$MODEL-w4-g128-awq-v2.pt

and it works well.
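
Since the real-quantized checkpoint evaluates fine through awq.entry, the crash looks specific to the TinyChat inference path. In case it is useful, running the demo under NVIDIA's compute-sanitizer should report which kernel performs the out-of-bounds access; this is only a suggested diagnostic, and the demo arguments are again placeholders:

# memcheck (compute-sanitizer's default tool) reports the faulting kernel
# and address; expect a large slowdown, so a short prompt is enough.
compute-sanitizer --tool memcheck python tinychat/demo.py <same arguments as the failing run>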