I find that tinychat cannot support Llama-3-70B, as shown in the traceback below.
USER: test
ASSISTANT: Traceback (most recent call last):
File "/home/user/workspace/llm-awq/tinychat/demo.py", line 231, in <module>
outputs = stream_output(output_stream)
File "/home/user/workspace/llm-awq/tinychat/demo.py", line 53, in stream_output
for outputs in output_stream:
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
response = gen.send(request)
File "/home/user/workspace/llm-awq/tinychat/stream_generators/stream_gen.py", line 91, in StreamGenerator
out = model(inputs, start_pos=start_pos)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/workspace/llm-awq/tinychat/models/llama.py", line 332, in forward
h = self.model(tokens, start_pos, inputs_embeds)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py
", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py
", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/workspace/llm-awq/tinychat/models/llama.py", line 316, in forward
h = layer(h, start_pos, freqs_cis, mask)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py
", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/user/miniconda3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py
", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/workspace/llm-awq/tinychat/models/llama.py", line 263, in forward
h = x + self.self_attn.forward(
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
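For anyone trying to localize this: CUDA errors are reported asynchronously, so the stack trace above may not point at the kernel that actually faulted. A minimal debugging sketch follows, assuming the demo is rerun with the same arguments as before (`<your-demo-args>` is a placeholder, not a real flag); CUDA_LAUNCH_BLOCKING=1 is a standard PyTorch environment variable that forces synchronous kernel launches so the trace lands on the real failing call.

# Debugging sketch: rerun the demo with synchronous CUDA launches so the
# traceback points at the kernel that triggered the illegal memory access.
# <your-demo-args> is a placeholder for the original demo.py arguments.
cd /home/user/workspace/llm-awq
CUDA_LAUNCH_BLOCKING=1 python tinychat/demo.py <your-demo-args>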
However, I can evaluate the performance using the following script:
MODEL=llama-3-70b-instruct
# Load and evaluate the real quantized model (smaller GPU memory usage)
python -m awq.entry --model_path /aidata/hf_download/$MODEL \
--tasks wikitext \
--w_bit 4 --q_group_size 128 \
--load_quant quant_cache/$MODEL-w4-g128-awq-v2.pt
It works well.