Open dirkson opened 3 months ago
This check runs on NVIDIA systems as well, and doesn't generally cause an issue. But it looks like you have an executable called rocm-smi in your path that you don't have permission to run, which I hadn't accounted for.
I've changed it so that it should catch any exceptions from that check and you shouldn't have an issue. It's in the dev branch but a new release with this and a bunch more fixes is coming real soon.
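The guarded check described above could look something like the following sketch. This is illustrative only, not exllamav2's actual detection code; the function name and the decision to treat any failure as "no ROCm" are assumptions.

```python
import shutil
import subprocess

def rocm_present() -> bool:
    """Probe for rocm-smi, swallowing any failure (missing binary,
    permission denied, unreadable PATH entry) instead of crashing."""
    try:
        exe = shutil.which("rocm-smi")
        if exe is None:
            return False
        subprocess.run([exe], capture_output=True, check=True, timeout=5)
        return True
    except Exception:
        # Any error here just means "assume no ROCm"; startup continues.
        return False
```

The broad `except Exception` is deliberate here: the probe is purely advisory, so no failure mode of the external tool should be able to abort loading.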
After thinking about your response and the original error message a little, I discovered that I had actually added a directory my user doesn't have access to into PATH - that's the cause of the permission denied. The most recent dev commit still errors identically in that case, but does resolve correctly when the user running the software actually has permission to read every entry in their own PATH. Thanks/Sorry!
However, the ROCm problem is the far smaller of the two issues I mentioned. Here's a log of attempting to run inference with tensor parallelism but without flash attention on the most recent dev:
$ python examples/chat.py -m /home/llama/mod/exl2/magnum-v2-123b-exl2/ -nfa -mode codellama -c8 -l 4096 -tp
-- Model: /home/llama/mod/exl2/magnum-v2-123b-exl2/
-- Options: ['tensor_parallel', 'length: 4096', 'no_flash_attn']
-- Loading tokenizer...
-- Loading model...
-- Prompt format: codellama
-- System prompt:
You are a helpful coding assistant. Always answer as helpfully as possible.
User: Hi there! Please don't crash.
Traceback (most recent call last):
File "/home/llama/opt/exllamav2/examples/chat.py", line 295, in <module>
generator.begin_stream_ex(active_context, settings)
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 363, in begin_stream_ex
self._gen_begin_reuse(input_ids, gen_settings)
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 731, in _gen_begin_reuse
self._gen_begin(in_tokens, gen_settings)
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 692, in _gen_begin
self.model.forward(self.sequence_ids[:, :-1],
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/model.py", line 868, in forward
r = self.forward_chunk(
^^^^^^^^^^^^^^^^^^^
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/model.py", line 976, in forward_chunk
x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/attn.py", line 972, in forward
return self.forward_tp(
^^^^^^^^^^^^^^^^
File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/attn.py", line 1143, in forward_tp
ext_c.tp_attn_forward_(
ModuleNotFoundError: No module named 'flash_attn_2_cuda'
Oh, right, I guess this wasn't really communicated anywhere, I apologize for that. But the TP feature currently requires flash attn. :/
It's still a little experimental and unfinished, and I'll probably add SDPA support sometime soon, though it still won't work with the dynamic generator.
Right, I've pushed an update to the dev branch that should allow inference with TP mode even when flash-attn isn't available. It uses a slower code path for now, so it's hard to say if you'll see any speedup. Torch SDPA is very limited (it's not just paged attention that's missing but also GQA support), but I'll see about further improving performance down the line.
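The dispatch the update describes might be shaped roughly like this. The names (`HAS_FLASH_ATTN`, `attn_backend`, the backend strings) are hypothetical stand-ins, not exllamav2's real internals; only the import name `flash_attn_2_cuda` comes from the traceback above.

```python
# Prefer flash-attn when its CUDA extension imports cleanly; otherwise
# route TP inference through a slower fallback path instead of crashing.
try:
    import flash_attn_2_cuda  # the extension whose absence crashed TP mode
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

def attn_backend() -> str:
    if HAS_FLASH_ATTN:
        return "flash_attn"
    # Torch SDPA lacks paged attention (and, as noted above, GQA support
    # in this path), so the fallback is unpaged and slower.
    return "sdpa_fallback"
```

The key point is that the import failure is caught at module load, so the first forward pass no longer raises `ModuleNotFoundError`.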
No speedup, but it does function!
Running the following command on an 8x P100 machine: python test_inference.py -m /home/llama/mod/exl2/magnum-v2-123b-exl2/ -nfa -p "Once upon a time," -gs auto
With -tp: -- Response generated in 55.37 seconds, 128 tokens, 2.31 tokens/second (includes prompt eval.)
Without -tp: -- Response generated in 23.33 seconds, 128 tokens, 5.49 tokens/second (includes prompt eval.)
With -nxf: -- Response generated in 23.25 seconds, 128 tokens, 5.50 tokens/second (includes prompt eval.)
For comparison, on my hardware, I'm used to getting around 8-10 tokens/second with a similar model in GPTQ running on whatever aphrodite engine uses for tensor parallelism.
I thought specifying -nxf might prompt exllamav2 to use SDPA, and thus help show whether the lack of GQA support was what hurt performance so much, but it didn't actually seem to affect performance at all.
A couple issues with the new tensor parallelism implementation!
1) Tensor parallelism doesn't appear to respect the absence of flash attention, even via the -nfa flag. Flash attention also isn't documented as a requirement; instead, the code crashes on the first attempted inference run when it isn't available. My hardware doesn't support flash attention, so it would be super cool if the tensor parallelism implementation could fall back to xformers or similar.
2) Attempting to run tensor parallelism without also supplying gpu-split appears to result in the code looking for AMD memory on NVIDIA machines. Adding -gs appears to fix this, but it didn't seem like intended behavior?
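For the second point, one possible shape for backend-aware memory detection is sketched below. This is a guess at the intended behavior, not the project's actual code; the function name and the "probe whichever tool exists" strategy are assumptions.

```python
import shutil

def vram_probe() -> str:
    """Pick a memory-query tool based on what's actually installed,
    rather than unconditionally reaching for the AMD tooling."""
    for tool in ("nvidia-smi", "rocm-smi"):
        if shutil.which(tool):
            return tool
    return "none"  # no probe available; fall back to user-supplied -gs
```

On an NVIDIA box this would resolve to nvidia-smi and never touch rocm-smi, which is the behavior the -gs workaround currently approximates by hand.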