turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Tensor parallelism issues #598

Open dirkson opened 3 weeks ago

dirkson commented 3 weeks ago

A couple issues with the new tensor parallelism implementation!

1) Tensor Parallelism doesn't appear to respect a lack of flash attention, even via the -nfa flag. It also doesn't document flash attention as a requirement, instead crashing on the first attempted inference run when flash attention isn't available. My hardware doesn't have support for flash attention, so it would be super cool if the tensor parallelism implementation could fall back to xformers or similar.

2) Attempting to run tensor parallelism without also supplying gpu-split appears to result in the code looking for AMD memory on NVIDIA machines. Adding -gs appears to fix this, but it didn't seem like intended behavior?

-- Options: ['tensor_parallel']
Traceback (most recent call last):
  File "/home/llama/opt/exllamav2/test_inference.py", line 100, in <module>
    model, tokenizer = model_init.init(
                       ^^^^^^^^^^^^^^^^
  File "/home/llama/opt/exllamav2/exllamav2/model_init.py", line 135, in init
    post_init_load(
  File "/home/llama/opt/exllamav2/exllamav2/model_init.py", line 168, in post_init_load
    model.load_tp(split, progress = progress)
  File "/home/llama/opt/exllamav2/exllamav2/model.py", line 373, in load_tp
    for item in f:
  File "/home/llama/opt/exllamav2/exllamav2/model.py", line 388, in load_tp_gen
    self.tp_context = TPContext(self, gpu_split, expect_cache_tokens, expect_cache_base)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/opt/exllamav2/exllamav2/tensor_p.py", line 84, in __init__
    self.define_split(gpu_split, expect_cache_tokens, expect_cache_base)
  File "/home/llama/opt/exllamav2/exllamav2/tensor_p.py", line 111, in define_split
    gpu_memory = get_all_gpu_memory()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/opt/exllamav2/exllamav2/util.py", line 299, in get_all_gpu_memory
    amd_memory = get_amd_gpu_memory()
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/opt/exllamav2/exllamav2/util.py", line 259, in get_amd_gpu_memory
    result = subprocess.run(
             ^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'rocm-smi'

turboderp commented 2 weeks ago

This check runs on NVIDIA systems too, and doesn't generally cause an issue. But it looks like you have an executable called rocm-smi in your PATH that you don't have permission to run, which I hadn't accounted for.

I've changed it so that it should catch any exceptions from that check and you shouldn't have an issue. It's in the dev branch but a new release with this and a bunch more fixes is coming real soon.
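As a rough illustration of the kind of guard described here, the sketch below wraps an optional vendor tool call so that a missing or non-executable binary yields None instead of a traceback. The helper name query_tool_output and its timeout parameter are hypothetical and not taken from exllamav2; the actual change in the dev branch may look quite different.

```python
import subprocess


def query_tool_output(cmd, timeout=5.0):
    """Run an optional external tool and return its stdout, or None if it is unusable.

    Hypothetical helper for illustration only; not part of exllamav2.
    """
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    except (OSError, subprocess.SubprocessError):
        # OSError covers FileNotFoundError and PermissionError (as in the
        # traceback above); SubprocessError covers a non-zero exit status
        # and a timeout.
        return None
    return result.stdout
```

With a guard like this, a call such as query_tool_output(["rocm-smi"]) would simply return None on a machine where the tool is absent or cannot be executed, and the caller could treat that as "no AMD GPUs found".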

dirkson commented 2 weeks ago

After thinking about your response and the original error message a little, I discovered that I had added a directory my user doesn't have access to into PATH. That's the cause of the permission denied. The most recent dev commit still errors identically when this is the case, but does resolve correctly when the user running the software actually has permissions to their own PATH. Thanks/Sorry!
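For anyone hitting the same PermissionError, a quick way to check for this situation is to walk PATH and flag entries the current user cannot read or search. This is a minimal sketch; the function name inaccessible_path_entries and the reason strings are made up for this example.

```python
import os


def inaccessible_path_entries(path_value):
    """Return (entry, reason) pairs for PATH entries the current user cannot use.

    Hypothetical diagnostic helper; pass in os.environ.get("PATH", "").
    """
    problems = []
    for entry in path_value.split(os.pathsep):
        if not entry:
            # An empty entry conventionally means the current directory; skip it.
            continue
        if not os.path.isdir(entry):
            # Also triggers when a parent directory cannot be searched,
            # because the entry itself cannot be inspected.
            problems.append((entry, "missing or unreachable"))
        elif not os.access(entry, os.R_OK | os.X_OK):
            problems.append((entry, "no read/search permission"))
    return problems
```

Calling inaccessible_path_entries(os.environ.get("PATH", "")) as the user that runs the inference script lists any directories worth removing from PATH or fixing permissions on.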

However, the rocm stuff is the far smaller of the two issues I mentioned. Here's a log of attempting to run inference with tensor parallelism but without flash attention in the most recent dev:

$ python examples/chat.py -m /home/llama/mod/exl2/magnum-v2-123b-exl2/ -nfa -mode codellama -c8 -l 4096 -tp
 -- Model: /home/llama/mod/exl2/magnum-v2-123b-exl2/
 -- Options: ['tensor_parallel', 'length: 4096', 'no_flash_attn']
 -- Loading tokenizer...
 -- Loading model...
 -- Prompt format: codellama
 -- System prompt:

You are a helpful coding assistant. Always answer as helpfully as possible.

User: Hi there! Please don't crash.

Traceback (most recent call last):
  File "/home/llama/opt/exllamav2/examples/chat.py", line 295, in <module>
    generator.begin_stream_ex(active_context, settings)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 363, in begin_stream_ex
    self._gen_begin_reuse(input_ids, gen_settings)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 731, in _gen_begin_reuse
    self._gen_begin(in_tokens, gen_settings)
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/generator/streaming.py", line 692, in _gen_begin
    self.model.forward(self.sequence_ids[:, :-1],
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/model.py", line 868, in forward
    r = self.forward_chunk(
        ^^^^^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/model.py", line 976, in forward_chunk
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/attn.py", line 972, in forward
    return self.forward_tp(
           ^^^^^^^^^^^^^^^^
  File "/home/llama/.pyenv/versions/3.11.9/lib/python3.11/site-packages/exllamav2/attn.py", line 1143, in forward_tp
    ext_c.tp_attn_forward_(
ModuleNotFoundError: No module named 'flash_attn_2_cuda'

turboderp commented 2 weeks ago

Oh, right, I guess this wasn't really communicated anywhere, I apologize for that. But the TP feature currently requires flash attn. :/

It's still a little experimental and unfinished, and I'll probably add SDPA support sometime soon, though it still won't work with the dynamic generator.

turboderp commented 2 weeks ago

Right, I've pushed an update to the dev branch that should allow inference with TP mode even when flash-attn isn't available. It uses a slower code path for now so it's hard to say if you'll see any speedup. Torch SDPA is very limited (it's not just paged attn that's missing but also GQA support) but I'll see about further improving performance down the line.
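To make the GQA point concrete: with grouped-query attention, several query heads share one K/V head, so an attention routine that expects one K/V head per query head needs the K/V heads repeated first. The plain-Python sketch below only shows that head mapping; expand_kv_heads is a hypothetical name, the head counts are illustrative, and this is not how exllamav2 implements its fallback path.

```python
def expand_kv_heads(kv_heads, num_q_heads):
    """Repeat each K/V head so that every query head has its own copy.

    kv_heads is any list with one item per K/V head (placeholders here,
    tensors in a real model). Hypothetical helper for illustration only.
    """
    num_kv_heads = len(kv_heads)
    if num_kv_heads == 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of the K/V head count")
    group_size = num_q_heads // num_kv_heads
    # Query head i attends to K/V head i // group_size.
    return [kv_heads[i // group_size] for i in range(num_q_heads)]
```

For example, 8 query heads over 2 K/V heads gives a group size of 4, so each K/V head appears four times in the expanded list. In a real model that repetition means extra copying and memory traffic on every forward pass, which is one plausible cost of a code path without native GQA support.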

dirkson commented 2 weeks ago

No speedup, but it does function!

Running the following command on an 8x P100 machine:

python test_inference.py -m /home/llama/mod/exl2/magnum-v2-123b-exl2/ -nfa -p "Once upon a time," -gs auto

With -tp: -- Response generated in 55.37 seconds, 128 tokens, 2.31 tokens/second (includes prompt eval.)
Without -tp: -- Response generated in 23.33 seconds, 128 tokens, 5.49 tokens/second (includes prompt eval.)
With -nxf: -- Response generated in 23.25 seconds, 128 tokens, 5.50 tokens/second (includes prompt eval.)
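For a sense of scale, the reported times can be turned into throughput and a relative slowdown. The snippet below only recomputes from the rounded figures quoted above (128 tokens per run), so the last digit may differ slightly from what the tool printed.

```python
TOKENS = 128

# Wall-clock seconds as reported in the three runs above.
runs = {
    "with -tp": 55.37,
    "without -tp": 23.33,
    "with -nxf": 23.25,
}

# Tokens per second for each run, including prompt evaluation.
rates = {name: TOKENS / seconds for name, seconds in runs.items()}

# How many times longer the tensor-parallel run took than the baseline.
slowdown = runs["with -tp"] / runs["without -tp"]

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} tokens/second")
print(f"-tp slowdown vs. without -tp: {slowdown:.2f}x")
```

On these numbers the -tp run takes roughly 2.4 times as long as the run without it, while -nxf makes no meaningful difference.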

For comparison, on my hardware, I'm used to getting around 8-10 tokens/second with a similar model in GPTQ running on whatever aphrodite engine uses for tensor parallelism.

I thought specifying -nxf might prompt exllamav2 to use SDPA, and thus help show whether it was the lack of GQA support that hurt performance so much, but it didn't actually seem to affect performance at all.