theroyallab / tabbyAPI

An OAI compatible exllamav2 API that's both lightweight and fast
GNU Affero General Public License v3.0

[BUG] Ignores no_flash_attention: True in config, trying to use it anyway, failing on GPUs older than Ampere #135

Closed quarterturn closed 3 months ago

quarterturn commented 3 months ago
**Disclaimer:** Github Issues are **only** for code related bugs. If you do not understand how to startup or use TabbyAPI, please ask in the [Discord Server](https://discord.com/sYQxnuD7Fj)

**Describe the bug**
"no_flash_attention: True" in the config file is seemingly ignored

**To Reproduce**
Steps to reproduce the behavior:
1. export TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6 8.7 8.9"
2. re-build exllamav2 0.1.5 (main): pip install --force-reinstall --no-cache-dir -r requirements.txt && pip install --force-reinstall --no-cache-dir .
3. re-build tabbyAPI (main):  pip install --upgrade --no-cache-dir --force-reinstall .[cu124]

**Expected behavior**
The model should load successfully without flash_attention on hardware older than Ampere.

**Config**
# Sample YAML file for configuration.
# Comment and uncomment values as needed. Every value has a default within the application.
# This file serves to be a drop in for config.yml

# Unless specified in the comments, DO NOT put these options in quotes!
# You can use https://www.yamllint.com/ if you want to check your YAML formatting.

# Options for networking
network:
  # The IP to host on (default: 127.0.0.1).
  # Use 0.0.0.0 to expose on all network adapters
  host: 0.0.0.0

  # The port to host on (default: 5000)
  port: 5000

  # Disable HTTP token authentication with requests
  # WARNING: This will make your instance vulnerable!
  # Turn on this option if you are ONLY connecting from localhost
  disable_auth: False

# Options for logging
logging:
  # Enable prompt logging (default: False)
  prompt: False

  # Enable generation parameter logging (default: False)
  generation_params: False

# Options for sampling
sampling:
  # Override preset name. Find this in the sampler-overrides folder (default: None)
  # This overrides default fallbacks for sampler values that are passed to the API
  # Server-side overrides are NOT needed by default
  # WARNING: Using this can result in a generation speed penalty
  #override_preset:

# Options for development and experimentation
developer:
  # Skips exllamav2 version check (default: False)
  # It's highly recommended to update your dependencies rather than enabling this flag
  # WARNING: Don't set this unless you know what you're doing!
  #unsafe_launch: False

  # Disable all request streaming (default: False)
  # A kill switch for turning off SSE in the API server
  #disable_request_streaming: False

  # Enable the torch CUDA malloc backend (default: False)
  # This can save a few MBs of VRAM, but has a risk of errors. Use at your own risk.
  #cuda_malloc_backend: False

# Options for model overrides and loading
model:
  # Overrides the directory to look for models (default: models)
  # Windows users, DO NOT put this path in quotes! This directory will be invalid otherwise.
  model_dir: /home/derp/exl2-models/

  # An initial model to load. Make sure the model is located in the model directory!
  # A model can be loaded later via the API.
  # REQUIRED: This must be filled out to load a model on startup!
  #model_name: command-r-plus-103B-exl2/
  model_name: llama-3-8B-iterative-DPO-final-exl2

  # Sends dummy model names when the models endpoint is queried
  # Enable this if the program is looking for a specific OAI model
  #use_dummy_models: False

  # The below parameters apply only if model_name is set

  # Max sequence length (default: Empty)
  # Fetched from the model's base sequence length in config.json by default
  #max_seq_len:

  # Overrides base model context length (default: Empty)
  # WARNING: Don't set this unless you know what you're doing!
  # Again, do NOT use this for configuring context length, use max_seq_len above ^
  # Only use this if the model's base sequence length in config.json is incorrect (ex. Mistral 7B)
  #override_base_seq_len:

  # Automatically allocate resources to GPUs (default: True)
  # NOTE: Not parsed for single GPU users
  gpu_split_auto: False
  #gpu_split_auto: True

  # Reserve VRAM used for autosplit loading (default: 96 MB on GPU 0)
  # This is represented as an array of MB per GPU used
  #autosplit_reserve: [2048,2048,2048,1024,1024]

  # An integer array of GBs of vram to split between GPUs (default: [])
  # NOTE: Not parsed for single GPU users
  gpu_split: [0,0,0,16,0]

  # Rope scale (default: 1.0)
  # Same thing as compress_pos_emb
  # Only use if your model was trained on long context with rope (check config.json)
  # Leave blank to pull the value from the model
  #rope_scale: 1.0

  # Rope alpha (default: 1.0)
  # Same thing as alpha_value
  # Leave blank to automatically calculate alpha
  #rope_alpha: 1.0

  # Disable Flash-attention 2. Set to True for GPUs lower than Nvidia's 3000 series. (default: False)
  #no_flash_attention: False
  no_flash_attention: True

  # Enable different cache modes for VRAM savings (slight performance hit).
  # Possible values FP16, FP8, Q4. (default: FP16)
  #cache_mode: FP16

  # Chunk size for prompt ingestion. A lower value reduces VRAM usage at the cost of ingestion speed (default: 2048)
  # NOTE: Effects vary depending on the model. An ideal value is between 512 and 4096
  #chunk_size: 2048

  # Set the prompt template for this model. If empty, attempts to look for the model's chat template. (default: None)
  # If a model contains multiple templates in its tokenizer_config.json, set prompt_template to the name
  # of the template you want to use.
  # NOTE: Only works with chat completion message lists!
  #prompt_template:

  # Number of experts to use PER TOKEN. Fetched from the model's config.json if not specified (default: Empty)
  # WARNING: Don't set this unless you know what you're doing!
  # NOTE: For MoE models (ex. Mixtral) only!
  #num_experts_per_token:

  # Enables CFG support (default: False)
  # WARNING: This flag disables Flash Attention! (a stopgap fix until it's fixed in upstream)
  #use_cfg: True

  # Enables fasttensors to possibly increase model loading speeds (default: False)
  #fasttensors: true

  # Options for draft models (speculative decoding). This will use more VRAM!
  #draft:
    # Overrides the directory to look for draft (default: models)
    #draft_model_dir: models

    # An initial draft model to load. Make sure this model is located in the model directory!
    # A draft model can be loaded later via the API.
    #draft_model_name: A model name

    # Rope scale for draft models (default: 1.0)
    # Same thing as compress_pos_emb
    # Only use if your draft model was trained on long context with rope (check config.json)
    #draft_rope_scale: 1.0

    # Rope alpha for draft model (default: 1.0)
    # Same thing as alpha_value
    # Leave blank to automatically calculate alpha value
    #draft_rope_alpha: 1.0

  # Options for loras
  #lora:
    # Overrides the directory to look for loras (default: loras)
    #lora_dir: loras

    # List of loras to load and associated scaling factors (default: 1.0). Comment out unused entries or add more rows as needed.
    #loras:
    #- name: lora1
    #  scaling: 1.0

**Logs**
~/tabbyAPI$ python3 main.py --config ./config.yml
INFO:     Attempting to override config.yml from args.
INFO:     ExllamaV2 version: 0.1.5
INFO:     Your API key is: xxx
INFO:     Your admin key is: xxx
INFO:
INFO:     If these keys get compromised, make sure to delete api_tokens.yml and restart the server. Have fun!
INFO:     Generation logging is disabled
WARNING:  An unsupported GPU is found in this configuration. Switching to compatibility mode.
WARNING:  This disables parallel batching and features that rely on it (ex. CFG).
WARNING:  To disable compatability mode, all GPUs must be ampere (30 series) or newer. AMD GPUs are not supported.
INFO:     Attempting to load a prompt template if present.
INFO:     Using template "from_tokenizer_config" for chat completions.
INFO:     Loading model: /home/derp/exl2-models/llama-3-8B-iterative-DPO-final-exl2
INFO:     Loading with a manual GPU split (or a one GPU setup)
Loading model modules ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 67/67 0:00:00
Traceback (most recent call last):
  File "/home/derp/tabbyAPI/main.py", line 121, in <module>
    asyncio.run(entrypoint())
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/derp/tabbyAPI/main.py", line 109, in entrypoint
    await model.load_model(model_path.resolve(), **model_config)
  File "/home/derp/tabbyAPI/common/model.py", line 79, in load_model
    async for _ in load_model_gen(model_path, **kwargs):
  File "/home/derp/tabbyAPI/common/model.py", line 58, in load_model_gen
    async for module, modules in load_status:
  File "/home/derp/tabbyAPI/backends/exllamav2/model.py", line 490, in load_gen
    async for value in iterate_in_threadpool(model_load_generator):
  File "/home/derp/tabbyAPI/common/concurrency.py", line 30, in iterate_in_threadpool
    yield await asyncio.to_thread(gen_next, generator)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/tabbyAPI/common/concurrency.py", line 20, in gen_next
    return next(generator)
           ^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 56, in generator_context
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/home/derp/tabbyAPI/backends/exllamav2/model.py", line 649, in load_model_sync
    self.model.forward(input_ids, cache=self.cache, preprocess_only=True)
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/exllamav2/model.py", line 792, in forward
    r = self.forward_chunk(
        ^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/exllamav2/model.py", line 890, in forward_chunk
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/exllamav2/attn.py", line 874, in forward
    attn_output = attn_func(batch_size, q_len, q_states, k_states, v_states, attn_params, cfg)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/exllamav2/attn.py", line 687, in _attn_torch
    attn_output = F.scaled_dot_product_attention(q_states, k_states, v_states, attn_mask_lr)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 281, in __torch_function__
    return cls._dispatch(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/derp/.conda/envs/exllamav2/lib/python3.11/site-packages/torch/nn/attention/bias.py", line 205, in _dispatch
    return scaled_dot_product_attention(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
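
For context, a minimal diagnostic sketch (assuming torch 2.x) that prints each visible GPU's compute capability and which SDPA backends PyTorch has enabled; flash attention requires compute capability 8.0 (Ampere) or newer:

```python
# Diagnostic sketch: list each visible GPU's compute capability and the SDPA
# backends PyTorch currently allows. Pre-Ampere cards report a major version < 8
# and cannot run the flash-attn kernel that the traceback above complains about.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i} {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")

print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())
```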
quarterturn commented 3 months ago

I can reproduce in exui, so I guess it's actually an exllamav2 issue.

quarterturn commented 3 months ago

I'll close the issue if there are no comments within a few days.

turboderp commented 3 months ago

There's no way to pass an equivalent argument to ExUI. It just uses flash-attn if flash-attn is available.

Tabby was also changed to do the same. It turns out it's not at all easy to decide on a "best way" to handle all the edge cases, and having flash-attn installed on a system that doesn't support it is a little strange. Can you elaborate on your setup?

quarterturn commented 3 months ago

It doesn't seem to matter if I have flash_attn installed, at least to tabbyAPI. If I uninstall it, minimal_chat.py fails with:

AssertionError: Paged attention required Flash Attention 2.5.7 or later

I have the following GPUs: 2x 3090, 1x 2080 Ti, and 2x P100. If I select only the 3090s there's no problem with flash attention, as expected. But I'd like to be able to use the P100s for big models (like CR+).

DocShotgun commented 3 months ago

This is because the checks properly force non-paged compatibility mode for unsupported configurations like yours, but don't force flash attention off as well. (Technically there is an edge case where users can run non-paged compatibility mode with flash attention enabled: all Ampere+ GPUs with 2.2.1 <= flash-attn version < 2.5.7.) I've made a PR that should resolve this.
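
Roughly, the kind of decision being described, as a sketch (the helper names and version checks are illustrative, not the PR's actual code):

```python
# Sketch of the compatibility decision described above (illustrative names, not
# the PR itself). Paged mode needs every GPU to be Ampere+ AND flash-attn >= 2.5.7;
# flash attention itself must also be switched off when the hardware check fails,
# which is the part that was previously missing.
from packaging import version
import torch

def all_gpus_ampere_or_newer() -> bool:
    return all(torch.cuda.get_device_capability(i)[0] >= 8
               for i in range(torch.cuda.device_count()))

def flash_attn_at_least(minimum: str) -> bool:
    try:
        import flash_attn
        return version.parse(flash_attn.__version__) >= version.parse(minimum)
    except ImportError:
        return False

hardware_ok = all_gpus_ampere_or_newer()
use_paged_attention = hardware_ok and flash_attn_at_least("2.5.7")
use_flash_attention = hardware_ok and flash_attn_at_least("2.2.1")  # the missing guard
```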

Ph0rk0z commented 3 months ago

I have a similar setup and would rather have xformers plus a bigger model than paged attention, especially at batch 1. I've been forcing the latter, so maybe that's why I have never seen this error. P100 and 3090 memory bandwidth aren't that far off; 4 of them in tensor parallel can match or exceed the t/s. It's not like a P40.

I thought it would still use the dynamic generator without paged attention; at least it seemed to before. Obviously, with 3 Ampere cards, uninstalling flash attention isn't an option. With current hardware pricing and availability, it's not easy to say what counts as an "edge" case.

DocShotgun commented 3 months ago

To be clear, the edge case I was referring to is users on all Ampere+ hardware using an outdated version of flash attention and thus not having access to paged mode. The case you are describing is meant to fall back properly to non-paged mode without an error.

You may want to try torch SDPA instead of xformers for comparison to see how that performs on torch 2.3.0+ in the latest exllamav2, as that would remove an additional dependency.

Also there is no support for tensor parallel in exllamav2 yet.
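
For reference, a minimal sketch of what pinning SDPA to the memory-efficient backend for such a comparison could look like (torch >= 2.3 API; tensor shapes are arbitrary):

```python
# Sketch: run scaled_dot_product_attention with only the memory-efficient
# backend allowed, for a like-for-like comparison against xformers.
# torch >= 2.3 API; older versions expose torch.backends.cuda.sdp_kernel instead.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 2048, 128, device="cuda", dtype=torch.float16)
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```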

quarterturn commented 3 months ago

I tried the changes to backends/exllamav2/model.py and still get the same error.

turboderp commented 3 months ago

@quarterturn I suspect this is actually a PyTorch issue, i.e. SDPA selects the flash-attn backend because the library is installed, even if it isn't supported. Can you try adding this?

torch.backends.cuda.enable_flash_sdp(False)
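
A sketch of how that call could be gated on hardware (the guard is illustrative, not tabbyAPI's actual code):

```python
# Illustrative guard: disable PyTorch's flash SDPA backend when any visible GPU
# is older than Ampere, so F.scaled_dot_product_attention falls back to the
# math / memory-efficient kernels instead of the unsupported flash kernel.
import torch

if any(torch.cuda.get_device_capability(i)[0] < 8
       for i in range(torch.cuda.device_count())):
    torch.backends.cuda.enable_flash_sdp(False)
```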
quarterturn commented 3 months ago

I put it here:

    134         self.cache_mode = unwrap(kwargs.get("cache_mode"), "FP16")
    135         torch.backends.cuda.enable_flash_sdp(False)
    136         # Turn off GPU split if the user is using 1 GPU
    137         gpu_count = torch.cuda.device_count()

That fixed it. Thanks!

Ph0rk0z commented 3 months ago

> You may want to try torch SDPA instead of xformers

The main benefit of xformers is a flash-attention-like reduction in context memory on cards that don't support flash attention. If SDPA is just using flash attention, then maybe it won't reduce memory by much. It kind of stinks when one card out of four causes context memory use to shoot up to unmanageable levels. Will have to test.

vllm/aphrodite support tensor parallel; that's where that was tested.