I can reproduce in exui, so I guess it's actually an exllamav2 issue.
I'll close the issue if there are no comments within a few days.
There's no way to pass an equivalent argument to ExUI; it just uses flash-attn if it's available.
Tabby was also changed to do the same. Turns out it's not at all easy to decide on a "best way" to deal with all the edge cases, and having flash-attn installed on a system that doesn't support flash-attn is a little strange. Can you elaborate on your setup?
It doesn't seem to matter if I have flash_attn installed, at least to tabbyAPI. If I uninstall it, minimal_chat.py fails with:
```
AssertionError: Paged attention required Flash Attention 2.5.7 or later
```
I have the following GPUs: 2x 3090, 1x 2080ti, and 2x P100. If I select only the 3090s there's no problem with flash attention, as expected. But, I'd like to be able to use the P100s for big models (like CR+).
This is due to the checks properly forcing non-paged compatibility mode for unsupported configurations like yours, but not forcing flash attention off as well. (Technically there is an edge case where users could run non-paged compatibility mode with flash attention enabled: when they have all Ampere+ GPUs and a flash-attn version in the range 2.2.1 <= version < 2.5.7.) I've made a PR that should resolve this.
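For illustration, a sketch of the kind of guard described above, with hypothetical names (this is not the actual exllamav2/tabbyAPI code): paged mode needs flash-attn >= 2.5.7 on all-Ampere+ hardware, and flash attention itself has to be forced off whenever any GPU is pre-Ampere.

```python
# Hypothetical sketch of the guard logic described above; names are
# illustrative, not the actual exllamav2/tabbyAPI code.
from importlib.metadata import PackageNotFoundError, version

import torch


def _fa_version():
    try:
        # "2.5.7.post1" -> (2, 5, 7)
        return tuple(int(x) for x in version("flash-attn").split(".")[:3])
    except (PackageNotFoundError, ValueError):
        return None


fa = _fa_version()
# flash-attn kernels only run on Ampere (SM 8.0) and newer GPUs
all_ampere_plus = all(
    torch.cuda.get_device_capability(i)[0] >= 8
    for i in range(torch.cuda.device_count())
)

# Paged attention: requires flash-attn >= 2.5.7 on all-Ampere+ hardware.
use_paged = fa is not None and fa >= (2, 5, 7) and all_ampere_plus
# Flash attention itself: usable from 2.2.1 even in non-paged mode (the
# edge case above), but must be off entirely if any GPU is pre-Ampere.
use_flash_attn = fa is not None and fa >= (2, 2, 1) and all_ampere_plus
```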
I have a similar setup and would rather have xformers + a bigger model than paged attention, especially at batch 1. I've been forcing the latter, so maybe that's why I've never seen this error. P100 and 3090 memory bandwidth aren't that far off; four of them in tensor parallel can match or exceed t/s, it's not like a P40.
I thought it would still use the dynamic generator without paged attention; at least it seemed to before. Obviously, with three Ampere cards, uninstalling flash attention isn't an option. With current hardware pricing and availability, it's not easy to say what counts as an "edge" case.
To be clear, the edge case I was referring to is users on all Ampere+ hardware using an outdated version of flash attention and thus not having access to paged mode as a result of that. The case you are describing is meant to properly fall back to non-paged mode without an error.
You may want to try torch SDPA instead of xformers for comparison to see how that performs on torch 2.3.0+ in the latest exllamav2, as that would remove an additional dependency.
Also, there is no tensor parallel support in exllamav2 yet.
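For reference, a minimal sketch of driving torch SDPA directly while pinning it to the memory-efficient kernel, which runs on pre-Ampere cards such as the P100; the shapes are arbitrary and the sdpa_kernel import assumes torch 2.3+ (none of this is tabbyAPI code):

```python
# Minimal sketch: call torch SDPA directly and keep it off the flash-attn
# backend even if flash-attn is installed. Shapes here are arbitrary.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # torch 2.3+

q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the memory-efficient kernel (works on pre-Ampere GPUs)
# instead of letting it auto-select the flash-attn backend.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```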
I tried the changes to backends/exllamav2/model.py and still get the same error.
@quarterturn I suspect this is actually a PyTorch issue, i.e. SDPA selects the flash-attn backend because the library is installed, even if it isn't supported. Can you try adding this?
```python
torch.backends.cuda.enable_flash_sdp(False)
```
I put it here:
```python
134    self.cache_mode = unwrap(kwargs.get("cache_mode"), "FP16")
135    torch.backends.cuda.enable_flash_sdp(False)
136    # Turn off GPU split if the user is using 1 GPU
137    gpu_count = torch.cuda.device_count()
```
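A quick sanity check that the toggle took effect (a sketch using PyTorch's backend query functions, not something from the PR):

```python
# Verify the flash SDPA backend is actually disabled before loading the model.
import torch

torch.backends.cuda.enable_flash_sdp(False)
assert not torch.backends.cuda.flash_sdp_enabled()
# The memory-efficient and math backends remain available as fallbacks.
print(torch.backends.cuda.mem_efficient_sdp_enabled())  # True
print(torch.backends.cuda.math_sdp_enabled())           # True
```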
That fixed it. Thanks!
> You may want to try torch SDPA instead of xformers
The main benefit of xformers is a flash-attention-like memory reduction for context on cards that don't support flash attention. If SDPA is just using flash attention, then maybe it won't reduce memory by much. It kind of stinks when one card out of four causes context memory to shoot up to unmanageable levels. Will have to test.
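For that test, one rough way to compare peak context memory per backend (a sketch; the actual model load and generation call are elided):

```python
# Rough VRAM comparison between attention backends; run once per backend.
import torch

for i in range(torch.cuda.device_count()):
    torch.cuda.reset_peak_memory_stats(i)

# ... load the model and run a long-context prompt through the generator ...

for i in range(torch.cuda.device_count()):
    peak_gib = torch.cuda.max_memory_allocated(i) / 2**30
    print(f"cuda:{i} peak VRAM: {peak_gib:.2f} GiB")
```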
vLLM/Aphrodite support tensor parallel; that's where that was tested.