Closed — hvico closed this issue 4 months ago
The changes do require flash-attn, but only for paged attention/dynamic batching support. Sadly ROCm is pretty far behind on flash-attn support and I don't want to hold back important features indefinitely waiting for it to catch up. There is a PR for xformers support which I'll probably get to soon, but it's still second-rate support because xformers still doesn't have the paging feature that makes dynamic batching possible.
All that said, 0.1.0 should still work the same as 0.0.21 as long as you're not using the new generator. The one exception, the one you're probably running into, is that I removed the multiple-caches mode because it was kind of a dead-end feature and it was really complicating the control flow. Are you actually running generations with multiple caches or is it simply that you're passing a single cache as a list?
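If it is the latter, the fix on the caller's side is just to unwrap the list before calling into the model. A minimal, library-free sketch of that normalization (the helper name and error message are hypothetical, not exllamav2 API):

```python
def normalize_cache(cache):
    """Unwrap a single cache object passed as a one-element list.

    Hypothetical helper: pre-0.1.0 exllamav2 accepted a list of caches
    (multiple-caches mode); 0.1.0 removed that mode, so downstream code
    that wrapped its single cache in a list should pass it bare instead.
    """
    if isinstance(cache, list):
        if len(cache) != 1:
            raise ValueError("multiple-caches mode was removed in 0.1.0")
        return cache[0]
    return cache
```

A wrapper like EricLLM that builds `cache = [my_cache]` for the old API could run its cache through this before every forward call.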
Hi! Thanks for the response. I will check out the generation code. I am actually using EricLLM to wrap multiple exllamav2 threads, so maybe that code is using the feature you refer to.
Thanks again!
Hello,
I've updated from: exllamav2-0.0.20+rocm6.0-cp310-cp310-linux_x86_64.whl (was working quite well)
to
exllamav2-0.1.0+rocm6.0.torch2.3.0-cp310-cp310-linux_x86_64.whl
Now I get errors at inference with the same code I had:
```
/lib/python3.10/site-packages/exllamav2/model.py", line 880, in forward_chunk
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
/lib/python3.10/site-packages/exllamav2/attn.py", line 730, in forward
    attn_output,
UnboundLocalError: local variable 'attn_output' referenced before assignment
```
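For what it's worth, that kind of `UnboundLocalError` arises when a variable is assigned only inside conditional branches and none of them runs, e.g. when no attention backend matches. A minimal reproduction of the pattern (illustrative only, not the actual exllamav2 code):

```python
def forward(use_flash_attn: bool, use_xformers: bool):
    # attn_output is only bound inside a matching backend branch.
    # If neither backend is available (e.g. flash-attn missing on ROCm),
    # the return statement raises UnboundLocalError.
    if use_flash_attn:
        attn_output = "flash"
    elif use_xformers:
        attn_output = "xformers"
    return attn_output
```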
Also, I see some changes in this version require recent versions of flash-attn. The latest ROCm-compatible version I found is 2.0.4.
Any suggestions besides downgrading?
Thanks!