turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License
3.52k stars 271 forks

ROCm version 0.1.0, getting errors #467

Closed hvico closed 4 months ago

hvico commented 4 months ago

Hello,

I've updated from: exllamav2-0.0.20+rocm6.0-cp310-cp310-linux_x86_64.whl (was working quite well)

to

exllamav2-0.1.0+rocm6.0.torch2.3.0-cp310-cp310-linux_x86_64.whl

Now I get errors at inference with the same code I had:

```
  File "/lib/python3.10/site-packages/exllamav2/model.py", line 880, in forward_chunk
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras, **kwargs)
  File "/lib/python3.10/site-packages/exllamav2/attn.py", line 730, in forward
    attn_output,
UnboundLocalError: local variable 'attn_output' referenced before assignment
```
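For reference, this class of error comes from a variable that is only assigned inside conditional branches. The sketch below is a minimal reproduction of the pattern, not exllamav2's actual `attn.py`; the mode names and return strings are made up for illustration:

```python
# Sketch of the failure mode: attn_output is only bound inside
# backend-specific branches, so if no branch matches (e.g. an attention
# backend the build doesn't support), the final reference raises
# UnboundLocalError, as in the traceback above.
def forward(attn_mode: str) -> str:
    if attn_mode == "flash_attn":
        attn_output = "flash-attn path"
    elif attn_mode == "xformers":
        attn_output = "xformers path"
    # No fallback branch: any other mode leaves attn_output unassigned.
    return attn_output
```

Here `forward("flash_attn")` succeeds, while any unhandled mode raises the same UnboundLocalError instead of a clearer "unsupported backend" message.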

Also, I see some changes in this version require recent versions of flash-attn. The latest ROCm-compatible version I found is 2.0.4.

Any suggestions besides downgrading?

Thanks!

turboderp commented 4 months ago

The changes do require flash-attn, but only for paged attention/dynamic batching support. Sadly ROCm is pretty far behind on flash-attn support and I don't want to hold back important features indefinitely waiting for it to catch up. There is a PR for xformers support which I'll probably get to soon, but it's still second-rate support because xformers still doesn't have the paging feature that makes dynamic batching possible.

All that said, 0.1.0 should still work the same as 0.0.21 as long as you're not using the new generator. The one exception, the one you're probably running into, is that I removed the multiple-caches mode because it was kind of a dead-end feature and it was really complicating the control flow. Are you actually running generations with multiple caches or is it simply that you're passing a single cache as a list?
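To illustrate the call-site difference being described, here is a minimal sketch. The `Cache` stand-in, the `run_forward` helper, and the explicit list guard are all assumptions for illustration, not exllamav2's actual code (which may fail less gracefully, as the traceback above suggests):

```python
class Cache:
    """Stand-in for an ExLlamaV2Cache instance."""

def run_forward(cache):
    # Multiple-caches mode is gone in 0.1.0, so a cache wrapped in a
    # list is no longer a supported argument; pass the object itself.
    if isinstance(cache, list):
        raise TypeError("multiple-caches mode was removed; "
                        "pass a single cache object, not a list")
    return "ok"

cache = Cache()
run_forward(cache)       # 0.1.0-style call: pass the cache directly
# run_forward([cache])   # old list form; rejected under this sketch
```

Under this sketch, unwrapping the cache at the call site is the whole fix; wrapper code written against 0.0.x that passes `[cache]` would need that one-line change.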

hvico commented 4 months ago

Hi! Thanks for the response. I will check out the generation code. I am actually using EricLLM to wrap multiple exllamav2 threads, so maybe that code is using the feature you're referring to.

Thanks again!