turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

graphrag can't index using mistral large 123B with exllamav2 #582

Open xxll88 opened 3 months ago

xxll88 commented 3 months ago

graphrag is configured with concurrent_requests: 25 and timeout: 180. After 180 s, only 5 chat completions have finished, and there are many chat timeouts like:

ERROR: Chat completion 2a725977bfa24ff5ad768d0f0cf563d7 cancelled by user.
ERROR: Chat completion b1854e0f70b14a6c906310c3e5a7a7c6 cancelled by user.
ERROR: Chat completion 9aa757db07ba407dab480a61dcd1f44a cancelled by user.
ERROR: Chat completion faab131c61564578b653c5cda80494fe cancelled by user.
ERROR: Chat completion ce1e6fe36f1a4166935bbe211188cbf1 cancelled by user.

Why does this happen, and how can I fix it?

turboderp commented 3 months ago

I don't really know what graphrag is or what sorts of requests it's sending. I take it this is with TabbyAPI?

The likely reason the requests show up as cancelled is that the connections are closed by the frontend before they finish streaming. I have no idea whether that's intentional, though. It could also be a timeout.

If you want actual concurrency for 25 requests, you need a cache large enough to accommodate all of them, i.e. 25x the length of each prompt plus max_new_tokens. Otherwise, the requests that can't fit in the cache are scheduled for sequential inference instead. So what could be happening is that graphrag sends 25 requests, Tabby can fit 20 of them in the cache, those start streaming right away, but the last 5 appear to stall and the frontend just gives up on them?
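The failure mode described above can be sketched with a toy asyncio simulation (this is not TabbyAPI or graphrag code; slot counts and timings are made-up, scaled-down stand-ins). A semaphore plays the role of the cache: only so many sequences fit at once, overflow requests queue behind them, and a client-side timeout cancels whatever hasn't finished in time:

```python
import asyncio

CACHE_SLOTS = 20       # sequences the cache fits at once (assumed for illustration)
NUM_REQUESTS = 25      # graphrag's concurrent_requests
GEN_TIME = 0.1         # seconds to "generate" one response (toy value)
CLIENT_TIMEOUT = 0.15  # stands in for graphrag's 180 s timeout

async def handle(sem: asyncio.Semaphore, i: int) -> int:
    # A request must claim a cache slot before it can stream tokens;
    # requests that don't fit wait here, i.e. run sequentially afterwards.
    async with sem:
        await asyncio.sleep(GEN_TIME)
        return i

async def main() -> tuple[int, int]:
    sem = asyncio.Semaphore(CACHE_SLOTS)
    tasks = [asyncio.create_task(handle(sem, i)) for i in range(NUM_REQUESTS)]
    # The client waits CLIENT_TIMEOUT, then drops whatever is still pending,
    # which the server would log as "cancelled by user".
    done, pending = await asyncio.wait(tasks, timeout=CLIENT_TIMEOUT)
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)

completed, cancelled = asyncio.run(main())
print(f"{completed} completed, {cancelled} cancelled")  # 20 completed, 5 cancelled
```

The first 20 requests complete inside the timeout; the 5 that queued behind them are still running when the client gives up, matching the pattern of stalled-then-cancelled completions. The real fix in that framing would be either a cache sized for concurrent_requests × (prompt + max_new_tokens) tokens, or a lower concurrent_requests / higher timeout on the graphrag side.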

Just a guess.