turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

graphrag can't index using mistral large 123B with exllamav2 #582

Open xxll88 opened 3 months ago

xxll88 commented 3 months ago

graphrag is configured with concurrent_requests: 25 and timeout: 180. After 180 s, only 5 chat completions have finished, and there are many chat timeouts like:

ERROR: Chat completion 2a725977bfa24ff5ad768d0f0cf563d7 cancelled by user.
ERROR: Chat completion b1854e0f70b14a6c906310c3e5a7a7c6 cancelled by user.
ERROR: Chat completion 9aa757db07ba407dab480a61dcd1f44a cancelled by user.
ERROR: Chat completion faab131c61564578b653c5cda80494fe cancelled by user.
ERROR: Chat completion ce1e6fe36f1a4166935bbe211188cbf1 cancelled by user.

Why does this happen, and how can I fix it?

turboderp commented 3 months ago

I don't really know what graphrag is or what sorts of requests it's sending. I take it this is with TabbyAPI?

The likely reason the requests show up as cancelled is that the connections are closed by the frontend before they finish streaming. I have no idea whether that's intentional, though. It could also be a timeout.

If you want actual concurrency for 25 requests, you need a cache large enough to accommodate all of them, i.e. 25x the length of each prompt plus max_new_tokens. Otherwise, the requests that can't fit in the cache are scheduled for sequential inference instead. So what could be happening is that graphrag sends 25 requests, Tabby can fit 20 of them in the cache, those start streaming right away, but the last 5 appear to stall and the frontend just gives up on them?
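The failure mode described above can be sketched with a toy asyncio simulation (this is not TabbyAPI or graphrag code; slot counts and timings are made-up, scaled-down stand-ins). A semaphore plays the role of the cache: only so many sequences fit at once, overflow requests queue behind them, and a client-side timeout cancels whatever hasn't finished in time:

```python
import asyncio

CACHE_SLOTS = 20       # sequences the cache fits at once (assumed for illustration)
NUM_REQUESTS = 25      # graphrag's concurrent_requests
GEN_TIME = 0.1         # seconds to "generate" one response (toy value)
CLIENT_TIMEOUT = 0.15  # stands in for graphrag's 180 s timeout

async def handle(sem: asyncio.Semaphore, i: int) -> int:
    # A request must claim a cache slot before it can stream tokens;
    # requests that don't fit wait here, i.e. run sequentially afterwards.
    async with sem:
        await asyncio.sleep(GEN_TIME)
        return i

async def main() -> tuple[int, int]:
    sem = asyncio.Semaphore(CACHE_SLOTS)
    tasks = [asyncio.create_task(handle(sem, i)) for i in range(NUM_REQUESTS)]
    # The client waits CLIENT_TIMEOUT, then drops whatever is still pending,
    # which the server would log as "cancelled by user".
    done, pending = await asyncio.wait(tasks, timeout=CLIENT_TIMEOUT)
    for t in pending:
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)

completed, cancelled = asyncio.run(main())
print(f"{completed} completed, {cancelled} cancelled")  # 20 completed, 5 cancelled
```

The first 20 requests complete inside the timeout; the 5 that queued behind them are still running when the client gives up, matching the pattern of stalled-then-cancelled completions. The real fix in that framing would be either a cache sized for concurrent_requests × (prompt + max_new_tokens) tokens, or a lower concurrent_requests / higher timeout on the graphrag side.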

Just a guess.