turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Any cleanup needed after generation #533

Closed waterangel91 closed 1 day ago

waterangel91 commented 3 days ago

I am wrapping exllamav2 in FastAPI and using the new dynamic async generator.

Just want to ask about performance: is there anything I should do after each generation request? I notice my server more or less stops generating after a few hundred requests. It might be an issue on my end, but I just want to check whether there is any cleanup I should do.
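
For reference, a minimal sketch of that kind of setup (with a hypothetical model path and endpoint name; the generator is created in a startup hook so it is constructed inside the running event loop):

    from fastapi import FastAPI
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2DynamicGeneratorAsync, ExLlamaV2DynamicJobAsync

    app = FastAPI()
    generator = None
    tokenizer = None

    @app.on_event("startup")
    async def load_model():
        # Load the model once and create the async generator inside the event loop.
        global generator, tokenizer
        config = ExLlamaV2Config("/path/to/model")        # hypothetical model directory
        model = ExLlamaV2(config)
        cache = ExLlamaV2Cache(model, lazy = True)
        model.load_autosplit(cache)
        tokenizer = ExLlamaV2Tokenizer(config)
        generator = ExLlamaV2DynamicGeneratorAsync(model = model, cache = cache, tokenizer = tokenizer)

    @app.post("/generate")
    async def generate(prompt: str):
        job = ExLlamaV2DynamicJobAsync(
            generator,
            input_ids = tokenizer.encode(prompt, add_bos = True),
            max_new_tokens = 512,
        )
        text = ""
        async for result in job:      # iterate all the way through so the job can finish
            text += result.get("text", "")
        return {"completion": text}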

turboderp commented 3 days ago

There shouldn't be any cleanup needed. I'm curious what you mean by "kinda not generating"? Is it slowing down to the point of basically not outputting anything?

It's conceivable that on some hardware it might become slightly slower as the cache gets fragmented. It does defragment the cache periodically, though, but it might be prevented from doing that if overlapping requests are coming in non-stop. Still, it shouldn't make it stop, so it doesn't sound like expected behavior. Could be something to do with how you're ending generations?

Note that TabbyAPI already has a working FastAPI server. You could use that for reference, perhaps.

waterangel91 commented 3 days ago

It does respond, but it becomes super slow. For context, a request takes about 5 minutes on a Mistral Q6 quant on an RTX 4090, whereas normally it would take about 5 seconds at most.

I would use Tabby if I could, but the product I am trying to develop needs ExLlama embedded inside my FastAPI endpoint.

Do you know if Tabby applies any memory fragmentation handling? I will try to learn from their code how they handle it.

turboderp commented 3 days ago

Tabby does embed ExLlama in a FastAPI endpoint. If you need it to be custom, it's still good as a reference.

Anyway, I could imagine maybe you're not finishing generations properly? If you're not iterating all the way through each ExLlamaV2DynamicJobAsync, the job would never be allowed to finish, so if you're stopping a job early (like with some custom stop condition logic), you have to call await job.cancel() to actually terminate it.

You could check the jobs member of the ExLlamaV2DynamicGeneratorAsync object to see if you're accumulating unfinished async jobs, and also look at the pending_jobs and active_jobs lists in the ExLlamaV2DynamicGenerator to see what's going on there.
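
For example, a rough diagnostic sketch (it assumes the async wrapper exposes the inner generator as .generator; check the attribute names against your version):

    def log_generator_state(async_generator):
        # async_generator: ExLlamaV2DynamicGeneratorAsync
        inner = async_generator.generator        # wrapped ExLlamaV2DynamicGenerator (assumed attribute)
        print(f"async jobs:   {len(async_generator.jobs)}")
        print(f"pending jobs: {len(inner.pending_jobs)}")
        print(f"active jobs:  {len(inner.active_jobs)}")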

Defragmentation happens automatically after the completion/cancellation of a job if the queue is actually empty and the total number of pages accessed is >= the total number of pages in the cache. But I don't think fragmentation is the issue since it doesn't have a measurable impact in most cases, and in the worst case (with cache quantization specifically) it's still nowhere near what you're describing.

waterangel91 commented 3 days ago

Thank you, let me try your suggestion for troubleshooting.

One more detail I just recalled, in case it is a telltale sign of the issue: when the model becomes very slow, a request takes around 5 minutes; but right after that particular request, generation becomes fast again, and then maybe 10-15 requests later the issue appears again.

turboderp commented 2 days ago

That suggests the previous generations are still running even if you're no longer iterating on their respective async jobs. So my guess is you have tasks that look like:

    ...
    async for result in job:
        ...
        if stop_condition(...):
            break

Which would leave those jobs still active, i.e. still taking up room in the cache and still contributing to the total batch size. Result packets are still being pushed to those async jobs by the underlying generator task; you're just never pulling those results off the queue.

Eventually, you run out of space in the cache, and at some point you'll request the first result packet from a job that can't start until all those other jobs have actually finished (i.e. until they reach the token limit or a defined stop condition), causing a long delay.

If that's the case, the solution should just be to call job.cancel() before breaking the loop.
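
In other words, the loop above would become something like:

    async for result in job:
        ...
        if stop_condition(...):
            await job.cancel()    # releases the job's cache pages and removes it from the batch
            break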

waterangel91 commented 2 days ago

Thank you, let me try that as well. Also, to share: I tried the dev branch, which has this change: https://github.com/turboderp/exllamav2/commit/60eb8347b801107369fe6c914fcca15b74dfb095

The problem seems to be gone on my local PC. Do you think that change could be the reason? To be sure, I will rebuild the Docker image using the dev branch and see if the problem is gone.

I will also apply the job.cancel() fix that you mentioned and update again.

waterangel91 commented 1 day ago

Thanks for the help so far. I have logged all the token generation and realized the issue is with the LLM I am running: Mistral 7B is quite prone to falling into an endless generation mode without ever emitting any of its reserved tokens (I use all of its special tokens as EOS tokens).
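
For reference, a rough sketch of registering EOS/special tokens as stop conditions on a job (the "</s>" / "[INST]" strings here are just illustrative Mistral-style tokens, not my exact setup):

    stop_tokens = [tokenizer.eos_token_id, "</s>", "[INST]"]   # illustrative special tokens
    job = ExLlamaV2DynamicJobAsync(
        generator,
        input_ids = tokenizer.encode(prompt, add_bos = True),
        max_new_tokens = 512,
        stop_conditions = stop_tokens,
    )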

I will adjust the generation settings to try to address it. Thank you very much for the help so far. Once I figure out a setting that works, I will update here for others to refer to as well.

waterangel91 commented 1 day ago

Just to share: the issue is gone after I increased the repetition penalty. Posting here for others who face the same issue.

    settings.token_repetition_penalty = 1.11
    settings.token_frequency_penalty = 0.03

I think smaller models are prone to getting stuck in a repeating loop, and the exllamav2 default is on the conservative side at 1.05 (I think most other backends set the default at 1.1, which is a bit more aggressive). Nonetheless, the sweet spot depends on the model, so try tinkering with different values if anyone faces the same issue.
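
A minimal sketch of where these settings go, assuming they are passed to the job via gen_settings (check the dynamic generator examples for the exact parameter name in your version):

    from exllamav2.generator import ExLlamaV2Sampler, ExLlamaV2DynamicJobAsync

    settings = ExLlamaV2Sampler.Settings()
    settings.token_repetition_penalty = 1.11   # exllamav2 default is 1.05
    settings.token_frequency_penalty = 0.03

    job = ExLlamaV2DynamicJobAsync(
        generator,
        input_ids = tokenizer.encode(prompt, add_bos = True),
        max_new_tokens = 512,
        gen_settings = settings,
    )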