turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Concurrency: Add opportunities to abort on processing #373

Closed by bdashore3 6 months ago

bdashore3 commented 6 months ago

Use threading events to propagate an abort handler from prompt processing to the model's forward function. This lets the while loop terminate early, which is extremely useful in concurrent programs that use exl2 and helps when scaling to larger context sizes.
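For illustration, here is a minimal sketch of the general technique using only Python's standard `threading.Event`. It is not the code from this PR: `process_prompt`, `chunk_size`, and the `time.sleep` placeholder are hypothetical stand-ins for chunked prompt processing and the forward pass.

```python
import threading
import time


def process_prompt(tokens, chunk_size, abort_event):
    """Hypothetical stand-in for chunked prompt processing.

    Returns the number of tokens processed, or None if aborted.
    """
    processed = 0
    while processed < len(tokens):
        # Check the event between chunks so an abort takes effect
        # before the next (potentially expensive) forward pass.
        if abort_event.is_set():
            return None
        chunk = tokens[processed:processed + chunk_size]
        time.sleep(0.01)  # placeholder for the forward pass on this chunk
        processed += len(chunk)
    return processed


abort_event = threading.Event()
result = {}


def worker():
    # 1,000 chunks at ~10 ms each would take roughly 10 s if not aborted.
    result["value"] = process_prompt(list(range(10_000)), 10, abort_event)


thread = threading.Thread(target=worker)
thread.start()
time.sleep(0.05)   # let a few chunks run
abort_event.set()  # request the abort from the calling thread
thread.join()
```

The abort is cooperative: the worker only stops at the next check, so the latency of an abort is bounded by the time one chunk takes to process.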

NOTE: This PR is only a proof of concept. It currently passes an abort handler through multiple functions, which is not ideal (and doesn't even work with draft models). I propose putting the handler in the Model class and resetting it every time forward is initially called (since that's the common denominator). However, I'm not sure what other issues this will bring, especially when running multiple generations in parallel.
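One possible reading of that proposal, sketched under assumptions: the `Model` class below is a hypothetical toy, not the exllamav2 class, and `abort`, `chunk_size`, and `on_chunk` are names invented for this example. The event lives on the model and is cleared at the start of each top-level `forward` call.

```python
import threading


class Model:
    """Hypothetical toy model; illustrates only where the event lives."""

    def __init__(self):
        self.abort_event = threading.Event()

    def abort(self):
        # May be called from any thread to stop the forward call in progress.
        self.abort_event.set()

    def forward(self, tokens, chunk_size=4, on_chunk=None):
        # Reset at the start of every top-level call so an earlier abort
        # does not cancel this one.
        self.abort_event.clear()
        done = 0
        while done < len(tokens):
            if self.abort_event.is_set():
                return done, True  # (tokens processed, aborted)
            done += len(tokens[done:done + chunk_size])
            if on_chunk is not None:
                on_chunk(done)  # hypothetical hook, used here to trigger an abort
        return done, False


model = Model()
tokens = list(range(20))

# Abort once 8 tokens have been processed.
first = model.forward(tokens, on_chunk=lambda n: model.abort() if n >= 8 else None)

# The next call clears the event and runs to completion.
second = model.forward(tokens)
```

The sketch also makes the concern about parallel generations concrete: with a single event per model, `abort()` stops every `forward` call running on that instance, and a new call's `clear()` can discard an abort that was meant for a call still in flight.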

Documentation on stopping running tasks with threading: SuperFastPython, "Stop Running Tasks Using threading.Event"

turboderp commented 6 months ago

It is done.