turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

[Feature Request] A way to determine which stop sequence caused the stop (or if it was instead caused by the EOS token or `max_new_tokens`) #266

josephrocca closed this issue 3 weeks ago

josephrocca commented 6 months ago

I'm playing around with the websocket server example and thinking about the best way to hack it so it reports why generation stopped. The call to `.stream()` here:

https://github.com/turboderp/exllamav2/blob/3b0f5230e9619fc0ddf1f302785049d10a65fe26/exllamav2/server/websocket_actions.py#L192-L203

will only give us True or False, indicating that the stream has stopped, but not why.
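For context, the loop around that call does roughly this (paraphrased, not verbatim; `send_to_client` is a hypothetical stand-in for the websocket send):

```python
# Paraphrase of the loop in websocket_actions.py (approximate, not verbatim):
while True:
    chunk, eos, _ = generator.stream()   # returns (text_chunk, stopped?, tokens)
    send_to_client(chunk)                # hypothetical helper for the websocket send
    if eos:
        break   # eos is a bare bool, so we can't tell *why* generation stopped
```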

I assumed that this would be pretty simple to fix, i.e. just go to streaming.py and swap the relevant instances of `return _, True, _` for something like the sketch below.
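Purely illustrative (the names and the exact tuple shape here are mine, not from streaming.py):

```python
# Illustrative only: replace the bare True with a machine-readable stop reason.
from enum import Enum

class StopReason(Enum):
    STOP_STRING = "stop_string"        # a configured stop sequence matched
    STOP_TOKEN = "stop_token"          # EOS (or another stop token) was sampled
    MAX_NEW_TOKENS = "max_new_tokens"  # the generation limit was hit

# e.g. where streaming.py currently does `return chunk, True, tokens`, it could
# instead do `return chunk, StopReason.STOP_STRING, tokens`. Enum members are
# truthy, so existing `if eos:` checks would keep working unchanged.
```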

But it looks like it's a bit more complicated than I thought, due to token healing, "filters" (I'm not sure what these are yet), and stop sequences that span multiple tokens. I think it may require edits in ExLlamaV2Sampler too, though I'm not sure.
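To illustrate the multi-token problem, here's a toy sketch (not exllamav2's actual logic) of why stop strings spanning multiple tokens force the streamer to hold text back, which also delays knowing *which* condition fired:

```python
# Toy sketch only: a stop string like "</s>" can arrive split across several
# decoded pieces, so the streamer must buffer text, and "which stop condition
# fired" is only known some tokens after the match began.

stop_strings = ["</s>", "\nUser:"]
held = ""  # text withheld because it might be the start of a stop string

def feed(piece: str):
    """Feed one decoded piece; returns (emittable_text, stopped, matched_string)."""
    global held
    held += piece
    for s in stop_strings:
        if s in held:
            return held.split(s)[0], True, s  # full match: now we know which one
    # Hold back the longest tail that is still a prefix of some stop string
    keep = 0
    for s in stop_strings:
        for k in range(1, len(s)):
            if held.endswith(s[:k]):
                keep = max(keep, k)
    out, held = held[: len(held) - keep], held[len(held) - keep:]
    return out, False, None
```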

If there were official support for this, it would be very handy! I'll probably keep hacking on this if there's no interest in official support, but I doubt that what I produce will be pull-request-worthy, since I don't work with Python often and generally don't have much experience with this kind of code.

(Thank you so much for your work on this project. It's mind-blowing that I have Mixtral running at 60 tok/s on a 3090.)

turboderp commented 3 weeks ago

The new generator reports the stop reason along with the last token streamed. That is, it tells you whether the stop was triggered by a token or by a string, though not which specific one. I'll make a note to add that information.
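For example, with the dynamic generator you can read the reason off the final result of a job. A minimal sketch (the exact keys, e.g. "eos_reason", reflect my reading of the current API and may differ between versions; the model path is a placeholder):

```python
# Minimal sketch of reading the stop reason from the dynamic generator.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob

config = ExLlamaV2Config("/path/to/model")   # placeholder: any local EXL2 model dir
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
generator.enqueue(ExLlamaV2DynamicJob(
    input_ids=tokenizer.encode("Once upon a time"),
    max_new_tokens=200,
    stop_conditions=[tokenizer.eos_token_id, "\n\n"],
))

while generator.num_remaining_jobs():
    for result in generator.iterate():
        print(result.get("text", ""), end="")
        if result["eos"]:
            # Reports the *category* of stop, e.g. "stop_token", "stop_string"
            # or "max_new_tokens", though not (yet) which specific token/string.
            print("\nstop reason:", result.get("eos_reason"))
```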