I'm playing around with the websocket server example, and thinking about the best way to hack it to tell me why it stopped. The call to `.stream()` here: https://github.com/turboderp/exllamav2/blob/3b0f5230e9619fc0ddf1f302785049d10a65fe26/exllamav2/server/websocket_actions.py#L192-L203 will only give us either `True` or `False`, indicating whether the stream has stopped for any reason.
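For context, the relevant loop looks roughly like this (paraphrased from the linked example, not a verbatim copy; variable names are approximate):

```python
# Paraphrased from the linked websocket example (names approximate).
# All we learn from stream() is *that* generation stopped, not why.
def stream_to_client(generator, send):
    while True:
        chunk, eos, _ = generator.stream()  # eos is a plain bool
        if chunk:
            send(chunk)                     # forward the text fragment
        if eos:
            break                           # stopped -- but for which reason?
```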
I assumed that this would be pretty simple to fix - i.e. just go to `streaming.py` and swap the relevant instances of `return _, True, _` for something like:

- `return _, {"reason": "stop_sequence", "stop_string": "foo"}, _` (or maybe `stop_condition_index` instead of `stop_string`)
- `return _, {"reason": "eos_token"}, _`
- `return _, {"reason": "max_new_tokens"}, _`
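To make that concrete, here's roughly the shape I have in mind on the consuming side (just a sketch; all names here are made up by me, not exllamav2's actual code):

```python
# Sketch only -- these names are invented, not exllamav2's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StopInfo:
    reason: str                                 # "stop_sequence" | "eos_token" | "max_new_tokens"
    stop_string: Optional[str] = None           # set when reason == "stop_sequence"
    stop_condition_index: Optional[int] = None  # alternative to stop_string

def explain_stop(stop: StopInfo) -> str:
    """Turn a StopInfo into a human-readable message for the client."""
    if stop.reason == "stop_sequence":
        return f"hit stop sequence {stop.stop_string!r}"
    if stop.reason == "eos_token":
        return "model emitted its EOS token"
    return "reached max_new_tokens"
```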
But it looks like it's a bit more complicated than I thought due to token healing, "filters" (I'm not sure what these are yet), and stop sequences that span multiple tokens. I think it may require edits in `ExLlamaV2Sampler` too? Not too sure though.
If there were official support for this, that would be very handy! I will probably try to keep hacking on this if there's no interest in official support, but I doubt that what I produce will be pull-request-worthy, since I don't deal with Python code often and generally don't have much experience with this stuff.
(Thank you so much for your work on this project. Mind-blowing that I have Mixtral running at 60 tok/s on a 3090)
The new generator reports the stop reason on the last token streamed. I.e. it tells you if it was a token or a string, though it doesn't tell you which one. I'll make a note to add that information.
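(A hypothetical sketch of what consuming that could look like, assuming the dict-returning `stream_ex()` variant and an `eos_reason` field on the final result - the actual field names and values may differ, so check the generator code:)

```python
# Hypothetical -- assumes stream_ex() returns a dict and that the final
# result carries an "eos_reason"; verify the real field names/values
# against the current generator code.
def stream_with_reason(generator, send):
    while True:
        res = generator.stream_ex()
        if res["chunk"]:
            send(res["chunk"])
        if res["eos"]:
            # e.g. something like "stop_token" / "stop_string" / "max_new_tokens"
            return res.get("eos_reason", "unknown")
```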