I'm playing around with the websocket server example, and thinking about the best way to hack it to tell me why it stopped. The call to `.stream()` here: https://github.com/turboderp/exllamav2/blob/3b0f5230e9619fc0ddf1f302785049d10a65fe26/exllamav2/server/websocket_actions.py#L192-L203 will only give us either `True` or `False`, indicating whether the stream has stopped for any reason.
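For context, the relevant loop looks roughly like this (paraphrased from the linked example, not a verbatim copy; variable names are approximate):

```python
# Paraphrased from the linked websocket example (names approximate).
# All we learn from stream() is *that* generation stopped, not why.
def stream_to_client(generator, send):
    while True:
        chunk, eos, _ = generator.stream()  # eos is a plain bool
        if chunk:
            send(chunk)                     # forward the text fragment
        if eos:
            break                           # stopped -- but for which reason?
```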
I assumed that this would be pretty simple to fix - i.e. just go to `streaming.py` and swap the relevant instances of `return _, True, _` for something like:

- `return _, {"reason": "stop_sequence", "stop_string": "foo"}, _` (or maybe `stop_condition_index` instead of `stop_string`)
- `return _, {"reason": "eos_token"}, _`
- `return _, {"reason": "max_new_tokens"}, _`
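To make that concrete, here's roughly the shape I have in mind on the consuming side (just a sketch; all names here are made up by me, not exllamav2's actual code):

```python
# Sketch only -- these names are invented, not exllamav2's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class StopInfo:
    reason: str                                 # "stop_sequence" | "eos_token" | "max_new_tokens"
    stop_string: Optional[str] = None           # set when reason == "stop_sequence"
    stop_condition_index: Optional[int] = None  # alternative to stop_string

def explain_stop(stop: StopInfo) -> str:
    """Turn a StopInfo into a human-readable message for the client."""
    if stop.reason == "stop_sequence":
        return f"hit stop sequence {stop.stop_string!r}"
    if stop.reason == "eos_token":
        return "model emitted its EOS token"
    return "reached max_new_tokens"
```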
But it looks like it's a bit more complicated than I thought due to token healing, "filters" (I'm not sure what these are yet), and stop sequences that span multiple tokens. I think it may require edits in `ExLlamaV2Sampler` too? Not too sure though.
If there were official support for this, that would be very handy! I will probably try to keep hacking on this if there's no interest in official support, but I doubt that what I produce will be pull-request-worthy, since I don't deal with Python code often and generally don't have much experience with this stuff.
(Thank you so much for your work on this project. Mind-blowing that I have Mixtral running at 60 tok/s on a 3090)
The new generator reports the stop reason on the last token streamed. I.e. it tells you if it was a token or a string, though it doesn't tell you which one. I'll make a note to add that information.
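(A hypothetical sketch of what consuming that could look like, assuming the dict-returning `stream_ex()` variant and an `eos_reason` field on the final result - the actual field names and values may differ, so check the generator code:)

```python
# Hypothetical -- assumes stream_ex() returns a dict and that the final
# result carries an "eos_reason"; verify the real field names/values
# against the current generator code.
def stream_with_reason(generator, send):
    while True:
        res = generator.stream_ex()
        if res["chunk"]:
            send(res["chunk"])
        if res["eos"]:
            # e.g. something like "stop_token" / "stop_string" / "max_new_tokens"
            return res.get("eos_reason", "unknown")
```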