replicate / cog-triton

A cog implementation of Nvidia's Triton server
Apache License 2.0

operate on token strings instead of strings #10

Closed joehoover closed 9 months ago

joehoover commented 9 months ago

This PR reimplements stop word handling so that it operates on token strings instead of strings. This is desirable because it matches how tensorrtllm_backend handles stop words.

Specifically, the backend stops generation if and only if a sequence of generated tokens exactly matches the token sequence that constitutes the stop word.
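To illustrate the difference, here is a toy sketch. The token lists are made up for the example and are not output from a real tokenizer, and both helper names are hypothetical. It shows how the same text can trigger a string-level match but not a token-level one.

```python
from typing import Sequence


def ends_with_string(generated: Sequence[str], stop_text: str) -> bool:
    """String-level check: detokenize everything and compare text."""
    return "".join(generated).endswith(stop_text)


def ends_with_token_sequence(generated: Sequence[str], stop_tokens: Sequence[str]) -> bool:
    """Token-level check: the last tokens must equal the stop tokens exactly."""
    n = len(stop_tokens)
    return n > 0 and list(generated[-n:]) == list(stop_tokens)


# Assumption for illustration: the stop word "\n\n" is registered as ONE token.
stop_tokens = ["\n\n"]

single_token = ["Hi", "\n\n"]    # the model emitted the two-newline token
split_tokens = ["Hi", "\n", "\n"]  # the model emitted two one-newline tokens

print(ends_with_string(single_token, "\n\n"), ends_with_token_sequence(single_token, stop_tokens))
print(ends_with_string(split_tokens, "\n\n"), ends_with_token_sequence(split_tokens, stop_tokens))
```

In the second case a string-based handler would consider the stop word matched, while a token-based check would not, so generation would continue.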

The new stop sequence handler replicates this behavior. Its primary function is to cache partial stop sequence matches until they are either fulfilled or violated. In the former case, we do not yield the stop sequence even though the backend does. In the latter case, we release all of the cached tokens at once.
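A minimal sketch of the caching logic described above, not the code in this PR. The class name `StopSequenceHandler`, its `feed` and `flush` methods, and the choice to release any tokens cached ahead of a fulfilled stop sequence are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple


class StopSequenceHandler:
    """Hypothetical handler that matches stop sequences token by token."""

    def __init__(self, stop_sequences: Iterable[Sequence[str]]):
        # Each stop sequence is a tuple of token strings; empty ones are ignored.
        self.stop_sequences: List[Tuple[str, ...]] = [
            tuple(s) for s in stop_sequences if len(s) > 0
        ]
        self.cache: List[str] = []
        self.stopped = False

    def _is_partial_match(self, candidate: Tuple[str, ...]) -> bool:
        # True if candidate is a proper prefix of at least one stop sequence.
        n = len(candidate)
        return any(len(s) > n and s[:n] == candidate for s in self.stop_sequences)

    def feed(self, token: str) -> List[str]:
        """Consume one token and return the tokens that are safe to emit."""
        self.cache.append(token)

        # Fulfilled: the cache ends with a full stop sequence. Drop the stop
        # sequence itself and release only what preceded it.
        for stop in self.stop_sequences:
            if tuple(self.cache[-len(stop):]) == stop:
                released = self.cache[: -len(stop)]
                self.cache = []
                self.stopped = True
                return released

        # Otherwise keep the longest cached suffix that could still become a
        # stop sequence, and release everything before it.
        for start in range(len(self.cache)):
            if self._is_partial_match(tuple(self.cache[start:])):
                released = self.cache[:start]
                self.cache = self.cache[start:]
                return released

        # Violated: nothing in the cache can start a stop sequence.
        released, self.cache = self.cache, []
        return released

    def flush(self) -> List[str]:
        """Release whatever is still cached when the stream ends."""
        released, self.cache = self.cache, []
        return released


handler = StopSequenceHandler([["<", "end", ">"]])
emitted = [handler.feed(t) for t in ["Hello", "<", "end", "!", "<", "end", ">"]]
print(emitted, handler.stopped)
```

In this run the first `<`, `end` pair is held back and then released together with `!` once the match is violated, while the second, complete `<`, `end`, `>` is never emitted.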

Finally, the new version will throw an error if we register the fulfillment of a stop sequence but Triton continues generating. This can only happen when our stop sequence handler's behavior does not match the backend's, which is a state that we must prevent.
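The guard could look roughly like the sketch below. It is simplified to a single stop sequence, and `StopSequenceMismatchError` and `guarded_stream` are hypothetical names, not identifiers from this PR or from Triton.

```python
from typing import Iterable, Iterator, List, Sequence


class StopSequenceMismatchError(RuntimeError):
    """Hypothetical error: a token arrived after a stop sequence was fulfilled."""


def guarded_stream(tokens: Iterable[str], stop: Sequence[str]) -> Iterator[str]:
    """Yield tokens, withholding the stop sequence and failing on late tokens."""
    stop = list(stop)
    if not stop:
        raise ValueError("stop sequence must contain at least one token")

    pending: List[str] = []
    fulfilled = False
    for token in tokens:
        if fulfilled:
            # We believe generation should be over, but the backend disagrees.
            raise StopSequenceMismatchError(
                f"received {token!r} after the stop sequence was fulfilled"
            )
        pending.append(token)
        if pending[-len(stop):] == stop:
            # Fulfilled: emit what preceded the stop sequence, never the sequence.
            fulfilled = True
            yield from pending[: -len(stop)]
            pending = []
            continue
        # Keep only the suffix that is still a prefix of the stop sequence.
        start = 0
        while start < len(pending) and pending[start:] != stop[: len(pending) - start]:
            start += 1
        yield from pending[:start]
        pending = pending[start:]

    if not fulfilled:
        # The stream ended mid-match, so the cached tokens were ordinary output.
        yield from pending


print(list(guarded_stream(["a", "<", ">"], ["<", ">"])))

try:
    list(guarded_stream(["a", "<", ">", "b"], ["<", ">"]))
except StopSequenceMismatchError as err:
    print("mismatch:", err)
```

Because `guarded_stream` is a generator, the error surfaces only when the consumer pulls the offending token, which is the point at which handler and backend have visibly diverged.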

For example, in the previous implementation, when the backend continued generating after we registered a stop sequence, we would withhold the stop sequence itself but then keep emitting the tokens that followed it. That's maximally confusing and bad.