replicate / cog-triton

A cog implementation of Nvidia's Triton server
Apache License 2.0

Joe/lang 197 llama generation does not stop when it should eos problem #8

Closed (joehoover closed 5 months ago)

joehoover commented 5 months ago

This PR:

Note: Not all tokenizers set EOS to `end_id=2`, so this needs to be configurable. I have another PR that makes server startup config-based; we should handle configuring `end_id` there. Merging with `end_id=2` as the default for now.
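For context, here is a minimal sketch of what passing an end-token id with a Triton request can look like. It assumes the `tritonclient` HTTP client and a TensorRT-LLM-style model exposing `text_input`, `max_tokens`, and `end_id` input tensors; the model name, tensor names, and shapes are assumptions, not taken from this repo.

```python
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

# Assumed names: the actual model and tensor names in cog-triton may differ.
MODEL_NAME = "ensemble"


def make_input(name: str, data: np.ndarray) -> httpclient.InferInput:
    # Wrap a numpy array as a Triton input tensor.
    inp = httpclient.InferInput(name, list(data.shape), np_to_triton_dtype(data.dtype))
    inp.set_data_from_numpy(data)
    return inp


client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = [
    make_input("text_input", np.array([["Why is the sky blue?"]], dtype=object)),
    make_input("max_tokens", np.array([[512]], dtype=np.int32)),
    # The fix: tell the server which token id terminates generation.
    # 2 is Llama's </s>, but not all tokenizers use 2, hence the need for config.
    make_input("end_id", np.array([[2]], dtype=np.int32)),
]

result = client.infer(MODEL_NAME, inputs)
print(result.as_numpy("text_output"))
```

Without `end_id`, the server has no stop token to match against, so generation runs to `max_tokens` even after the model emits EOS.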

linear[bot] commented 5 months ago
LANG-197 Llama generation does not stop when it should. EOS problem.

We noticed that Llama chat generations (13B specifically) were not terminating as expected. The fix turned out to be passing an `end_token` id with the request to the Triton server. While fixing this issue, we also discovered and fixed a couple of bugs in how we were handling decoded token yielding (see the sketch below):

* We were mangling emojis. This was fixed using the logic we previously implemented in cog-llama-template.
* Emission of empty strings broke our string handling such that the entire output would be yielded. The fix was a more sensible index against the previous output.
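To make the two detokenization fixes concrete, here is a hedged sketch of the general technique rather than this repo's exact code: hold back output while the decoder emits the U+FFFD replacement character (which happens when a multi-byte character such as an emoji is split across tokens), and yield only the delta indexed against previously yielded text, so an empty decode yields nothing instead of re-emitting the whole output. The `tokenizer.decode` interface is assumed to be Hugging Face-style; everything else is illustrative.

```python
from typing import Iterator, List

REPLACEMENT_CHAR = "\ufffd"  # produced when a partial UTF-8 sequence is decoded


def stream_detokenize(token_ids: Iterator[int], tokenizer) -> Iterator[str]:
    """Yield decoded text incrementally, one clean delta per new token.

    Sketch only: `tokenizer` is assumed to expose a Hugging Face-style
    `decode(ids) -> str` method.
    """
    ids: List[int] = []
    previously_yielded = 0  # index into the decoded string, not the id list

    for token_id in token_ids:
        ids.append(token_id)
        text = tokenizer.decode(ids)

        # A trailing replacement char means we split a multi-byte character
        # (e.g. an emoji) across tokens; wait for more ids before yielding.
        if text.endswith(REPLACEMENT_CHAR):
            continue

        # Yield only what's new. Indexing against the previous output length
        # means an empty decode produces "" rather than the entire output.
        delta = text[previously_yielded:]
        previously_yielded = len(text)
        if delta:
            yield delta
```

The key design choice is that `previously_yielded` indexes into the decoded string rather than tracking token counts, which is what makes empty or partial decodes safe.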