Closed kaizizzzzzz closed 2 months ago
In a production system like a chat app, you're likely using a combination of continuous batching and paged attention to cut down on token waste. This blog post is very good: https://www.anyscale.com/blog/continuous-batching-llm-inference
Really appreciate it!
For efficient inference, it seems we should determine the input token length dynamically based on the customer's input. But here, in interactive mode, the input token length is set to max_input_token_length, which confuses me a little. In today's chat tools, do they fix the input token length and pad with zeros, or do they determine the length dynamically?
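To make the trade-off concrete, here's a minimal pure-Python sketch contrasting the two strategies: padding every request to a fixed max_input_token_length versus padding only to the longest sequence in the current batch. The function names, the budget of 32, and the toy token ids are all my own illustration, not taken from this codebase.

```python
MAX_INPUT_TOKEN_LENGTH = 32  # hypothetical fixed budget, as in the interactive mode

def pad_fixed(batch, max_len=MAX_INPUT_TOKEN_LENGTH, pad_id=0):
    """Pad every sequence to a static length, regardless of actual input size."""
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

def pad_dynamic(batch, pad_id=0):
    """Pad only to the longest sequence actually present in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Toy batch of token-id sequences with lengths 3, 5, and 1.
batch = [[1, 2, 3], [4, 5, 6, 7, 8], [9]]

fixed = pad_fixed(batch)
dynamic = pad_dynamic(batch)

# Count pad tokens wasted under each strategy (token ids here are all nonzero).
wasted_fixed = sum(s.count(0) for s in fixed)      # 29 + 27 + 31 = 87
wasted_dynamic = sum(s.count(0) for s in dynamic)  # 2 + 0 + 4 = 6
print(wasted_fixed, wasted_dynamic)
```

Dynamic padding wastes far fewer tokens, which is part of why production serving stacks go further still (continuous batching, paged attention) rather than padding to a global maximum.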