pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
BSD 3-Clause "New" or "Revised" License

Input token length question #160

Closed: kaizizzzzzz closed this issue 2 months ago

kaizizzzzzz commented 2 months ago

For efficient inference, it seems we should determine the input token length dynamically based on the customer's input. But here, in interactive mode, the input token length is set to max_input_token_length. I'm a little confused about this. In today's chat tools, do they always fix the input token length and pad with zeros, or do they get the length dynamically?
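
To make the two options concrete, here is a minimal sketch of what I mean (the pad id, buffer size, and token ids are made up for illustration, not gpt-fast's actual values):

```python
import torch

PAD_ID = 0                      # hypothetical pad token id
MAX_INPUT_TOKEN_LENGTH = 512    # stand-in for the fixed buffer size

def fixed_length(prompt_ids: list[int]) -> torch.Tensor:
    # Option 1: always allocate a max-length buffer and zero-pad the tail.
    buf = torch.full((MAX_INPUT_TOKEN_LENGTH,), PAD_ID, dtype=torch.long)
    buf[: len(prompt_ids)] = torch.tensor(prompt_ids, dtype=torch.long)
    return buf

def dynamic_length(prompt_ids: list[int]) -> torch.Tensor:
    # Option 2: size the tensor to the actual prompt.
    return torch.tensor(prompt_ids, dtype=torch.long)

prompt = [101, 2023, 2003, 102]       # made-up token ids
print(fixed_length(prompt).shape)     # torch.Size([512])
print(dynamic_length(prompt).shape)   # torch.Size([4])
```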

msaroufim commented 2 months ago

In a production system like a chat app, you're likely to be using a combination of continuous batching and paged attention to deal with token waste. This blog post is very good: https://www.anyscale.com/blog/continuous-batching-llm-inference
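
Roughly, the idea is that finished sequences free their batch slot immediately and waiting requests slide in between decode steps, instead of every prompt being padded to one fixed length and the whole batch running to completion. Here is a toy sketch of that scheduling loop; `generate_step`, the request dicts, and `max_batch` are invented for the example and don't correspond to gpt-fast's or any real serving system's API:

```python
from collections import deque

def generate_step(seq):
    # Hypothetical per-sequence decode step: append one token and
    # report whether the sequence has reached its target length.
    seq["tokens"].append(0)
    return len(seq["tokens"]) >= seq["target_len"]

def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit new requests into any free slots between decode steps.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step per running sequence; finished ones leave
        # the batch immediately rather than waiting for the longest.
        for seq in list(running):
            if generate_step(seq):
                running.remove(seq)
                finished.append(seq)
    return finished

reqs = [{"tokens": [1] * n, "target_len": n + k}
        for n, k in [(3, 2), (5, 7), (2, 1), (4, 4), (6, 3)]]
print([len(s["tokens"]) for s in continuous_batching(reqs)])
```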

kaizizzzzzz commented 2 months ago

Really appreciate it!