Closed kentative closed 6 months ago
Hi @kentative
There are two models in this use case.
Since the Prompt Processor is the initializer, we provide the input prompt to it as is. Here we are referring to the maximum context length that can be fed to the model; if the input ids are shorter than the max context length, the fixed shape can be achieved by prefixing with 0 / UNK padding.
The Prompt Processor outputs 1024 tokens, but for our use case only the last token is of interest, which we feed to the Token Generator to produce the next outputs.
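The padding scheme described above can be sketched as follows. This is a minimal illustration under assumptions: the 1024 max context length comes from the thread, while `UNK_ID` and the helper name are hypothetical stand-ins (check the model's actual tokenizer for the real padding id):

```python
UNK_ID = 0          # padding / unknown-token id (assumed; model-specific)
MAX_CONTEXT = 1024  # fixed context length of the Prompt Processor

def pad_to_max_context(input_ids):
    """Left-pad the prompt with UNK so the model always sees a fixed shape."""
    if len(input_ids) > MAX_CONTEXT:
        raise ValueError("prompt longer than max context length")
    return [UNK_ID] * (MAX_CONTEXT - len(input_ids)) + input_ids

padded = pad_to_max_context([101, 2023, 2003, 102])  # 4 real tokens
# padded now has length 1024; the last 4 entries are the real prompt
```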
Let me try to answer your questions now.
Time to first token (for input of a given token length)
This is the time to process the prompt and output the first token, i.e., the time to run the Prompt Processor end to end.
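In other words, TTFT can be measured by timing one end-to-end Prompt Processor run. A minimal sketch; the `prompt_processor` function here is a hypothetical stand-in for the real model, not the actual API:

```python
import time

def prompt_processor(input_ids):
    # Hypothetical stand-in for the real model: one output per input position.
    return [0.0] * len(input_ids)

prompt = [0] * 1024  # prompt padded to the fixed max context length
start = time.perf_counter()
outputs = prompt_processor(prompt)
ttft = time.perf_counter() - start  # time to first token = one e2e run
first_token_logits = outputs[-1]    # only the last position is of interest
```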
Token per second (same input length) How do I extrapolate that from the metrics on that page?
Since only the last token is of interest, for the Prompt Processor we assume that the entire inference time is spent generating this one token, and we disregard all previously generated tokens.
Does this mean the input prompt was maxed out and it corresponds to 8.48 token/sec: Max context length: 1024. Prompt Processor input: 1024 tokens. Llama-TokenGenerator-KVCache-Quantized: 8.48 tokens/s.
What does this mean, the output is 1 token? Prompt Processor output: 1 output token + KV cache for the Token Generator.
Here, you seem to have mixed up the Prompt Processor and the Token Generator. For the Prompt Processor we have showcased the following metrics.
Since only one output token of the Prompt Processor is of interest, we disregard all previous tokens and consider only the last one: the Prompt Processor generates 0.38 tokens/s (counting only the last token over the full inference time). Note that the Prompt Processor also generates state (the KV cache) that captures the context to be used by the second model.
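The hand-off between the two models can be sketched like this. Every name below is a hypothetical stand-in for illustration, not the actual API: the Prompt Processor yields the last token plus a KV cache, and the Token Generator consumes both autoregressively:

```python
def run_prompt_processor(input_ids):
    # Hypothetical: returns (last_token, kv_cache). The earlier outputs are
    # discarded; the KV cache carries the prompt's context forward.
    kv_cache = {"len": len(input_ids)}  # stand-in for the attention state
    return input_ids[-1], kv_cache

def run_token_generator(token, kv_cache, steps=3):
    # Hypothetical autoregressive loop: each step consumes the previous
    # token and the cache, and emits one new token.
    generated = []
    for _ in range(steps):
        token = token + 1       # stand-in for a real decode step
        kv_cache["len"] += 1    # cache grows by one position per step
        generated.append(token)
    return generated

last_token, cache = run_prompt_processor([0] * 1020 + [7, 8, 9, 10])
completion = run_token_generator(last_token, cache)
```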
Are there memory metrics for this model?
We do have peak inference memory metrics listed on the web page. Note that this is peak inference memory and does not capture memory-mapped (mmapped) weights.
Does this answer your questions? Happy to update our website if this is not clear.
Thank you for the detailed response. It makes more sense now.
The part I am still not clear on is the tokens/sec for the Prompt Processor. I think it's because I am looking at it with the assumption that prompt-processing speed scales quadratically with input length due to the attention mechanism.
For example, with the maximum input of 1024 tokens, the speed is 0.38 tokens/sec. What if the input is 500 tokens: is the speed still 0.38 tokens/sec, or would it be faster?
The Prompt Processor takes 1024 input tokens and generates 1024 output tokens in roughly 2.63 seconds, but since only one output token is counted, we ignore the first 1023 tokens.
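That works out as follows: counting only the single useful output token over the full inference time reproduces the quoted throughput.

```python
inference_time_s = 2.63  # e2e Prompt Processor run time (from this thread)
useful_tokens = 1        # only the last output token is counted
throughput = useful_tokens / inference_time_s
print(round(throughput, 2))  # 0.38 tokens/s, matching the published metric
```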
The part I am still not clear on is the tokens/sec for the Prompt Processor. I think it's because I am looking at it with the assumption that prompt-processing speed scales quadratically with input length due to the attention mechanism.
That is correct, but in this case we are working with a fixed input token size and hence are not sharing metrics for varying input lengths at this moment. We will soon share a recipe starting from torch / onnx that will allow developers to experiment with different input sizes.
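For intuition on the quadratic scaling mentioned above: self-attention compares every token with every other token, so its compute grows with the square of the sequence length. A back-of-the-envelope sketch; the cost model and `head_dim` are illustrative assumptions, and only the ratio matters:

```python
def attention_cost(seq_len, head_dim=128):
    # Dominant attention terms (QK^T and the attention-weighted V) are each
    # roughly seq_len^2 * head_dim multiply-adds per head.
    return 2 * seq_len * seq_len * head_dim

ratio = attention_cost(500) / attention_cost(1024)
# A 500-token prompt needs only ~24% of the attention compute of a
# 1024-token prompt -- but a fixed-shape model still pays the full cost.
print(round(ratio, 3))  # 0.238
```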
For example, with the maximum input of 1024 tokens, the speed is 0.38 tokens/sec. What if the input is 500 tokens: is the speed still 0.38 tokens/sec, or would it be faster?
It should be faster, but as mentioned above we are currently working with a fixed input shape. As the Prompt Processor is not run often, this should have minimal impact in my opinion, but we will lift this restriction in upcoming releases.
Got it! Thank you very much, looking forward to your future updates!
The metrics for Llama 2 7B are confusing. Can you provide clarification on what the metrics represent? For example, I am interested in these metrics:
Does this mean the input prompt was maxed out and that it corresponds to 8.48 tokens/sec:
What does this mean, that the output is 1 token?
Also, are there memory metrics for this model?
Thank you!