quic / ai-hub-models

The Qualcomm® AI Hub Models are a collection of state-of-the-art machine learning models optimized for performance (latency, memory etc.) and ready to deploy on Qualcomm® devices.
https://aihub.qualcomm.com
BSD 3-Clause "New" or "Revised" License

Question on LLama 7B Q4 Metrics #18

Closed: kentative closed this 6 months ago

kentative commented 6 months ago

The metrics on Llama2 7B are confusing. Can you provide clarification on what the metrics represent? For example, if I am interested in these metrics:

  1. Time to first token (for input of a given token length)
  2. Tokens per second (same input length)

How do I extrapolate these from the metrics on that page?

Does this mean the input prompt was maxed out and it corresponds to 8.48 tokens/sec?

What does this mean, the output is 1 token?

Also, are there memory metrics for this model?

Thank you!

bhushan23 commented 6 months ago

Hi @kentative

There are two models in this use case:

  1. Prompt Processor: used to set the context and initialize the conversation.
  2. Token Generator: a KV-cache-based token generator for fast processing of subsequent output generation.

Since the Prompt Processor is the initializer, we provide the input prompt to it as is. In this case we are referring to the max context length that can be fed to the model. If the input ids are shorter than the max context length, this is handled by prefixing them with 0 / UNK padding tokens.

The Prompt Processor outputs 1024 tokens, but for our use case only the last token is of interest; we feed it to the Token Generator to produce the next outputs.
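For illustration, here is a minimal Python sketch of that flow, padding a shorter prompt up to the fixed 1024-token context and keeping only the last position's output. `prompt_processor`, `token_generator`, `tokenizer`, and `PAD_ID` are placeholders for this sketch, not the actual AI Hub Models API.

```python
import numpy as np

MAX_CONTEXT = 1024
PAD_ID = 0  # 0 / UNK used as filler, per the explanation above

def prepare_input_ids(prompt_ids: list[int]) -> np.ndarray:
    """Prefix a shorter prompt with pad tokens to match the fixed input shape."""
    n_pad = MAX_CONTEXT - len(prompt_ids)
    if n_pad < 0:
        raise ValueError("prompt is longer than the model's max context length")
    return np.array([PAD_ID] * n_pad + list(prompt_ids), dtype=np.int64)[None, :]  # (1, 1024)

# Hypothetical usage (prompt_processor / token_generator stand in for the real models):
# input_ids = prepare_input_ids(tokenizer.encode(prompt))
# logits, kv_cache = prompt_processor(input_ids)      # runs the full 1024-token pass
# first_token = int(logits[0, -1].argmax())           # only the last position is kept
# next_token, kv_cache = token_generator(first_token, kv_cache)
```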

Let me try to answer your questions now.

Time to first token (for input of a given token length)

This is the time to process the prompt and output the first token, hence the time to run the Prompt Processor end to end.

Tokens per second (same input length): how do I extrapolate these from the metrics on that page?

Since only the last token is of interest, for the Prompt Processor we assume all the inference time is taken to generate this one token, and we disregard all previously generated tokens.

Does this mean the input prompt was maxed out and it corresponds to 8.48 tokens/sec?
Max context length: 1024
Prompt processor input: 1024 tokens
Llama-TokenGenerator-KVCache-Quantized: 8.48 tokens/s

What does this mean, the output is 1 token?
Prompt processor output: 1 output token + KV cache for the token generator

Here, you seem to have mixed up the Prompt Processor and the Token Generator. For the Prompt Processor we have showcased the following metrics:

[Image: Prompt Processor metrics from the model page]

And since only 1 output token of the Prompt Processor is of interest, we disregard all previous tokens and only consider the last one. The Prompt Processor generates 0.38 tokens/s (counting only the last token against the given inference time). Note that the Prompt Processor also generates state (the KV cache), which captures the context to be used by the second model.

are there memory metrics for this model?

We do have peak inference memory metrics mentioned on the web page. Note that this is peak inference memory and does not capture memory-mapped (mmapped) weights.
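As a rough host-side analogy of why mmapped weights do not show up in that number (this is a general Unix/Python illustration, not how the on-device runtime is measured; "weights.bin" is a placeholder path): mapping a file costs almost nothing in resident memory, and only the pages actually touched during inference get counted.

```python
import mmap
import resource  # Unix-only

def rss_mb() -> float:
    # ru_maxrss is reported in KB on Linux (bytes on macOS), so this is approximate
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

with open("weights.bin", "rb") as f:  # placeholder weight file
    print(f"before map:  ~{rss_mb():.0f} MB resident")
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(f"after map:   ~{rss_mb():.0f} MB resident (the mapping itself is nearly free)")
    for off in range(0, min(len(weights), 100 * 4096), 4096):
        _ = weights[off]  # touching a page faults it in on demand
    print(f"after touch: ~{rss_mb():.0f} MB resident (only touched pages count)")
```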

Does this answer your questions? Happy to update our website if this is not clear.

kentative commented 6 months ago

Thank you for the detailed response. It makes more sense now.

The part I am still not clear on is the tokens/sec for the Prompt Processor. I think it's because I am looking at it with the assumption that prompt-processing speed scales quadratically with input length due to the attention mechanism.

For example, with the maximum input of 1024 tokens, the speed is 0.38 tokens/sec. If the input is 500 tokens, is the speed still 0.38 tokens/sec, or would it be faster?

bhushan23 commented 6 months ago

The Prompt Processor takes 1024 input tokens and generates 1024 output tokens in roughly 2.63 seconds, but since only 1 output token is to be considered, we ignore the first 1023 tokens.
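For concreteness, here is a small worked example tying together the numbers quoted in this thread (~2.63 s for the full prompt pass, 8.48 tokens/s for the KV-cache token generator); the 128-token reply length is an arbitrary illustration, not a figure from the model page.

```python
# Figures taken from this thread; only the arithmetic is new.
prompt_time_s = 2.63                 # end-to-end Prompt Processor run == time to first token
pp_tokens_per_s = 1 / prompt_time_s  # only the last token counts
print(f"Prompt Processor: {pp_tokens_per_s:.2f} tokens/s")   # ~0.38, matching the model page

tg_tokens_per_s = 8.48
n_new_tokens = 128                   # arbitrary example response length
total_s = prompt_time_s + n_new_tokens / tg_tokens_per_s
print(f"Estimated time for a {n_new_tokens}-token reply: {total_s:.1f} s")
```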

The part I am still not clear on is the tokens/sec for the Prompt Processor. I think it's because I am looking at it with the assumption that prompt-processing speed scales quadratically with input length due to the attention mechanism.

That is correct, but in this case we are working with a fixed input token size and hence are not sharing metrics for varying input lengths at the moment. We will soon share a recipe starting from torch / onnx that will allow developers to experiment with different input sizes.

For example, with the maximum input of 1024 tokens, the speed is 0.38 tokens/sec. If the input is 500 tokens, is the speed still 0.38 tokens/sec, or would it be faster?

It should be faster, but as mentioned above we are currently working with a fixed input shape. Since the Prompt Processor is not run often, this should have minimal impact in my opinion, but we will lift this restriction in upcoming releases.
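As a rough back-of-envelope for the 500-token case, assuming prompt-processing cost is dominated by attention and scales roughly quadratically with sequence length (the assumption discussed above), one might estimate:

```python
# Back-of-envelope only: assumes cost ~ (sequence length)^2 and ignores linear terms;
# the 2.63 s figure is the full-context Prompt Processor time quoted above.
full_len, short_len = 1024, 500
full_time_s = 2.63
est_short_time_s = full_time_s * (short_len / full_len) ** 2
print(f"estimated ~{est_short_time_s:.2f} s for a {short_len}-token prompt "
      f"vs {full_time_s:.2f} s at {full_len} tokens")
# With today's fixed 1024-token input shape the shorter prompt is padded,
# so it still takes ~2.63 s; the speedup only applies once variable shapes land.
```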

kentative commented 6 months ago

Got it! Thank you very much, looking forward to your future updates!