Closed · vidyasiv closed this issue 3 months ago
@vidyasiv thank you for raising this issue. I think that tgi-gaudi should support different types of KV cache. However, not only the data type differs here, but also the overall tensor shape and content, am I right? At the moment tgi-gaudi assumes that the KV cache has Llama's shape: a list of tuples with tensors of shape [batch_size, num_heads, seq_length, head_dim].
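For illustration, the Llama-style layout described above can be sketched in plain Python. Shapes stand in for real tensors, and the helper name is mine, not from either repository:

```python
# Llama-style KV cache: one (key, value) tuple per decoder layer.
# Each entry has shape [batch_size, num_heads, seq_length, head_dim].
# A "tensor" is modeled here as just its shape tuple for illustration.

def make_llama_style_cache(num_layers, batch_size, num_heads, seq_length, head_dim):
    shape = (batch_size, num_heads, seq_length, head_dim)
    # list over layers, each layer a (key, value) pair
    return [(shape, shape) for _ in range(num_layers)]

cache = make_llama_style_cache(num_layers=2, batch_size=4, num_heads=32,
                               seq_length=128, head_dim=128)
assert isinstance(cache, list)
assert all(isinstance(layer, tuple) and len(layer) == 2 for layer in cache)
```

gpt_bigcode (multi-query attention) does not fit this layout, which is where the shape assumption breaks down.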
Thanks for explaining! Although I am confused by the KV-cache-related code being in two places, i.e. both tgi-gaudi and optimum-habana. Is there a difference in what is implemented in each repository?
What exactly is not clear? In tgi-gaudi there is only aligning of data in the KV cache (e.g. a shift-left operation, inserting a new request, etc.). However, because of those operations we cannot use the reuse-cache
flow, as the KV cache has to be available from outside the model. Communication between tgi-gaudi and optimum-habana is done mostly via the forward() function and its input/output arguments (one of which is the KV cache).
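The shift-left alignment mentioned above can be sketched minimally. This is a hypothetical helper with plain lists standing in for tensors; the real code operates on Habana tensors along the seq_length axis:

```python
# Shift-left: drop the oldest cached positions along the sequence axis
# so new tokens can be appended without growing the cache.
def shift_left(layer_cache, shift=1):
    # layer_cache: per-layer list of cached positions, oldest first
    return layer_cache[shift:]

cache = ["pos0", "pos1", "pos2", "pos3"]
assert shift_left(cache) == ["pos1", "pos2", "pos3"]
```

Because tgi-gaudi mutates the cache like this between forward passes, the cache cannot live only inside the model (the reuse-cache flow), which is why it crosses the forward() boundary.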
System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi: N/A
Information
Tasks
Reproduction
And the error is:
The expectation in the optimum-habana function gaudi_gpt_bigcode_model_forward() is for past_key_values to be a list of tensors, and it is for the above cases. However, output.past_key_values received in the first forward pass here: https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/causal_lm.py#L782 is a list of tensors, but before the second pass it becomes a list of tuples due to the attach_kv_cache() function (code: https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/causal_lm.py#L253 ), which is incompatible with optimum-habana.
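The type mismatch can be sketched minimally. Strings stand in for torch.Tensor, and the tuple re-wrapping below is my paraphrase of the effect of attach_kv_cache(), not the actual code:

```python
# gpt_bigcode uses multi-query attention, so key and value are fused into
# a single tensor per layer: past_key_values is a plain list of tensors
# after the first forward pass, which is what
# gaudi_gpt_bigcode_model_forward() expects.
expected = ["layer0_kv", "layer1_kv"]  # stand-ins for torch.Tensor

# A Llama-style re-attach wraps each layer's cache into a tuple before the
# second pass (hypothetical re-wrapping mimicking attach_kv_cache()):
received = [(kv,) for kv in expected]

# The structural difference that breaks the second forward pass:
assert all(not isinstance(layer, tuple) for layer in expected)
assert all(isinstance(layer, tuple) for layer in received)
```

In other words, the container type changes between passes while optimum-habana assumes it stays a list of tensors throughout.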
I can help fix this but need clarification on which datatype/behavior to honor:
Expected behavior
The server starts up, runs model warmup successfully, and waits for requests.