huggingface / tgi-gaudi

Large Language Model Text Generation Inference on Habana Gaudi
http://hf.co/docs/text-generation-inference
Apache License 2.0

Clarification on past_key_values type for Starcoder #116

Closed: vidyasiv closed this issue 3 months ago

vidyasiv commented 5 months ago

System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi: N/A


Reproduction

docker run -p 8080:80 -v $VOLUME:/data --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all -e HUGGING_FACE_HUB_TOKEN=<> \
        -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --ipc=host \
        -e LOG_LEVEL=debug,text_generation_launcher=debug \
        -it --rm --entrypoint /bin/bash \
        ghcr.io/huggingface/tgi-gaudi:1.2.1 --model-id bigcode/starcoder

And the error is:

2024-04-02T21:35:26.131011Z DEBUG text_generation_launcher:   File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py", line 209, in gaudi_gpt_bigcode_model_forward
2024-04-02T21:35:26.131016Z DEBUG text_generation_launcher:     past_length = past_key_values[0].size(-2)
2024-04-02T21:35:26.131019Z DEBUG text_generation_launcher: AttributeError: 'tuple' object has no attribute 'size'
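
The error itself is easy to reproduce in isolation: a Python tuple has no .size() method, only the tensors inside it do. A minimal sketch (shapes are illustrative):

    import torch

    layer_cache = torch.zeros(4, 16, 128, 64)   # [batch, num_heads, seq_length, head_dim]
    print(layer_cache.size(-2))                  # 128 -> fine on a bare tensor

    layer_cache_as_tuple = (layer_cache, layer_cache)  # (key, value) wrapping
    layer_cache_as_tuple.size(-2)                # AttributeError: 'tuple' object has no attribute 'size'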

optimum-habana's gaudi_gpt_bigcode_model_forward() expects past_key_values to be a list of tensors, and in the run above it initially is: output.past_key_values received in the first forward pass here: https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/causal_lm.py#L782 is a list of tensors. Before the second pass, however, the attach_kv_cache() function (code: https://github.com/huggingface/tgi-gaudi/blob/habana-main/server/text_generation_server/models/causal_lm.py#L253) turns it into a list of tuples, which is incompatible with optimum-habana.

I can help fix this, but I need clarification on which data type/behavior to honor.
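
For illustration only, one shape such a fix could take; the helper name and flag below are hypothetical, and which layout to honor is exactly the open question:

    def attach_kv_cache_for_model(past_key_values, model_expects_tuples):
        # Hypothetical boundary helper: only re-wrap per-layer caches into
        # (key, value) tuples for models that actually expect that layout.
        if model_expects_tuples:
            # Llama-style: a (key, value) tuple per layer
            return [(key, value) for key, value in past_key_values]
        # GPT-BigCode-style: keep the flat list of per-layer tensors
        return list(past_key_values)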

Expected behavior

The server starts up, runs model warmup successfully, and waits for requests.

kdamaszk commented 5 months ago

@vidyasiv thank you for raising this issue. I think tgi-gaudi should support different types of KV cache. However, it is not only the data type that differs here, but also the overall tensor shape and content, am I right? At the moment tgi-gaudi assumes the KV cache has Llama's shape: a list of tuples with tensors of shape [batch_size, num_heads, seq_length, head_dim].
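
For concreteness, the two layouts side by side (a sketch; the GPT-BigCode layout reflects my reading of transformers' multi-query attention, where key and value are fused into one tensor per layer, so it is worth double-checking):

    import torch

    batch, num_heads, seq_length, head_dim, num_layers = 4, 16, 128, 64, 2

    # Llama-style cache, as tgi-gaudi currently assumes: a (key, value)
    # tuple per layer, each of shape [batch, num_heads, seq_length, head_dim]
    llama_cache = [
        (torch.zeros(batch, num_heads, seq_length, head_dim),
         torch.zeros(batch, num_heads, seq_length, head_dim))
        for _ in range(num_layers)
    ]

    # GPT-BigCode multi-query cache: one fused tensor per layer, key and
    # value concatenated on the last dim -> [batch, seq_length, 2 * head_dim]
    bigcode_cache = [torch.zeros(batch, seq_length, 2 * head_dim)
                     for _ in range(num_layers)]

    print(bigcode_cache[0].size(-2))   # 128 -> what gaudi_gpt_bigcode_model_forward reads
    print(llama_cache[0][0].size(-2))  # 128 -> only the tensors inside a tuple have .size()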

vidyasiv commented 5 months ago

Thanks for explaining! I am still confused by the KV-cache-related code living in two places, i.e. in both tgi-gaudi and optimum-habana. Is there a difference between what is implemented in each repository?

kdamaszk commented 5 months ago

What exactly is not clear? In tgi-gaudi there is only the alignment of data in the KV cache (shift-left operations, inserting new requests, etc.). Because of those operations, however, we cannot use the reuse-cache flow, as the KV cache has to be available from outside the model. Communication between tgi-gaudi and optimum-habana happens mostly through the forward() function and its input/output arguments (one of which is the KV cache).
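
Schematically, that contract looks roughly like this (a simplified sketch, not the actual causal_lm.py code; attach_kv_cache stands in for the real helper linked above, and step_inputs is a hypothetical iterable of per-step batches):

    def decode_loop(model, step_inputs, attach_kv_cache):
        # tgi-gaudi holds the KV cache between decoding steps and threads
        # it through the model's forward().
        past_key_values = None
        for input_ids in step_inputs:
            output = model(
                input_ids=input_ids,
                past_key_values=past_key_values,
                use_cache=True,
            )
            # Realignment (shift-left, new-request insertion) happens here;
            # this is where each layer currently gets re-wrapped into a
            # (key, value) tuple, regardless of the model's expected layout.
            past_key_values = attach_kv_cache(output.past_key_values)
        return past_key_values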