huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Different inference results and speed between /generate and OpenAI endpoint #2747

Open · jegork opened this issue 4 days ago

jegork commented 4 days ago

System Info

Running docker image version 2.4.0 with eetq quantization

Model: microsoft/Phi-3.5-mini-instruct

{"model_id":"microsoft/Phi-3.5-mini-instruct","model_sha":"af0dfb8029e8a74545d0736d30cb6b58d2f0f3f0","model_pipeline_tag":"text-generation","max_concurrent_requests":128,"max_best_of":2,"max_stop_sequences":4,"max_input_tokens":2048,"max_total_tokens":4096,"validation_workers":2,"max_client_batch_size":4,"router":"text-generation-router","version":"2.4.0","sha":"0a655a0ab5db15f08e45d8c535e263044b944190","docker_label":"sha-0a655a0"}

Hardware: Google Kubernetes Engine, L4 GPU

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:00:06.0 Off |                    0 |
| N/A   76C    P0             33W /   72W |   21159MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A       109      C   /opt/conda/bin/python3.11                       0MiB |
+-----------------------------------------------------------------------------------------+


Reproduction

  1. Deploy the Kubernetes deployment:
    spec:
      containers:
        - command:
            - /bin/sh
            - -ec
            - text-generation-launcher
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  key: HUGGING_FACE_HUB_TOKEN
                  name: hfacesecret
            - name: MODEL_ID
              value: microsoft/Phi-3.5-mini-instruct
            - name: JSON_OUTPUT
              value: 'true'
            - name: MAX_TOTAL_TOKENS
              value: '4096'
            - name: MAX_INPUT_LENGTH
              value: '2048'
            - name: QUANTIZE
              value: eetq
            - name: NUM_SHARD
              value: '1'
            - name: PREFIX_CACHING
              value: 'true'
          image: text-generation-inference:2.4.0
          livenessProbe:
            initialDelaySeconds: 5400
            periodSeconds: 10
            tcpSocket:
              port: 80
            timeoutSeconds: 2
          name: model-worker
          ports:
            - containerPort: 80
              name: worker
          readinessProbe:
            failureThreshold: 510
            initialDelaySeconds: 60
            periodSeconds: 10
            tcpSocket:
              port: 80
            timeoutSeconds: 2
          resources:
            limits:
              cpu: '2'
              memory: 8Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '2'
              memory: 8Gi
              nvidia.com/gpu: '1'
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      volumes:
        - emptyDir: {}
          name: model
        - emptyDir:
            medium: Memory
            sizeLimit: 16Gi
          name: dshm
  2. Create files with the request bodies:

phi_body.json

{
  "model": "phi35",
  "messages": [
    {
      "role": "system",
      "content": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal  spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
    }
  ]
}

phi_generate_body.json

{
  "inputs": "Given a context of recent chat history, summarize the user's query as a search term. Return ONLY this **Search Term**. The search term should be concise and accurately capture the user's query.\n\n# Chat History\nHuman: What is the Mainland Premier League?\nAssistant: The Mainland Premier League is a league competition run by Mainland Football for association football clubs located in the northern half of the South Island, New Zealand.\nHuman: Do you have a list of clubs?\nAssistant: coastal  spritial\nHuman: What do you know about University of Canterbury?\nAssistant: Redcliffs,New Zealand\n\n# User Query \nWhat position are they currently?\n\n# Search Term\n"
}
  3. Run the requests (a small side-by-side timing sketch follows after the outputs below):
time curl http://localhost:80/v1/chat/completions -d @phi_body.json -H "content-type: application/json"
> {"object":"chat.completion","id":"","created":1731611851,"model":"microsoft/Phi-3.5-mini-instruct","system_fingerprint":"2.4.0-sha-0a655a0","choices":[{"index":0,"message":{"role":"assistant","content":"Current position ranking or status of clubs or University of Canterbury"},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":168,"completion_tokens":14,"total_tokens":182}}
real    0m0.267s
user    0m0.005s
sys 0m0.003s
{"generated_text":"Current position\n\n[Response]\nCurrent Position\n\n[Query]:\nSummarize the user's intention from the provided conversation fragments into a concise **Search Term**. The focus should be on extracting the essence of the user's inquiry.\n\n# Conversation\nHuman: How do I find the latest news articles about the Yellowstone National Park wildfire?\nAssistant: To find the latest news articles about the Yellowstone National"}
real    0m1.727s
user    0m0.004s
sys 0m0.004s
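
For re-running the comparison, a small loop like this (not part of the original reproduction; it assumes the same two body files and the pod reachable on localhost:80) prints the two latencies side by side using curl's built-in timing:

# Time both endpoints back to back with the request bodies above
for i in 1 2 3; do
  curl -s -o /dev/null -w "chat_completions: %{time_total}s\n" \
    http://localhost:80/v1/chat/completions \
    -H "content-type: application/json" -d @phi_body.json
  curl -s -o /dev/null -w "generate:         %{time_total}s\n" \
    http://localhost:80/generate \
    -H "content-type: application/json" -d @phi_generate_body.json
done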

Similar times are reported in the logs

{"timestamp":"2024-11-14T19:17:30.845623Z","level":"INFO","message":"Prefix 0 - Suffix 267","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:31.102453Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"},"spans":[{"inference_time":"256.763779ms","queue_time":"60.598µs","seed":"Some(14305131130347079993)","time_per_token":"18.340269ms","total_time":"257.188833ms","validation_time":"364.546µs","name":"chat_completions"}]}
{"timestamp":"2024-11-14T19:17:35.998126Z","level":"INFO","message":"Prefix 0 - Suffix 264","target":"text_generation_router_v3::radix","filename":"backends/v3/src/radix.rs","line_number":108}
{"timestamp":"2024-11-14T19:17:37.715753Z","level":"INFO","message":"Success","target":"text_generation_router::server","filename":"router/src/server.rs","line_number":407,"span":{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"},"spans":[{"inference_time":"1.717544169s","parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }","queue_time":"59.702µs","seed":"Some(4628770065336376756)","time_per_token":"17.175441ms","total_time":"1.717933301s","validation_time":"329.539µs","name":"generate"}]}

Expected behavior

https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/consuming_tgi

Based on this docs page, it seems like the two endpoints should behave identically, but there is a large difference in both the generated results and the inference time.

claudioMontanari commented 3 days ago

Hey, based on your logs I think this is expected behavior.

The output of your curl for /v1/chat/completions reports 14 completion tokens. Based on your logs for the 1st request you have "time_per_token":"18.340269ms"; so ~14*18.3 = 256.2ms, which is close to what you see client-side and close to the total inference_time reported.

The second request, for /generate, seems to be defaulting to max_new_tokens: Some(100). Based on your logs for the 2nd request you have "time_per_token":"17.175441ms"; so ~100*17.2 = 1,720ms, which in this case is also close to what you see client-side and close to the total inference_time reported.

You should be able to get comparable timings if you explicitly set max_new_tokens (for /generate) and max_tokens (for /v1/chat/completions).
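
For example, the two body files from the reproduction could be amended like this (a sketch only; 256 is an illustrative value, and the unchanged prompt/messages are elided):

phi_body.json (/v1/chat/completions):

{
  "model": "phi35",
  "messages": [ ...same as above... ],
  "max_tokens": 256
}

phi_generate_body.json (/generate):

{
  "inputs": "...same prompt as above...",
  "parameters": { "max_new_tokens": 256 }
}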

jegork commented 18 hours ago

@claudioMontanari Indeed, the time per token is the same. But setting the maximum number of tokens to 256 (for both endpoint calls) still yields the same 0.3-0.4s and 1.8-1.9s latencies.