huggingface / optimum-tpu

Google TPU optimizations for transformers models
Apache License 2.0

`/health` endpoint not working properly #74

Closed Edwinhr716 closed 4 months ago

Edwinhr716 commented 4 months ago

I'm planning to use the /health endpoint for liveness and readiness probes in my Kubernetes deployments, but I've been running into issues.

This is the deployment I'm testing:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      volumes:
        - name: data-volume
          emptyDir: {}
      containers:
      - name: tgi-tpu
        image: {my_image}
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        securityContext:
          privileged: true
        env:
          - name: HF_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-secret
                key: hf_api_token
        ports:
        - containerPort: 80
        volumeMounts:
          - name: data-volume
            mountPath: /data
        resources:
          limits:
            google.com/tpu: 8
        livenessProbe:
          httpGet:
            path: /health
            port: 80
            scheme: HTTP
          initialDelaySeconds: 300
          periodSeconds: 120

---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080  
      targetPort: 80  
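
For the readiness side, this is the probe block I plan to add next to the livenessProbe above. It's only a sketch assuming the same /health endpoint on port 80; the periodSeconds and timeoutSeconds values are placeholders, not recommendations. Note that the kubelet's default probe timeout is 1 second, which is why a slow /health response surfaces as the "context deadline exceeded" error below:

        readinessProbe:
          httpGet:
            path: /health
            port: 80
            scheme: HTTP
          initialDelaySeconds: 300
          periodSeconds: 30
          timeoutSeconds: 5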

However, I get this error related to the /health endpoint:

$ kubectl describe pod tgi-tpu
...
Warning  Unhealthy  109s (x13 over 41m)  kubelet            Liveness probe failed: Get "http://10.60.7.24:80/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I also tested whether I could reach the endpoint using curl. When I make a /generate request first, /health returns successfully:

$ curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Love?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'
{"generated_text":"\n\nLove is a feeling of affection for someone or something.\n\nLove is a feeling of affection for someone or something.\n\nLove is a feeling of affection for someone or"} 
$ curl 127.0.0.1:8080/health     -X GET    
$ 

However, if I don't make a /generate request beforehand, the /health request never returns.
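
To reproduce the hang without curl blocking indefinitely, the request can be bounded with a timeout (the 5-second limit here is arbitrary); without a prior /generate call it times out instead of returning:

$ curl --max-time 5 127.0.0.1:8080/health -X GET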

Looking at the router code, it looks like this path is not working properly with Optimum TPU: https://github.com/huggingface/text-generation-inference/blob/main/router/src/infer/health.rs#L27

alvarobartt commented 4 months ago

Hi here @Edwinhr716, AFAIK this should have been fixed as of https://github.com/huggingface/optimum-tpu/pull/66, so any of:

should work if you rebuild the containers, thanks for flagging! Also see the original issue where this was reported at https://github.com/huggingface/optimum-tpu/issues/65#issuecomment-2196871340, and kudos to @tengomucho for solving those 👏🏻

tengomucho commented 4 months ago

Hi @Edwinhr716, @alvarobartt is right, we have put some effort into improving TGI robustness with optimum-tpu. The latest release should be the most solid one; let us know if you still see the issue.

Edwinhr716 commented 4 months ago

Upgrading it worked! Thanks for the help, I'll close the issue