huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Kubernetes deployment launcher process hanging #976

Closed. rickvanveen closed this issue 3 months ago.

rickvanveen commented 10 months ago

System Info

Using ghcr.io/huggingface/text-generation-inference:latest, but the same issue occurs with 0.9 and 1.0.2. Trying to deploy with model_id "tiiuae/falcon-7b-instruct".

Reproduction

My kubernetes deployment and pvc config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-falcon-7b-instruct-deployment
  labels:
    app: tgi-falcon-7b-instruct
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-falcon-7b-instruct
  template:
    metadata:
      labels:
        app: tgi-falcon-7b-instruct
    spec:
      volumes:
        - name: tgi-falcon-7b-instruct-data
          persistentVolumeClaim:
            claimName: tgi-falcon-7b-instruct-pvc
      containers:
      - name: tgi-falcon-7b-instruct
        image: ghcr.io/huggingface/text-generation-inference:latest
        args: [ "--model-id",  "tiiuae/falcon-7b-instruct", "--num-shard",  "1" ]
        volumeMounts:
          - mountPath: /data
            name: tgi-falcon-7b-instruct-data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tgi-falcon-7b-instruct-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi

After applying this, the only logs I get from the container are these:

2023-09-04T08:55:39.932658Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-7b-instruct", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "tgi-falcon-7b-instruct-deployment-58d4486974-cdvnz", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }
2023-09-04T08:55:39.932793Z  INFO download: text_generation_launcher: Starting download process.
2023-09-04T08:55:44.934529Z  WARN text_generation_launcher: No safetensors weights found for model tiiuae/falcon-7b-instruct at revision None. Downloading PyTorch weights.

2023-09-04T08:55:45.077346Z  INFO text_generation_launcher: Download file: pytorch_model-00001-of-00002.bin

The container does download a tmpxxxx file with the same size as the model file, but then it just stops. The process is still running though:

root@tgi-falcon-7b-instruct-deployment-58d4486974-cdvnz:/data# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 08:55 ?        00:00:00 text-generation-launcher --model-id tiiuae/falcon-7b-instruct --num-shard 1
root         7     1  2 08:55 ?        00:01:20 /opt/conda/bin/python3.9 /opt/conda/bin/text-generation-server download-weights tiiuae/falcon-7b-instruct --extension .safetensors --logger-level INFO --json-outp
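
If it helps with debugging, here is a minimal sketch of how I poke at the stuck download from inside the pod (the pod name is the one from the logs above; treat the DEBUG logger level as an assumption on my part, the launcher itself passes INFO):

kubectl exec -it tgi-falcon-7b-instruct-deployment-58d4486974-cdvnz -- /bin/bash

# inside the pod: see what the downloader has written to the cache so far
ls -lh /data

# re-run the download/conversion step by hand to watch its output
text-generation-server download-weights tiiuae/falcon-7b-instruct --logger-level DEBUG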

Expected behavior

I expect the container to download the model and serve it.

Narsil commented 10 months ago

It downloads the weights and converts them to safetensors. Depending on your disk speed (and RAM + swap), this can take a while.

safetensors is required for sharding, and to ease maintenance the conversion is done by default. If you're using some cache, this step only happens once though.

And you should have some logs about what is happening.

You could try -e HF_HUB_ENABLE_HF_TRANSFER=0 to disable the fast download and see what's happening during the download.
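
If memory is the bottleneck during the conversion, it can also help to give the container explicit memory. A rough sketch for the container spec (the numbers are guesses for a ~7B checkpoint, not a recommendation):

        # rough guesses for a ~7B checkpoint; size these to your nodes
        resources:
          requests:
            memory: "16Gi"
          limits:
            memory: "32Gi"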

rickvanveen commented 10 months ago

I waited over the weekend; the pod was online for 3 days. I cannot imagine it would take that long.

Thank you for the suggestion; I added it like so:

env:
  - name: HF_HUB_ENABLE_HF_TRANSFER
    value: "0"

to my deployment. I'm not getting any additional output in the logs.

rickvanveen commented 10 months ago

I have to correct my earlier statement that it wasn't working with version 0.9: it does work with 0.9.3 and 0.9.4, but not with latest, 1.0.1, or 1.0.2. Another twist: I tried this with a smaller model, "google/flan-t5-small", and that worked, but the falcon models still have the same issue.
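
For reference, a minimal sketch of the deployment pinned to one of the tags that works for me instead of :latest:

      containers:
      - name: tgi-falcon-7b-instruct
        # 0.9.3 and 0.9.4 both work for me; latest, 1.0.1, and 1.0.2 hang
        image: ghcr.io/huggingface/text-generation-inference:0.9.4
        args: [ "--model-id", "tiiuae/falcon-7b-instruct", "--num-shard", "1" ]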

OlivierDehaene commented 10 months ago

What is the IO throughput of the disk associated with your PVC's StorageClass?
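
You can get a rough idea with something like this (the dd write is only a crude sanity check, and the pod name is the one from your logs):

# which StorageClass backs the claim
kubectl get pvc tgi-falcon-7b-instruct-pvc -o jsonpath='{.spec.storageClassName}{"\n"}'

# crude write-throughput check against the PVC mount from inside the pod
kubectl exec -it tgi-falcon-7b-instruct-deployment-58d4486974-cdvnz -- \
  dd if=/dev/zero of=/data/ddtest bs=1M count=1024 oflag=direct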

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.