huggingface / optimum-tpu

Google TPU optimizations for transformers models

Issue getting Llama3 8b running on GKE #43

Open francescov1 opened 4 months ago

francescov1 commented 4 months ago

I'm trying to deploy Llama3 8B on GKE using optimum-tpu, but I'm running into some trouble.

I'm following the instructions here: https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference. I built the Docker image using the make command mentioned there.
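For context, the steps I ran were roughly the following (the local image name produced by make and my registry path are placeholders, not exact values):

```bash
# Build the TGI TPU image from the optimum-tpu repo (make target per the repo docs)
make tpu-tgi

# Tag the resulting image and push it to Artifact Registry so GKE can pull it
# (<local-tgi-image> and the registry path are illustrative placeholders)
docker tag <local-tgi-image> us-central1-docker.pkg.dev/<project>/tpus/optimum-tpu:latest
docker push us-central1-docker.pkg.dev/<project>/tpus/optimum-tpu:latest
```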

The server starts booting up, but gets stuck at "Warming up model". See the logs below:

2024-05-24T17:26:26.309789Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_c
2024-05-24T17:26:26.309895Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-24T17:26:26.400493Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-24T17:26:26.400639Z  INFO download: text_generation_launcher: Starting download process.
2024-05-24T17:26:26.475982Z  WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.

2024-05-24T17:26:51.727997Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-24T17:26:51.728345Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-24T17:26:54.273164Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2024-05-24T17:26:54.332635Z  INFO shard-manager: text_generation_launcher: Shard ready in 2.603384915s rank=0
2024-05-24T17:26:54.431655Z  INFO text_generation_launcher: Starting Webserver
2024-05-24T17:26:54.453486Z  INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-05-24T17:26:54.453528Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-05-24T17:26:54.739323Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_151|>' w

... (lots more tokenizer warnings, same as the ones above and below)

2024-05-24T17:26:54.739610Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved_special_token_250|>' w
2024-05-24T17:26:54.866449Z  INFO text_generation_router: router/src/main.rs:471: Serving revision 62bd457b6fe961a42a631306577e622c83876cb6 of model meta-llama/Meta-Llama-3-8B
2024-05-24T17:26:54.866479Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Llama)
2024-05-24T17:26:54.866493Z  INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-05-24T17:28:23.784610Z  INFO text_generation_router: router/src/main.rs:314: Warming up model

Here's my config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimum-tpu-llama3-8b-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: optimum-tpu-llama3-8b-server
  template:
    metadata:
      labels:
        app: optimum-tpu-llama3-8b-server
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      hostIPC: true
      containers:
        - name: optimum-tpu-llama3-8b-server
          image: us-central1-docker.pkg.dev/project-lighthouse-403916/tpus/optimum-tpu:latest
          securityContext:
            privileged: true
          args:
            - "--model-id=meta-llama/Meta-Llama-3-8B"
            - "--max-concurrent-requests=1"
            - "--max-input-length=512"
            - "--max-total-tokens=1024"
            - "--max-batch-prefill-tokens=512"
            - "--max-batch-total-tokens=1024"
          env:
            - name: HF_TOKEN
              value: <token>
            - name: HUGGING_FACE_HUB_TOKEN
              value: <token>
            - name: HF_BATCH_SIZE
              value: "1"
            - name: HF_SEQUENCE_LENGTH
              value: "1024"
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data-volume
              mountPath: /data
          resources:
            requests:
              google.com/tpu: 8
            limits:
              google.com/tpu: 8
      volumes:
        - name: data-volume
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: optimum-tpu-llama3-8b-svc
spec:
  selector:
    app: optimum-tpu-llama3-8b-server
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80

Any ideas?

tengomucho commented 4 months ago

Hi Francesco, sorry we didn't have the chance to answer earlier... we'll look into this and get back to you soon!

carlesoctav commented 3 months ago

any updates?

tengomucho commented 3 months ago

I just re-tried this with llama3-8b and it worked fine, but I tested with lower values for the input length and total tokens. With these settings the server takes ~15s to warm up. Can you retry with --max-input-length 32 --max-total-tokens 64?
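In your Deployment manifest that would roughly mean changing the container args like this (a sketch, not a tested config: only the two flags above are lowered, the rest kept as in your original manifest):

```yaml
args:
  - "--model-id=meta-llama/Meta-Llama-3-8B"
  - "--max-concurrent-requests=1"
  - "--max-input-length=32"
  - "--max-total-tokens=64"
  - "--max-batch-prefill-tokens=512"
  - "--max-batch-total-tokens=1024"
```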

francescov1 commented 3 months ago

@tengomucho Unfortunately that didn't work. I used the same manifests as above with the changes you mentioned. I also rebuilt the docker image with the latest changes from main.

What TPU are you running on? Is it possible that the v5e node is not big enough, and it's unable to use multiple nodes? I can try on a v5p if that's better.

tengomucho commented 3 months ago

I tried on a v5e-litepod8. The only difference I would say is that I did not use GKE; I used the Docker container generated by make tpu-tgi, as explained here.

francescov1 commented 3 months ago

Hmm, I don't see why my K8s config would behave any differently from that.

Is there a prebuilt public Docker image I can test out?

tengomucho commented 3 months ago

Let me cook one for you, I'll do it on Monday and I'll get back to you.

rick-c-goog commented 3 months ago

Any update on this? I had the same issue with GKE; none of the Hugging Face models work (gemma-2b, mistral, llama, etc.). No errors in the logs either, it just hangs at INFO: Warming up model for gemma.

For Mistral it's a little bit different:

2024-06-23T00:48:10.071293Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:10.199181Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-23T00:48:10.199294Z INFO download: text_generation_launcher: Starting download process.
2024-06-23T00:48:10.272564Z WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-06-23T00:48:56.746082Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-23T00:48:56.791824Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-23T00:48:59.480818Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-23T00:48:59.495306Z INFO shard-manager: text_generation_launcher: Shard ready in 2.702693453s rank=0
2024-06-23T00:48:59.548993Z INFO text_generation_launcher: Starting Webserver
2024-06-23T00:48:59.554356Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-06-23T00:48:59.554399Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:59.727654Z WARN text_generation_router: router/src/main.rs:233: Could not retrieve model info from the Hugging Face hub.
2024-06-23T00:48:59.770889Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Mistral)
2024-06-23T00:48:59.770904Z WARN text_generation_router: router/src/main.rs:298: no pipeline tag found for model mistralai/Mistral-7B-v0.3

rick-c-goog commented 3 months ago

At the same time, I was able to run the following example test inside the GKE pod that was created: https://github.com/huggingface/optimum-tpu/blob/main/examples/text-generation/generation.py
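Roughly like this (the pod name is a placeholder; this assumes the optimum-tpu sources are available inside the container image):

```bash
# Open a shell inside the running TGI pod (pod name is illustrative)
kubectl exec -it tgi-tpu-<pod-hash> -- /bin/bash

# From inside the pod, run the standalone text-generation example from the repo
# (see the linked file for any arguments it expects)
python examples/text-generation/generation.py
```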

rick-c-goog commented 3 months ago

@tengomucho, any comment on the optimum-tpu-on-GKE issues, or potentially a public image?

tengomucho commented 3 months ago

Hey, sorry it took me longer to get this done, but you should be able to test this TGI image huggingface/optimum-tpu:v0.1.1-tgi.

rick-c-goog commented 3 months ago

Thank you, @tengomucho. It got stuck/hung at the same "Warming up model" step:

2024-06-25 11:12:01.541 EDT {fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.541489Z}
2024-06-25 11:12:01.541 EDT {fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.541603Z}
2024-06-25 11:12:01.628 EDT {fields: {…}, level: WARN, target: text_generation_launcher, timestamp: 2024-06-25T15:12:01.628394Z}
2024-06-25 11:12:12.752 EDT {fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:12.752135Z}
2024-06-25 11:12:12.752 EDT {fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:12.752408Z}
2024-06-25 11:12:15.687 EDT {fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.687254Z}
2024-06-25 11:12:15.756 EDT {fields: {…}, level: INFO, span: {…}, spans: […], target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.756244Z}
2024-06-25 11:12:15.855 EDT {fields: {…}, level: INFO, target: text_generation_launcher, timestamp: 2024-06-25T15:12:15.855187Z}
2024-06-25 11:12:15.861 EDT Using the Hugging Face API
2024-06-25 11:12:15.862 EDT Token file not found "/root/.cache/huggingface/token"
2024-06-25 11:12:16.568 EDT Could not retrieve model info from the Hugging Face hub.
2024-06-25 11:12:16.585 EDT Using config Some(Gemma)
2024-06-25 11:12:16.585 EDT Using the Hugging Face API to retrieve tokenizer config
2024-06-25 11:12:16.587 EDT no pipeline tag found for model google/gemma-2b-it
2024-06-25 11:13:03.877 EDT Warming up model

tengomucho commented 3 months ago

Umh strange, I just tested it and it worked fine. I tested with this command line BTW:

HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b

sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                ghcr.io/huggingface/optimum-tpu:v0.1.1-tgi \
                --model-id ${MODEL_ID} \
                --max-concurrent-requests 4 \
                --max-input-length 32 \
                --max-total-tokens 64 \
                --max-batch-size 1

And it took ~12s to warm up:

2024-06-25T15:56:14.798018Z  WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-06-25T15:57:47.220655Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-25T15:57:54.872585Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64

rick-c-goog commented 3 months ago

I believe it is GKE-specific.

francescov1 commented 3 months ago

@tengomucho I'm seeing the same thing. I retried the deployment manifest I pasted above, but with the image huggingface/optimum-tpu:v0.1.1-tgi, and I'm still getting the same behavior.

liurupeng commented 3 months ago

this one works for me:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      volumes:
        - name: data-volume
          emptyDir: {}
      containers:
      - name: tgi-tpu
        image: {optimum-tpu-image}
        args:
        - --model-id=google/gemma-2b
        - --max-concurrent-requests=4
        - --max-input-length=32
        - --max-total-tokens=64
        - --max-batch-size=1
        securityContext:
            privileged: true
        env:
          - name: HF_TOKEN
            value: {your_token}
          - name: HUGGING_FACE_HUB_TOKEN
            value: {your_token}
        ports:
        - containerPort: 80
        volumeMounts:
            - name: data-volume
              mountPath: /data
        resources:
          requests:
            google.com/tpu: 8
          limits:
            google.com/tpu: 8
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080  
      targetPort: 80

rick-c-goog commented 3 months ago

Thanks, @liurupeng. I got the following logs:

2024-06-27T02:43:50.500866Z  INFO shard-manager: text_generation_launcher: Shard ready in 2.703506822s rank=0
2024-06-27T02:43:50.599561Z  INFO text_generation_launcher: Starting Webserver
2024-06-27T02:43:50.611767Z  INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-06-27T02:43:50.611800Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-27T02:43:51.329230Z  INFO text_generation_router: router/src/main.rs:471: Serving revision 2ac59a5d7bf4e1425010f0d457dde7d146658953 of model google/gemma-2b
2024-06-27T02:43:51.329250Z  INFO text_generation_router: router/src/main.rs:253: Using config Some(Gemma)
2024-06-27T02:43:51.329254Z  INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-06-27T02:44:48.962935Z  INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-27T02:44:55.038381Z  INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-06-27T02:44:55.038396Z  INFO text_generation_router: router/src/main.rs:352: Connected
2024-06-27T02:44:55.038401Z  WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0

So I assume the TGI model should be up and running, but the curl validation command throws a connection refused error (I tried both container port 80 and 8000):

kubectl run -it busybox --image radial/busyboxplus:curl
If you don't see a command prompt, try pressing enter.
[ root@busybox:/ ]$ curl 34.118.229.124:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
curl: (7) Failed to connect to 34.118.229.124 port 8080: Connection refused
[ root@busybox:/ ]$

Did you try the curl connection to validate?

liurupeng commented 3 months ago

@rick-c-goog I ran the commands below:

kubectl port-forward svc/service 8080:8080

curl 127.0.0.1:8080/generate     -X POST     -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}'     -H 'Content-Type: application/json'

rick-c-goog commented 3 months ago

Thanks, @liurupeng. The port-forward curl to 127.0.0.1 works, and afterwards the busybox curl to the service cluster IP works as well.
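For anyone following along, the in-cluster check was roughly this (the service name and port come from the manifest above; output omitted):

```bash
# One-off pod that curls the TGI Service by its in-cluster DNS name
# ("service" and port 8080 per the manifest above; same generate payload as before)
kubectl run -it curl-test --image=radial/busyboxplus:curl --rm --restart=Never -- \
  curl http://service:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'
```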

Bihan commented 3 months ago

@tengomucho I am testing optimum-tpu with v2-8 and getting issues similar to those discussed above. Does optimum-tpu only support v5e-litepod?

tengomucho commented 3 months ago

@Bihan For now we have only tested v5e configurations.

Bihan commented 3 months ago

> @Bihan For now we have only tested v5e configurations.

@tengomucho Thank you for the quick reply. Do you think testing with v2-8 or v3-8 would require major modifications?