Open francescov1 opened 4 months ago
I'm trying to deploy Llama3 8b on GKE using optimum-tpu but running into some trouble.
Following the instructions here: https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference, I built the docker image using the make command mentioned.
The server starts booting up but gets stuck at "Warming up model". See logs below:
Here's my config:
Any ideas?
Hi Francesco, Sorry we didn't have the chance to answer earlier... we'll be looking at this and get back to you soon!
any updates?
I just re-tried this with llama3-8b and it worked fine, but I tested with a lower input length and total token count. With these settings the server takes ~15s to warm up. Can you retry with --max-input-length 32 --max-total-tokens 64?
@tengomucho Unfortunately that didn't work. I used the same manifests as above with the changes you mentioned. I also rebuilt the docker image with the latest changes from main.
What TPU are you running on? Is it possible that the v5e node is not big enough, and it's unable to use multiple nodes? I can try on a v5p if that's better
I tried on a v5e-litepod8. The only difference I would say is that I did not use GKE; I used the docker container generated by make tpu-tgi, as explained here.
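For reference, a rough sketch of that flow on the TPU VM (the image tag is an assumption, check what make tpu-tgi tags the build as; the run flags mirror the suggestion above):
# Build the TGI TPU image from the optimum-tpu repo
make tpu-tgi
# Run it locally; <tpu-tgi-image> stands for whatever tag the build produced
sudo docker run --net=host --privileged \
  -v $(pwd)/data:/data \
  -e HF_TOKEN=${HF_TOKEN} \
  <tpu-tgi-image> \
  --model-id meta-llama/Meta-Llama-3-8B \
  --max-input-length 32 \
  --max-total-tokens 64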
hmm I don't see why my K8s config would be any different to that.
Is there a prebuilt public Docker image I can test out?
Let me cook one for you, I'll do it on Monday and I'll get back to you.
Any update on this? I had the same issue with GKE; none of the Hugging Face models work (gemma-2b, mistral, llama, etc.). No error in the logs either, it just hangs at INFO: Warming up model for gemma.
For Mistral it's a little different:
2024-06-23T00:48:10.071293Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:10.199181Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-23T00:48:10.199294Z INFO download: text_generation_launcher: Starting download process.
2024-06-23T00:48:10.272564Z WARN text_generation_launcher: 'extension' argument is not supported and will be ignored.
2024-06-23T00:48:56.746082Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-23T00:48:56.791824Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-23T00:48:59.480818Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-23T00:48:59.495306Z INFO shard-manager: text_generation_launcher: Shard ready in 2.702693453s rank=0
2024-06-23T00:48:59.548993Z INFO text_generation_launcher: Starting Webserver
2024-06-23T00:48:59.554356Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-06-23T00:48:59.554399Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-23T00:48:59.727654Z WARN text_generation_router: router/src/main.rs:233: Could not retrieve model info from the Hugging Face hub.
2024-06-23T00:48:59.770889Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Mistral)
2024-06-23T00:48:59.770904Z WARN text_generation_router: router/src/main.rs:298: no pipeline tag found for model mistralai/Mistral-7B-v0.3
At the same time, I was able to run the following example test inside the created GKE pod: https://github.com/huggingface/optimum-tpu/blob/main/examples/text-generation/generation.py
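For anyone reproducing that sanity check inside the pod, a minimal sketch (the exact invocation is an assumption; the script may take extra arguments, see the file itself):
# Run the repo's plain-generation example, bypassing TGI entirely
git clone https://github.com/huggingface/optimum-tpu.git
cd optimum-tpu
python examples/text-generation/generation.py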
@tengomucho, any comment on the optimum-tpu GKE issues, or a potential public image?
Hey, sorry it took me longer to get this done, but you should be able to test this TGI image: huggingface/optimum-tpu:v0.1.1-tgi.
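For example, to pull it from the GitHub registry (same path as in the run command below):
docker pull ghcr.io/huggingface/optimum-tpu:v0.1.1-tgi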
Thank you, @tengomucho. It got stuck/hung at the same "Warming up model" step:
(several collapsed Cloud Logging entries between 11:12:01 and 11:12:15 EDT omitted; no message text was visible)
2024-06-25 11:12:15.861 EDT Using the Hugging Face API
2024-06-25 11:12:15.862 EDT Token file not found "/root/.cache/huggingface/token"
2024-06-25 11:12:16.568 EDT Could not retrieve model info from the Hugging Face hub.
2024-06-25 11:12:16.585 EDT Using config Some(Gemma)
2024-06-25 11:12:16.585 EDT Using the Hugging Face API to retrieve tokenizer config
2024-06-25 11:12:16.587 EDT no pipeline tag found for model google/gemma-2b-it
2024-06-25 11:13:03.877 EDT Warming up model
Umh strange, I just tested it and it worked fine. I tested with this command line BTW:
HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b
sudo docker run --net=host \
--privileged \
-v $(pwd)/data:/data \
-e HF_TOKEN=${HF_TOKEN} \
ghcr.io/huggingface/optimum-tpu:v0.1.1-tgi \
--model-id ${MODEL_ID} \
--max-concurrent-requests 4 \
--max-input-length 32 \
--max-total-tokens 64 \
--max-batch-size 1
And it took ~12s to warm up:
2024-06-25T15:56:14.798018Z WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-06-25T15:57:47.220655Z INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-25T15:57:54.872585Z INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
I believe it is GKE-specific.
@tengomucho I'm seeing the same thing. I retried the deployment manifest I pasted above with the image huggingface/optimum-tpu:v0.1.1-tgi and I'm still getting the same behavior.
this one works for me:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      hostNetwork: true
      volumes:
        - name: data-volume
          emptyDir: {}
      containers:
        - name: tgi-tpu
          image: {optimum-tpu-image}
          args:
            - --model-id=google/gemma-2b
            - --max-concurrent-requests=4
            - --max-input-length=32
            - --max-total-tokens=64
            - --max-batch-size=1
          securityContext:
            privileged: true
          env:
            - name: HF_TOKEN
              value: {your_token}
            - name: HUGGING_FACE_HUB_TOKEN
              value: {your_token}
          ports:
            - containerPort: 80
          volumeMounts:
            - name: data-volume
              mountPath: /data
          resources:
            requests:
              google.com/tpu: 8
            limits:
              google.com/tpu: 8
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  ports:
    - name: http
      protocol: TCP
      port: 8080
      targetPort: 80
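To roll it out and watch the warmup, a minimal sketch (the manifest filename is an assumption; the deployment name matches the manifest above):
# Apply both the Deployment and the Service, then follow the TGI logs
kubectl apply -f tgi-tpu.yaml
kubectl logs -f deployment/tgi-tpu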
Thanks, @liurupeng, I got the following logs:
2024-06-27T02:43:50.500866Z INFO shard-manager: text_generation_launcher: Shard ready in 2.703506822s rank=0
2024-06-27T02:43:50.599561Z INFO text_generation_launcher: Starting Webserver
2024-06-27T02:43:50.611767Z INFO text_generation_router: router/src/main.rs:185: Using the Hugging Face API
2024-06-27T02:43:50.611800Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-27T02:43:51.329230Z INFO text_generation_router: router/src/main.rs:471: Serving revision 2ac59a5d7bf4e1425010f0d457dde7d146658953 of model google/gemma-2b
2024-06-27T02:43:51.329250Z INFO text_generation_router: router/src/main.rs:253: Using config Some(Gemma)
2024-06-27T02:43:51.329254Z INFO text_generation_router: router/src/main.rs:265: Using the Hugging Face API to retrieve tokenizer config
2024-06-27T02:44:48.962935Z INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-06-27T02:44:55.038381Z INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-06-27T02:44:55.038396Z INFO text_generation_router: router/src/main.rs:352: Connected
2024-06-27T02:44:55.038401Z WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0
So I assume the TGI model should be up and running, but the curl validation command throws a connection refused error (I tried both container port 80 and 8000):
kubectl run -it busybox --image radial/busyboxplus:curl
If you don't see a command prompt, try pressing enter.
[ root@busybox:/ ]$ curl 34.118.229.124:8080/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
  -H 'Content-Type: application/json'
curl: (7) Failed to connect to 34.118.229.124 port 8080: Connection refused
Did you try the curl connection to validate?
@rick-c-goog I ran the below command:
kubectl port-forward svc/service 8080:8080
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' -H 'Content-Type: application/json'
Thanks, @liurupeng, the port-forward curl to 127.0.0.1 works, and the busybox curl to the service cluster IP works afterwards as well.
@tengomucho I am testing optimum-tpu with v2-8 and getting similar issues to those discussed above. Does optimum-tpu only support v5e-litepod?
@Bihan For now we have only tested v5e configurations.
@tengomucho Thank you for the quick reply. Do you think testing with v2-8 or v3-8 would require major modifications?