Edwinhr716 opened this issue 5 months ago
Hi @Edwinhr716, we're investigating this and will get back to you!
Any update on this? Why won't any fine-tuned versions be supported?
The error is:
2024-07-02T18:54:54.121182Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
So it looks like you are missing a Hugging Face token. Did you run huggingface-cli login --token [TOKEN]?
I don't think it is related to that log. I've had the same log show up in successful deployments of Optimum TPU. Since I'm building the image with Docker, would running that in my terminal work anyway? I already pass the token as an environment variable.
I also want to add that I did not change anything in the Dockerfile; all I did to build the image was run make tpu-tgi.
It needs to be run in the same container as tgi-tpu. Try adding that as an init-container.
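For what it's worth, huggingface-cli login essentially just persists the token to the cache file that hf_hub is complaining about, so an init step could do the same in a couple of lines. This is a minimal sketch, assuming huggingface_hub is installed and the token is exposed as an HF_TOKEN environment variable; note that for an init container this only helps if /root/.cache/huggingface sits on a volume shared with the TGI container:

import os
from huggingface_hub import login

# Writes the token to the local Hugging Face cache (the same file that
# `huggingface-cli login --token ...` creates), so later hf_hub lookups find it.
login(token=os.environ["HF_TOKEN"], add_to_git_credential=False)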
@Edwinhr716 could you share the full logs? I think you'd mentioned over chat that you saw some warning like this:
Could not find a fast tokenizer implementation for Trendyol/Trendyol-LLM-7b-base-v0.1
Those are the full logs. That was the initial issue that I was facing, but I circumvented it by duplicating the Trendyol repo and adding the Llama2 tokenizer here: https://huggingface.co/Edwinhr716/Trendyol-LLM-7b-chat-v0.1-duplicate/tree/main. After doing that, I get the issue mentioned here.
Based on the comment in the Slack channel (https://huggingface.slack.com/archives/C06GAFTA5AN/p1721259640899439), it looks like this may be due to a known issue with the TGI container serving fine-tuned models?
I don't think so. I just tested deploying ICTNLP/Llama-2-7b-chat-TruthX with the TGI 2.0.2 release on GPU and it worked. These are the log messages from that TGI deployment:
2024-07-18T17:34:52.776080Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-07-18T17:34:52.776138Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-18T17:34:53.048994Z INFO text_generation_router: router/src/main.rs:474: Serving revision 2d186e966af6eaa237495a39433a6f6d7de3ad9e of model ICTNLP/Llama-2-7b-chat-TruthX
2024-07-18T17:34:53.087825Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Llama)
2024-07-18T17:34:53.090733Z INFO text_generation_router: router/src/main.rs:317: Warming up model
2024-07-18T17:34:53.953134Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-07-18T17:34:54.490155Z INFO text_generation_router: router/src/main.rs:354: Setting max batch total tokens to 125856
2024-07-18T17:34:54.490179Z INFO text_generation_router: router/src/main.rs:355: Connected
2024-07-18T17:34:54.490183Z WARN text_generation_router: router/src/main.rs:369: Invalid hostname, defaulting to 0.0.0.0
And this is the YAML that I used for the GPU deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b-1.1-it
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: ghcr.io/huggingface/text-generation-inference:2.0.2
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
        args:
        - --model-id=$(MODEL_ID)
        - --num-shard=1
        env:
        - name: MODEL_ID
          value: ICTNLP/Llama-2-7b-chat-TruthX
        - name: PORT
          value: "8000"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
Taken from https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi.
So it seems like this is an issue on the Optimum TPU side.
Looked into this with @Edwinhr716. Optimum TPU is failing to load these models for various reasons:
RLHFlow/ArmoRM-Llama3-8B-v0.1:
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 818, in _mp_fn
generator = TpuGeneratorSingleThread.from_pretrained(model_path, revision, max_batch_size, max_sequence_length)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 776, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/opt/optimum-tpu/optimum/tpu/modeling.py", line 64, in from_pretrained
model = cls.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3626, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/opt/optimum-tpu/optimum/tpu/modeling_llama.py", line 1182, in __init__
self.lm_head = ColumnParallelLinear.create(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 605, in create
return ColumnParallelLinear(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 496, in __init__
self.output_size_per_partition = divide_and_check_no_remainder(out_features, self.world_size)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 218, in divide_and_check_no_remainder
ensure_divisibility(numerator, denominator)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 212, in ensure_divisibility
assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 128257 is not divisible by 4
Trendyol/Trendyol-LLM-7b-base-v0.1:
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
return fn()
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 187, in __call__
self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 818, in _mp_fn
generator = TpuGeneratorSingleThread.from_pretrained(model_path, revision, max_batch_size, max_sequence_length)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 776, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/opt/optimum-tpu/optimum/tpu/modeling.py", line 64, in from_pretrained
model = cls.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3626, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/opt/optimum-tpu/optimum/tpu/modeling_llama.py", line 1182, in __init__
self.lm_head = ColumnParallelLinear.create(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 605, in create
return ColumnParallelLinear(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 496, in __init__
self.output_size_per_partition = divide_and_check_no_remainder(out_features, self.world_size)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 218, in divide_and_check_no_remainder
ensure_divisibility(numerator, denominator)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 212, in ensure_divisibility
assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 44222 is not divisible by 4
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
return fn()
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 187, in __call__
self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 818, in _mp_fn
generator = TpuGeneratorSingleThread.from_pretrained(model_path, revision, max_batch_size, max_sequence_length)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 776, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/opt/optimum-tpu/optimum/tpu/modeling.py", line 64, in from_pretrained
model = cls.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3305, in from_pretrained
raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory /data/models--ICTNLP--Llama-2-7b-chat-TruthX/snapshots/2d186e966af6eaa237495a39433a6f6d7de3ad9e.
For some reason the stdout or stderr logs are not getting redirected back to the console.
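Side note: both failure modes can be spotted from the repo metadata before launching the server. A rough sketch follows (not part of optimum-tpu; it assumes public, ungated repos and 4 TPU shards, matching the assertions above):

from huggingface_hub import list_repo_files
from transformers import AutoConfig

NUM_TPU_CHIPS = 4

for repo in [
    "RLHFlow/ArmoRM-Llama3-8B-v0.1",
    "Trendyol/Trendyol-LLM-7b-base-v0.1",
    "ICTNLP/Llama-2-7b-chat-TruthX",
]:
    config = AutoConfig.from_pretrained(repo)  # config only, no weights downloaded
    files = list_repo_files(repo)              # file listing straight from the Hub
    sharding_ok = config.vocab_size % NUM_TPU_CHIPS == 0
    has_safetensors = any(f.endswith(".safetensors") for f in files)
    print(f"{repo}: vocab divisible by {NUM_TPU_CHIPS}: {sharding_ok}, "
          f"safetensors present: {has_safetensors}")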
Hi @richardsliu, for RLHFlow/ArmoRM-Llama3-8B-v0.1 and Trendyol/Trendyol-LLM-7b-base-v0.1 I think the issue is that our sharding technique currently requires the last dimension (here the lm_head output size, i.e. the vocabulary size) to be divisible by the number of TPUs. We might be able to find a workaround; otherwise, for now, a solution could be to pad the weights so that dimension becomes a multiple of the number of accelerators. For ICTNLP/Llama-2-7b-chat-TruthX the problem is that, for now, TGI on TPU requires the model to be in safetensors format. You can modify that here: https://github.com/huggingface/optimum-tpu/blob/da2d1ad89d3d8a0ffb85eb5d6d6b9919e646e741/optimum/tpu/model.py#L52 Hope it helps.
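For reference, here is a minimal sketch of both workarounds (padding the vocabulary and re-exporting as safetensors). It assumes the source repo loads with plain transformers on CPU, that the model fits in host memory, and that NUM_TPU_CHIPS matches the TPU topology you deploy on (4 in the assertions above); resize_token_embeddings and save_pretrained are standard transformers calls, the rest is illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Trendyol/Trendyol-LLM-7b-base-v0.1"  # or one of the other repos above
NUM_TPU_CHIPS = 4  # number of TPU shards the server will use

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 1) Pad the embeddings / lm_head so the vocab dimension is divisible by the
#    shard count (the rows added here are untrained padding).
model.resize_token_embeddings(pad_to_multiple_of=NUM_TPU_CHIPS)

# 2) Re-export the weights as safetensors, which the TPU TGI loader expects.
model.save_pretrained("exported-model", safe_serialization=True)
tokenizer.save_pretrained("exported-model")

The padded rows are never produced by the tokenizer, so generation should be unaffected; pushing the exported folder to a Hub repo and pointing MODEL_ID at it should then let the TPU container load it.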
I've been testing various fine-tuned versions of supported models on GKE. However, the deployment gets stuck on
Using the Hugging Face API to retrieve tokenizer config
These are the full logs.
I get this issue with the following models: RLHFlow/ArmoRM-Llama3-8B-v0.1, Trendyol/Trendyol-LLM-7b-base-v0.1, ICTNLP/Llama-2-7b-chat-TruthX.
However, I've also been able to successfully run the following models: UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3, georgesung/llama2_7b_chat_uncensored
I was wondering whether there are any requirements in terms of which files are needed to run a fine-tuned model, or if I could get any help debugging the issue.
This is the YAML that I used to deploy: