Edwinhr716 opened this issue 5 months ago
Hi @Edwinhr716, we're investigating this and will get back to you!
Any update on this? Why won't any fine-tuned versions be supported?
The error is:
2024-07-02T18:54:54.121182Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
So it looks like you are missing a Hugging Face token. Did you run huggingface-cli login --token [TOKEN]?
I don't think it is related to that log. I've had the same log show up in successful deployments of Optimum TPU. Since I'm building the image with Docker, would running that in my terminal work anyway? I already pass the token as an environment variable.
I also want to add that I did not change anything in the Dockerfile; all I did to build the image was run make tpu-tgi.
It needs to be run in the same container as tgi-tpu. Try adding that as an init-container.
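For what it's worth, huggingface-cli login essentially just persists the token to the cache file that hf_hub is complaining about, so an init step could do the same in a couple of lines. This is a minimal sketch, assuming huggingface_hub is installed and the token is exposed as an HF_TOKEN environment variable; note that for an init container this only helps if /root/.cache/huggingface sits on a volume shared with the TGI container:

import os
from huggingface_hub import login

# Writes the token to the local Hugging Face cache (the same file that
# `huggingface-cli login --token ...` creates), so later hf_hub lookups find it.
login(token=os.environ["HF_TOKEN"], add_to_git_credential=False)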
@Edwinhr716 could you share the full logs? I think you'd mentioned over chat that you saw some warning like this:
Could not find a fast tokenizer implementation for Trendyol/Trendyol-LLM-7b-base-v0.1
Those are the full logs. That was the initial issue that I was facing, but I circumvented it by duplicating the Trendyol repo and adding the Llama2 tokenizer here: https://huggingface.co/Edwinhr716/Trendyol-LLM-7b-chat-v0.1-duplicate/tree/main. After doing that, I get the issue mentioned here.
Based on the comment in the Slack channel (https://huggingface.slack.com/archives/C06GAFTA5AN/p1721259640899439), it looks like this may be due to a known issue with the TGI container serving fine-tuned models?
I don't think so. I just tested deploying ICTNLP/Llama-2-7b-chat-TruthX with the TGI 2.0.2 release on GPU and it worked. These are the log messages from that TGI deployment:
2024-07-18T17:34:52.776080Z INFO text_generation_router: router/src/main.rs:195: Using the Hugging Face API
2024-07-18T17:34:52.776138Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-07-18T17:34:53.048994Z INFO text_generation_router: router/src/main.rs:474: Serving revision 2d186e966af6eaa237495a39433a6f6d7de3ad9e of model ICTNLP/Llama-2-7b-chat-TruthX
2024-07-18T17:34:53.087825Z INFO text_generation_router: router/src/main.rs:289: Using config Some(Llama)
2024-07-18T17:34:53.090733Z INFO text_generation_router: router/src/main.rs:317: Warming up model
2024-07-18T17:34:53.953134Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-07-18T17:34:54.490155Z INFO text_generation_router: router/src/main.rs:354: Setting max batch total tokens to 125856
2024-07-18T17:34:54.490179Z INFO text_generation_router: router/src/main.rs:355: Connected
2024-07-18T17:34:54.490183Z WARN text_generation_router: router/src/main.rs:369: Invalid hostname, defaulting to 0.0.0.0
And this is the YAML that I used for the GPU deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-gemma-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gemma-server
  template:
    metadata:
      labels:
        app: gemma-server
        ai.gke.io/model: gemma-2b-1.1-it
        ai.gke.io/inference-server: text-generation-inference
        examples.ai.gke.io/source: user-guide
    spec:
      containers:
      - name: inference-server
        image: ghcr.io/huggingface/text-generation-inference:2.0.2
        resources:
          requests:
            nvidia.com/gpu: 1
          limits:
            nvidia.com/gpu: 1
        args:
        - --model-id=$(MODEL_ID)
        - --num-shard=1
        env:
        - name: MODEL_ID
          value: ICTNLP/Llama-2-7b-chat-TruthX
        - name: PORT
          value: "8000"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: gemma-server
  type: ClusterIP
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
Taken from https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-tgi.
So it seems like this is an issue on the Optimum TPU side.
Looked into this with @Edwinhr716. Optimum TPU is failing to load these models for various reasons:
RLHFlow/ArmoRM-Llama3-8B-v0.1:
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 818, in _mp_fn
generator = TpuGeneratorSingleThread.from_pretrained(model_path, revision, max_batch_size, max_sequence_length)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 776, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/opt/optimum-tpu/optimum/tpu/modeling.py", line 64, in from_pretrained
model = cls.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3626, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/opt/optimum-tpu/optimum/tpu/modeling_llama.py", line 1182, in __init__
self.lm_head = ColumnParallelLinear.create(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 605, in create
return ColumnParallelLinear(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 496, in __init__
self.output_size_per_partition = divide_and_check_no_remainder(out_features, self.world_size)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 218, in divide_and_check_no_remainder
ensure_divisibility(numerator, denominator)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 212, in ensure_divisibility
assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 128257 is not divisible by 4
Trendyol/Trendyol-LLM-7b-base-v0.1:
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
return fn()
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 187, in __call__
self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 818, in _mp_fn
generator = TpuGeneratorSingleThread.from_pretrained(model_path, revision, max_batch_size, max_sequence_length)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 776, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/opt/optimum-tpu/optimum/tpu/modeling.py", line 64, in from_pretrained
model = cls.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3626, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/opt/optimum-tpu/optimum/tpu/modeling_llama.py", line 1182, in __init__
self.lm_head = ColumnParallelLinear.create(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 605, in create
return ColumnParallelLinear(
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 496, in __init__
self.output_size_per_partition = divide_and_check_no_remainder(out_features, self.world_size)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 218, in divide_and_check_no_remainder
ensure_divisibility(numerator, denominator)
File "/opt/optimum-tpu/optimum/tpu/xla_model_parallel.py", line 212, in ensure_divisibility
assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
AssertionError: 44222 is not divisible by 4
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 71, in _thread_fn
return fn()
File "/usr/local/lib/python3.10/dist-packages/torch_xla/_internal/pjrt.py", line 187, in __call__
self.fn(runtime.global_ordinal(), *self.args, **self.kwargs)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 818, in _mp_fn
generator = TpuGeneratorSingleThread.from_pretrained(model_path, revision, max_batch_size, max_sequence_length)
File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 776, in from_pretrained
model = AutoModelForCausalLM.from_pretrained(
File "/opt/optimum-tpu/optimum/tpu/modeling.py", line 64, in from_pretrained
model = cls.from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3305, in from_pretrained
raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory /data/models--ICTNLP--Llama-2-7b-chat-TruthX/snapshots/2d186e966af6eaa237495a39433a6f6d7de3ad9e.
For some reason the stdout or stderr logs are not getting redirected back to the console.
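Side note: both failure modes can be spotted from the repo metadata before launching the server. A rough sketch follows (not part of optimum-tpu; it assumes public, ungated repos and 4 TPU shards, matching the assertions above):

from huggingface_hub import list_repo_files
from transformers import AutoConfig

NUM_TPU_CHIPS = 4

for repo in [
    "RLHFlow/ArmoRM-Llama3-8B-v0.1",
    "Trendyol/Trendyol-LLM-7b-base-v0.1",
    "ICTNLP/Llama-2-7b-chat-TruthX",
]:
    config = AutoConfig.from_pretrained(repo)  # config only, no weights downloaded
    files = list_repo_files(repo)              # file listing straight from the Hub
    sharding_ok = config.vocab_size % NUM_TPU_CHIPS == 0
    has_safetensors = any(f.endswith(".safetensors") for f in files)
    print(f"{repo}: vocab divisible by {NUM_TPU_CHIPS}: {sharding_ok}, "
          f"safetensors present: {has_safetensors}")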
Hi @richardsliu, for RLHFlow/ArmoRM-Llama3-8B-v0.1 and Trendyol/Trendyol-LLM-7b-base-v0.1 I think the issue is that our sharding technique currently requires the last dimension (here the lm_head output size, i.e. the vocabulary size) to be divisible by the number of TPUs. We might be able to find a workaround; otherwise, for now, a solution could be to pad the weights so that dimension becomes a multiple of the number of accelerators. For ICTNLP/Llama-2-7b-chat-TruthX the problem is that, for now, TGI on TPU requires the model to be in safetensors format. You can modify that here: https://github.com/huggingface/optimum-tpu/blob/da2d1ad89d3d8a0ffb85eb5d6d6b9919e646e741/optimum/tpu/model.py#L52 Hope it helps.
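For reference, here is a minimal sketch of both workarounds (padding the vocabulary and re-exporting as safetensors). It assumes the source repo loads with plain transformers on CPU, that the model fits in host memory, and that NUM_TPU_CHIPS matches the TPU topology you deploy on (4 in the assertions above); resize_token_embeddings and save_pretrained are standard transformers calls, the rest is illustrative:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Trendyol/Trendyol-LLM-7b-base-v0.1"  # or one of the other repos above
NUM_TPU_CHIPS = 4  # number of TPU shards the server will use

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 1) Pad the embeddings / lm_head so the vocab dimension is divisible by the
#    shard count (the rows added here are untrained padding).
model.resize_token_embeddings(pad_to_multiple_of=NUM_TPU_CHIPS)

# 2) Re-export the weights as safetensors, which the TPU TGI loader expects.
model.save_pretrained("exported-model", safe_serialization=True)
tokenizer.save_pretrained("exported-model")

The padded rows are never produced by the tokenizer, so generation should be unaffected; pushing the exported folder to a Hub repo and pointing MODEL_ID at it should then let the TPU container load it.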
I've been testing various fine-tuned versions of supported models on GKE. However, the deployment gets stuck on
Using the Hugging Face API to retrieve tokenizer config
These are the full logs.
I get this issue with the following models: RLHFlow/ArmoRM-Llama3-8B-v0.1, Trendyol/Trendyol-LLM-7b-base-v0.1, ICTNLP/Llama-2-7b-chat-TruthX.
However, I've also been able to successfully run the following models: UCLA-AGI/Llama-3-Instruct-8B-SPPO-Iter3, georgesung/llama2_7b_chat_uncensored
I was wondering whether there are any requirements in terms of which files are needed to run a fine-tuned model, or if I could get any help debugging the issue.
This is the YAML that I used to deploy: