huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

TGI keeps referencing the default model in the image (bigscience/bloom) #2534

Closed BeylasanRuzaiqi closed 1 month ago

BeylasanRuzaiqi commented 1 month ago

System Info

I have deployed TGI on an NVIDIA GPU successfully, but when downloading another model from Hugging Face, it keeps referring to the model bigscience/bloom-560m. How do I stop this, or make another model the default? Also, how do I list the models available for inference?

```
text-generation-launcher --env
2024-09-18T09:23:46.627404Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043ded45e14ed24188d5a963911c96049618
Docker label: sha-db7e043
nvidia-smi: Wed Sep 18 09:23:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12     CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0              67W / 400W |   8831MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
xpu-smi: N/A
2024-09-18T09:23:46.627578Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-server-5f75ff8bcb-mzxnd", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-09-18T09:23:46.627910Z  INFO text_generation_launcher: Default max_input_tokens to 4095
2024-09-18T09:23:46.627920Z  INFO text_generation_launcher: Default max_total_tokens to 4096
2024-09-18T09:23:46.627930Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-09-18T09:23:46.627938Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-18T09:23:46.628271Z  INFO download: text_generation_launcher: Starting check and download process for bigscience/bloom-560m
2024-09-18T09:23:50.306621Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-09-18T09:23:51.133538Z  INFO download: text_generation_launcher: Successfully downloaded weights for bigscience/bloom-560m
```

Information

Tasks

Reproduction

text-generation-launcher --model-id $model

Expected behavior

TGI should refer to the newly launched model (Llama 3).

alvarobartt commented 1 month ago

Hi @BeylasanRuzaiqi, that's most likely because you didn't specify the model id or path properly; could you check that the $model variable contains the actual model name? Here's a breakdown of all the available arguments for text-generation-launcher.

Also, to explore all the text-generation-inference compatible models, you can browse the Hugging Face Hub with the text-generation-inference tag, or just check the supported models and hardware.
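
For example, one quick way to list some of those models from the command line is to query the Hub API directly. This is only a rough sketch: the limit and sort parameters are just one possible combination, and jq is only used here for readability.

```shell
# List some models tagged `text-generation-inference`, sorted by downloads
curl -s "https://huggingface.co/api/models?filter=text-generation-inference&sort=downloads&limit=10" \
  | jq -r '.[].id'
```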

P.S. Since your use case is the latest Meta Llama model, e.g. meta-llama/Meta-Llama-3.1-8B, note that you first need to accept the terms as it's a gated model; then you also need to run huggingface-cli login in advance, or provide an authentication token via the HF_TOKEN or HUGGING_FACE_HUB_TOKEN environment variable.
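
When running the official Docker image, a minimal sketch of that would look like the following (the token value and the host cache path are placeholders):

```shell
# Run TGI with a gated model, passing the Hugging Face token into the container
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=<your-hf-token> \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id meta-llama/Meta-Llama-3.1-8B-Instruct
```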

BeylasanRuzaiqi commented 1 month ago

Hi @alvarobartt, thanks for replying.

I did assign the Llama model name to $model, and I also tried $llama3 to avoid overwriting it. I also double-checked that meta-llama/meta-llama3-8b-instruct is compatible.

I have also followed these steps, since it is a gated model.

:(

alvarobartt commented 1 month ago

Could you try to just run text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct? I'm not sure whether this is related to how the variable is being handled.

Additionally, could you send the ls -la output of the /data directory, since it's used as the Hugging Face cache?
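
Something along these lines, assuming you run it from inside the container or instance where TGI is installed:

```shell
# Check what the variable actually expands to
echo "model=$model"

# Launch with the model id spelled out explicitly
text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct

# Inspect the directory used as the Hugging Face cache in this setup
ls -la /data
```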

BeylasanRuzaiqi commented 1 month ago

Hi! I just did, but when calling the API using curl, it outputs this: curl 10.8.64.158:8080/info {"model_id":"bigscience/bloom-560m"," .......

alvarobartt commented 1 month ago

Are you running via Docker, or are you inside an instance with TGI installed?

Could you clean /data, then run text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct, and after that share the logs? Thanks!
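
For reference, a rough sketch of those steps; the cache under /data follows the usual huggingface_hub models--<org>--<name> layout, and the /info call at the end is just to confirm which model the server ends up serving (host and port are placeholders):

```shell
# Remove the cached weights so TGI re-resolves the model from scratch
rm -rf /data/models--*

# Relaunch with the desired model
text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct

# Once the server is up, confirm which model it is actually serving
curl -s http://localhost:80/info
```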

BeylasanRuzaiqi commented 1 month ago

Hi! I am running via the TGI Docker image ghcr.io/huggingface/text-generation-inference:2.2.0.

Logs:

```
text-generation-launcher --model-id $model
2024-09-22T08:31:24.716417Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B-Instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-server-5f75ff8bcb-mzxnd", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-09-22T08:31:24.716581Z  INFO text_generation_launcher: Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191.
2024-09-22T08:31:24.716590Z  INFO text_generation_launcher: Default max_input_tokens to 4095
2024-09-22T08:31:24.716594Z  INFO text_generation_launcher: Default max_total_tokens to 4096
2024-09-22T08:31:24.716597Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-09-22T08:31:24.716601Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-22T08:31:24.716782Z  INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3-8B-Instruct
2024-09-22T08:31:28.834553Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-09-22T08:31:29.622112Z  INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3-8B-Instruct
2024-09-22T08:31:29.622641Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-22T08:31:39.650586Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-22T08:31:49.702261Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-22T08:31:59.798493Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
```

then:

```
thread 'main' panicked at /usr/src/router/src/server.rs:1910:67:
called `Result::unwrap()` on an `Err` value: Os { code: 98, kind: AddrInUse, message: "Address already in use" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-09-22T08:37:25.387376Z ERROR text_generation_launcher: Webserver Crashed
2024-09-22T08:37:25.387424Z  INFO text_generation_launcher: Shutting down shards
2024-09-22T08:37:25.472076Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-09-22T08:37:25.472140Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-09-22T08:37:26.574326Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
```

alvarobartt commented 1 month ago

Hi @BeylasanRuzaiqi, thanks for sharing the logs! The --model-id arg is being picked up properly now, but something else deployed on the same host is already bound to 0.0.0.0 on port 80; so you can either stop that service or launch TGI on a different port via --port.

As the traceback states Address already in use, you will need to make sure that there are no other services using 0.0.0.0:80 at the same time.
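
A quick way to check that (one of these commands is usually available on the host):

```shell
# See which process is already listening on port 80
ss -ltnp | grep ':80 '
# or, alternatively:
lsof -i :80

# Either stop that process, or start TGI on a free port instead
text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct --port 8081
```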

BeylasanRuzaiqi commented 1 month ago

Hi @alvarobartt,

1. I did clean the data directory and killed all instances of text-generation-launcher.
2. No other service is using port 80.

I am also still getting the "Waiting for shard to be ready" messages, and after a while, WebserverFailed.

alvarobartt commented 1 month ago

Could you share the full stack trace of the error @BeylasanRuzaiqi? Thanks in advance 🤗

BeylasanRuzaiqi commented 1 month ago

Hi @alvarobartt, thanks for your prompt replies. I scaled down the environment where it's not working and am now trying with this YAML file:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-generation-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-generation-inference
  template:
    metadata:
      labels:
        app: text-generation-inference
    spec:
      containers:
```

I am running the server in a Kubernetes environment, so let me know if there is anything else to add (for now I am testing with an open-source, non-gated model).

alvarobartt commented 1 month ago

Oh fair, I believe you're missing the shared memory device, as described in https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm.

Here's how your updated Kubernetes manifest should look:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-generation-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-generation-inference
  template:
    metadata:
      labels:
        app: text-generation-inference
    spec:
      containers:
      - name: text-generation-inference
        image: ghcr.io/huggingface/text-generation-inference:latest
        args:
          - "--model-id"
          - "$(MODEL_ID)"
          - "--num-shard"
          - "$(NUM_SHARD)"
          - "--quantize"
          - "$(QUANTIZE)"
        env:
          - name: MODEL_ID
            value: "openai-community/gpt2"
          - name: NUM_SHARD
            value: "1"
          - name: QUANTIZE
            value: "bitsandbytes"
        resources:
          limits:
            nvidia.com/gpu: 1 # Adjust based on your GPU requirements
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - name: data
            mountPath: /data
        ports:
          - containerPort: 80
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  selector:
    app: text-generation-inference
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80
```
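
For completeness, rolling this out and following the launcher logs could look roughly like the following; the file name and Secret name are illustrative, and for gated models the token still has to reach the container (e.g. by referencing the Secret as an env var in the Deployment spec):

```shell
# Apply the Deployment and Service above
kubectl apply -f tgi-deployment.yaml

# For gated models, store the Hugging Face token in a Secret and expose it to the
# container as the HF_TOKEN environment variable (e.g. via secretKeyRef in the spec)
kubectl create secret generic hf-token --from-literal=HF_TOKEN=<your-hf-token>

# Check the pod status and follow the launcher logs until the shard is ready
kubectl get pods -l app=text-generation-inference
kubectl logs -f deploy/text-generation-inference
```
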
BeylasanRuzaiqi commented 1 month ago

Hi @alvarobartt, I successfully deployed the application using this YAML file with a gated model added, and the deployment itself worked. However, inside the pod the model is not running (the log keeps showing "Waiting for shard to be ready").

```
text-generation-launcher --env
2024-09-23T11:58:54.783961Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 9263817c718db3a43791ff6b8d53355d6e8aa310
Docker label: sha-9263817
nvidia-smi: Mon Sep 23 11:58:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12     CUDA Version: 12.2    |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:B7:00.0 Off |                    0 |
| N/A   33C    P0              68W / 400W |   2511MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
xpu-smi: N/A
2024-09-23T11:58:54.784008Z  INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3.1-8B", revision: None, validation_workers: 2, sharded: None, num_shard: Some(1), quantize: Some(Eetq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "text-generation-inference-f89c9cfb6-fdnz5", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, lora_adapters: None, usage_stats: On }
2024-09-23T11:58:54.784088Z  INFO hf_hub: Token file not found "/data/token"
2024-09-23T11:58:54.784207Z  INFO text_generation_launcher: Model supports up to 131072 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=131122 --max-total-tokens=131072 --max-input-tokens=131071.
2024-09-23T11:58:54.784216Z  INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-09-23T11:58:54.784231Z  INFO text_generation_launcher: Default max_input_tokens to 4095
2024-09-23T11:58:54.784235Z  INFO text_generation_launcher: Default max_total_tokens to 4096
2024-09-23T11:58:54.784237Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-09-23T11:58:54.784242Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-23T11:58:54.784467Z  INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B
2024-09-23T11:58:58.788835Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-09-23T11:58:59.825233Z  INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B
2024-09-23T11:58:59.825772Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-23T11:59:03.563083Z  INFO text_generation_launcher: Using prefix caching = True
2024-09-23T11:59:03.563136Z  INFO text_generation_launcher: Using Attention = flashinfer
2024-09-23T11:59:09.914622Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
[... "Waiting for shard to be ready... rank=0" repeated roughly every 10 seconds ...]
2024-09-23T12:02:50.628042Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
```

Then it fails after a couple of minutes.

alvarobartt commented 1 month ago

As it says "Files are already present on the host. Skipping download.", could you try cleaning the data mount that you're using, i.e. /data? Additionally, if needed, you can also increase the shared device memory.
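
A rough sketch of both checks, assuming the Deployment name from the manifest above:

```shell
# Open a shell inside the running TGI pod
kubectl exec -it deploy/text-generation-inference -- bash

# Inside the pod: clear the cached weights and check the shared-memory mount size
rm -rf /data/models--*
df -h /dev/shm
```

To give the pod more shared memory, you can raise the sizeLimit of the dshm emptyDir volume in the manifest.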

BeylasanRuzaiqi commented 1 month ago

I tried that, and this is the error I get:

```
2024-09-23T12:10:50.638248Z  INFO text_generation_router::server: router/src/server.rs:2515: Serving revision 48d6d0fc4e02fb1269b36940650a1b7233035cbb of model meta-llama/Meta-Llama-3.1-8B
2024-09-23T12:10:55.714137Z  INFO text_generation_router::server: router/src/server.rs:1943: Using config Some(Llama)
2024-09-23T12:10:56.359669Z  WARN text_generation_router::server: router/src/server.rs:2090: Invalid hostname, defaulting to 0.0.0.0
2024-09-23T12:10:56.485234Z  INFO text_generation_router::server: router/src/server.rs:2477: Connected
2024-09-23T12:11:41.494451Z  INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 25
2024-09-23T12:11:41.494757Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:41.494856Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:41.494865Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:54.107180Z  INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 4 - Suffix 22
[... the same prefill / clear_cache / generate errors ("error trying to connect: No such file or directory (os error 2)") repeat for the following requests (batch ids 1 and 2) ...]
```

BeylasanRuzaiqi commented 1 month ago

Great news @alvarobartt!! It seems it was an issue with Llama 3.1; when I tried deploying Llama3-8b, it worked immediately.

This is the YAML file I have used (attached as txt): tgi-llama3.txt

alvarobartt commented 1 month ago

Great @BeylasanRuzaiqi, happy to help! Do you mind closing the issue if it's already solved? Thanks 🤗

BeylasanRuzaiqi commented 1 month ago

Sure, thanks @alvarobartt!