Hi there @BeylasanRuzaiqi, that's most likely because you didn't specify the model id or path properly; could you check that the `$model` variable contains the actual model name? Here's a breakdown with all the available arguments for `text-generation-launcher`.
Also, to explore all the `text-generation-inference`-compatible models you can browse the Hugging Face Hub with the `text-generation-inference` tag, or just check the supported models and hardware.
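If it helps, the same tag can also be queried from the command line; a rough sketch against the public Hub API (this endpoint and its `filter` parameter are part of the Hub, not of TGI itself):

```bash
# List Hub models tagged text-generation-inference (just the first few ids)
curl -s "https://huggingface.co/api/models?filter=text-generation-inference" \
  | grep -o '"id":"[^"]*"' | head -n 5
```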
P.S. As your use case is the latest Meta Llama model, e.g. `meta-llama/Meta-Llama-3.1-8B`, note that you first need to accept the terms as it's a gated model; then you also need to run `huggingface-cli login` in advance, or provide an authentication token via the `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` environment variables.
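For instance, when running the official Docker image, the token is usually just forwarded into the container as an environment variable; a minimal sketch along the lines of the README quickstart (the token value and cache path below are placeholders):

```bash
# Log in once so huggingface-cli stores the token locally...
huggingface-cli login

# ...or export it explicitly and forward it to the TGI container
export HF_TOKEN=hf_xxx   # placeholder token
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id meta-llama/Meta-Llama-3.1-8B
```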
Hi @alvarobartt, thanks for replying.
I did specify the Llama model name in `$model`, and I also tried `$llama3` to avoid overwriting it.
I also double-checked that `meta-llama/meta-llama3-8b-instruct` is compatible, and I have followed these steps since it is a gated model.
:(
Could you try to just run `text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct`? Not sure if that's something related to the variable handling.
Additionally, could you send the `ls -la` output of the `/data` directory, as it's used for the Hugging Face cache?
Hi! I just did, but upon calling the API using curl it outputs this:
```
curl 10.8.64.158:8080/info
{"model_id":"bigscience/bloom-560m", ...
```
Are you running via Docker? Or are you inside an instance with TGI?
Could you clean `/data`, then run `text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct`, and after that share the logs? Thanks!
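For instance, something along these lines from inside the container (assuming `/data` is the mounted Hugging Face cache, so its contents are safe to re-download; the log path is arbitrary):

```bash
# Drop the cached bloom-560m snapshot (or wipe the whole cache with rm -rf /data/*)
rm -rf /data/models--bigscience--bloom-560m

# Relaunch against the Llama model and keep the logs for sharing
text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct | tee /tmp/tgi.log
```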
Hi! I am running via the TGI Docker image `ghcr.io/huggingface/text-generation-inference:2.2.0`.
Logs:
```
text-generation-launcher --model-id $model
2024-09-22T08:31:24.716417Z INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B-Instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-server-5f75ff8bcb-mzxnd", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-09-22T08:31:24.716581Z INFO text_generation_launcher: Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`.
2024-09-22T08:31:24.716590Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-09-22T08:31:24.716594Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-09-22T08:31:24.716597Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-09-22T08:31:24.716601Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-22T08:31:24.716782Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3-8B-Instruct
2024-09-22T08:31:28.834553Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-09-22T08:31:29.622112Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3-8B-Instruct
2024-09-22T08:31:29.622641Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-22T08:31:39.650586Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-22T08:31:49.702261Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-22T08:31:59.798493Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
```
Then:
```
thread 'main' panicked at /usr/src/router/src/server.rs:1910:67:
called `Result::unwrap()` on an `Err` value: Os { code: 98, kind: AddrInUse, message: "Address already in use" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2024-09-22T08:37:25.387376Z ERROR text_generation_launcher: Webserver Crashed
2024-09-22T08:37:25.387424Z INFO text_generation_launcher: Shutting down shards
2024-09-22T08:37:25.472076Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-09-22T08:37:25.472140Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-09-22T08:37:26.574326Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
```
Hi @BeylasanRuzaiqi, thanks for sharing the logs! So the `--model-id` arg is picked up properly now, but something else deployed on the same host is already bound to 0.0.0.0:80, so you can try to:

- Kill any running `text-generation-launcher` instances with `pkill -f text-generation-launcher`

But as the traceback states `Address already in use`, you will need to make sure that no other service is using 0.0.0.0:80 at the same time.
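If it keeps happening, here's a rough sketch of how to check what is holding the port, or to simply bind the TGI router somewhere else (the `--port` launcher flag exists for this; whether `ss` or `lsof` is available depends on your base image):

```bash
# Find the process already bound to port 80
ss -ltnp | grep ':80 ' || lsof -i :80

# Or bind the TGI router to another port instead and query that one
text-generation-launcher --model-id meta-llama/Meta-Llama-3.1-8B-Instruct --port 8081
curl 127.0.0.1:8081/info
```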
Hi @alvarobartt,
1- I did clean the data directory and killed all instances of `text-generation-launcher`.
2- No other service is using port 80.
I am still getting the "Waiting for shard to be ready" messages and then, after a while, `Error: WebserverFailed`.
Could you share the full stack trace of the error @BeylasanRuzaiqi? Thanks in advance 🤗
Hi @alvarobartt, thanks for your prompt replies. I scaled down the environment where it's not working and am trying out this YAML file:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-generation-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-generation-inference
  template:
    metadata:
      labels:
        app: text-generation-inference
    spec:
      containers:
      # ...
---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  selector:
    app: text-generation-inference
  ports:
  # ...
```
I am running the server in a k8s environment. Let me know if there is anything else to add (for now I am testing with an open-source, non-gated model).
Oh fair, I believe you're missing the shared memory device, as described in https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#a-note-on-shared-memory-shm.
Here's how your updated Kubernetes manifest should look:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: text-generation-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: text-generation-inference
  template:
    metadata:
      labels:
        app: text-generation-inference
    spec:
      containers:
        - name: text-generation-inference
          image: ghcr.io/huggingface/text-generation-inference:latest
          args:
            - "--model-id"
            - "$(MODEL_ID)"
            - "--num-shard"
            - "$(NUM_SHARD)"
            - "--quantize"
            - "$(QUANTIZE)"
          env:
            - name: MODEL_ID
              value: "openai-community/gpt2"
            - name: NUM_SHARD
              value: "1"
            - name: QUANTIZE
              value: "bitsandbytes"
          resources:
            limits:
              nvidia.com/gpu: 1  # Adjust based on your GPU requirements
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - name: data
              mountPath: /data
          ports:
            - containerPort: 80
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 1Gi
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: text-generation-inference
spec:
  selector:
    app: text-generation-inference
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 80
```
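Once applied, you can sanity-check the deployment from your machine with a port-forward; a minimal sketch, assuming the manifest above is saved as `tgi.yaml` (the filename is just an example):

```bash
kubectl apply -f tgi.yaml
kubectl rollout status deployment/text-generation-inference
kubectl port-forward svc/text-generation-inference 8080:8080

# ...then, from a second terminal, check which model the router actually loaded
curl 127.0.0.1:8080/info
```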
Hi @alvarobartt, I successfully deployed the application using this YAML file (after adding a gated model) and the deployment itself works; however, inside the pod the model is not running, it keeps logging "Waiting for shard to be ready":

```
text-generation-launcher --env
2024-09-23T11:58:54.783961Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.80.0
Commit sha: 9263817c718db3a43791ff6b8d53355d6e8aa310
Docker label: sha-9263817
nvidia-smi:
Mon Sep 23 11:58:54 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:B7:00.0 Off | 0 |
| N/A 33C P0 68W / 400W | 2511MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
xpu-smi:
N/A
2024-09-23T11:58:54.784008Z INFO text_generation_launcher: Args {
model_id: "meta-llama/Meta-Llama-3.1-8B",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
1,
),
quantize: Some(
Eetq,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "text-generation-inference-f89c9cfb6-fdnz5",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: true,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
}
2024-09-23T11:58:54.784088Z INFO hf_hub: Token file not found "/data/token"
2024-09-23T11:58:54.784207Z INFO text_generation_launcher: Model supports up to 131072 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=131122 --max-total-tokens=131072 --max-input-tokens=131071`.
2024-09-23T11:58:54.784216Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2024-09-23T11:58:54.784231Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-09-23T11:58:54.784235Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-09-23T11:58:54.784237Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-09-23T11:58:54.784242Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-23T11:58:54.784467Z INFO download: text_generation_launcher: Starting check and download process for meta-llama/Meta-Llama-3.1-8B
2024-09-23T11:58:58.788835Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-09-23T11:58:59.825233Z INFO download: text_generation_launcher: Successfully downloaded weights for meta-llama/Meta-Llama-3.1-8B
2024-09-23T11:58:59.825772Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-23T11:59:03.563083Z INFO text_generation_launcher: Using prefix caching = True
2024-09-23T11:59:03.563136Z INFO text_generation_launcher: Using Attention = flashinfer
2024-09-23T11:59:09.914622Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T11:59:19.938466Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T11:59:29.946165Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T11:59:39.960792Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T11:59:50.028923Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:00:00.089804Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:00:10.102785Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:00:20.124695Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:00:30.133097Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:00:40.148797Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:00:50.174522Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:01:00.270114Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:01:10.290190Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:01:20.347169Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:01:30.355719Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:01:40.404538Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:01:50.460307Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:02:00.499310Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:02:10.543806Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:02:20.551451Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:02:30.591694Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:02:40.618751Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-09-23T12:02:50.628042Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
```
Then it fails after a couple of minutes.
As it says `Files are already present on the host. Skipping download.`, could you try cleaning the data mount that you're using, i.e. `/data`? Additionally, if needed you can also increase the shared memory size.
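For instance (a sketch against the manifest above; since `/data` is an `emptyDir`, recreating the pod already starts from an empty cache):

```bash
# Recreate the pod to get a fresh /data emptyDir
kubectl rollout restart deployment/text-generation-inference

# Or wipe the cache in place inside the running pod
kubectl exec deploy/text-generation-inference -- sh -c 'rm -rf /data/*'
```

And if you need more shared memory, you can raise the `sizeLimit` of the `dshm` volume in the manifest above from `1Gi` to a larger value.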
I tried that, and this is the error I get:
```
2024-09-23T12:10:50.638248Z INFO text_generation_router::server: router/src/server.rs:2515: Serving revision 48d6d0fc4e02fb1269b36940650a1b7233035cbb of model meta-llama/Meta-Llama-3.1-8B
2024-09-23T12:10:55.714137Z INFO text_generation_router::server: router/src/server.rs:1943: Using config Some(Llama)
2024-09-23T12:10:56.359669Z WARN text_generation_router::server: router/src/server.rs:2090: Invalid hostname, defaulting to 0.0.0.0
2024-09-23T12:10:56.485234Z INFO text_generation_router::server: router/src/server.rs:2477: Connected
2024-09-23T12:11:41.494451Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 0 - Suffix 25
2024-09-23T12:11:41.494757Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:41.494856Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:41.494865Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:54.107180Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 4 - Suffix 22
2024-09-23T12:11:54.107340Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:54.107396Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(1)}:clear_cache{batch_id=Some(1)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:54.107406Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:57.098857Z INFO text_generation_router_v3::radix: backends/v3/src/radix.rs:108: Prefix 7 - Suffix 19
2024-09-23T12:11:57.099071Z ERROR batch{batch_size=1}:prefill:prefill{id=2 size=1}:prefill{id=2 size=1}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:57.099140Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(2)}:clear_cache{batch_id=Some(2)}: text_generation_router_v3::client: backends/v3/src/client/mod.rs:54: Server error: error trying to connect: No such file or directory (os error 2)
2024-09-23T12:11:57.099156Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(20), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None }}:generate:generate_stream:schedule:infer:send_error: text_generation_router_v3::backend: backends/v3/src/backend.rs:488: Request failed during generation: Server error: error trying to connect: No such file or directory (os error 2)
```
Great news @alvarobartt!! It seems it was an issue with Llama 3.1; when I tried deploying with Llama-3-8B, it worked immediately.
This is the YAML file I used (attached as txt): tgi-llama3.txt
Great @BeylasanRuzaiqi, happy to help! Do you mind closing the issue if it's already solved? Thanks 🤗
Sure, thanks @alvarobartt!
System Info
I have deployed TGI on an NVIDIA GPU successfully, but when downloading another model from Hugging Face it keeps referring to the model bigscience/bloom-560m. How do I stop it or make another model the default? Also, how do I list the models available for inference?
```
text-generation-launcher --env
2024-09-18T09:23:46.627404Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: db7e043ded45e14ed24188d5a963911c96049618
Docker label: sha-db7e043
nvidia-smi:
Wed Sep 18 09:23:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0              67W / 400W |   8831MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
xpu-smi:
N/A
2024-09-18T09:23:46.627578Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "tgi-server-5f75ff8bcb-mzxnd", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
2024-09-18T09:23:46.627910Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-09-18T09:23:46.627920Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-09-18T09:23:46.627930Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-09-18T09:23:46.627938Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-09-18T09:23:46.628271Z INFO download: text_generation_launcher: Starting check and download process for bigscience/bloom-560m
2024-09-18T09:23:50.306621Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-09-18T09:23:51.133538Z INFO download: text_generation_launcher: Successfully downloaded weights for bigscience/bloom-560m
```
Information
Tasks
Reproduction
`text-generation-launcher --model-id $model`
Expected behavior
The server should refer to the newly launched model (Llama 3).
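Since TGI serves a single model per launcher process, the default only changes when the launcher is restarted with a new `--model-id`; a minimal sketch of verifying and switching, using the `/info` endpoint and port from earlier in this thread:

```bash
# See which model the running router is actually serving
curl -s 10.8.64.158:8080/info

# Relaunch pointing at the desired model; there is no runtime "switch model" endpoint
text-generation-launcher --model-id meta-llama/Meta-Llama-3-8B-Instruct
```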