huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Shard process was signaled to shutdown with signal 4 rank=0 Error: ShardCannotStart #2238

Closed: fhamborg closed this issue 1 month ago

fhamborg commented 1 month ago

System Info

OS: Debian 6.1.85-1 NVIDIA-SMI 550.54.15
Driver Version: 550.54.15
CUDA Version: 12.4 Card: NVIDIA RTX A6000

Reproduction

I just copied the example from the readme.md, i.e.:

model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model

I stored that in a file named test.sh and executed it (bash test.sh).

This gives me the following output:

Unable to find image 'ghcr.io/huggingface/text-generation-inference:2.0.4' locally
2.0.4: Pulling from huggingface/text-generation-inference
aece8493d397: Already exists 
45f7ea5367fe: Already exists 
3d97a47c3c73: Already exists 
12cd4d19752f: Already exists 
da5a484f9d74: Already exists 
4f4fb700ef54: Pull complete 
7d26cf3e2c1d: Pull complete 
16050f987670: Pull complete 
24e4e43c7707: Pull complete 
6002d334ba4d: Pull complete 
0e2fe306ddc0: Pull complete 
ebd769663fc2: Pull complete 
b361fafefde0: Pull complete 
a14cfc4df75e: Pull complete 
3bacf1fc037c: Pull complete 
5c120f02d065: Pull complete 
2187b9e17989: Pull complete 
8479f45b54b0: Pull complete 
8c36075d2142: Pull complete 
bd4c88b0f1b8: Pull complete 
eaf68117c686: Pull complete 
86fd5f37c031: Pull complete 
5985ba83bade: Pull complete 
87e047c024f8: Pull complete 
87ed49fc4b0b: Pull complete 
391330ef1138: Pull complete 
161407bbdce6: Pull complete 
20a19c172ec6: Pull complete 
dc6d45820d12: Pull complete 
Digest: sha256:072675d536f695ac5e4ace15c594742a12dae047bcb0eacfce934665141e6585
Status: Downloaded newer image for ghcr.io/huggingface/text-generation-inference:2.0.4
2024-07-16T13:13:33.251940Z  INFO text_generation_launcher: Args {
    model_id: "HuggingFaceH4/zephyr-7b-beta",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: None,
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "ce142d8298a7",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
2024-07-16T13:13:33.252051Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
2024-07-16T13:13:33.593053Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-07-16T13:13:33.593068Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-07-16T13:13:33.593070Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-07-16T13:13:33.593071Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-07-16T13:13:33.593073Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-07-16T13:13:33.593160Z  INFO download: text_generation_launcher: Starting download process.
2024-07-16T13:13:35.915930Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-07-16T13:13:36.396024Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-07-16T13:13:36.396190Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-16T13:13:39.099398Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
 rank=0
2024-07-16T13:13:39.099422Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 4 rank=0
Error: ShardCannotStart
2024-07-16T13:13:39.198233Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-16T13:13:39.198252Z  INFO text_generation_launcher: Shutting down shards

I also tried more recent versions of the Docker image, e.g., ghcr.io/huggingface/text-generation-inference:2.1.1, but the Error: ShardCannotStart always occurs.

Expected behavior

It should run as described in the README, i.e., not produce this error.

ErikKaum commented 1 month ago

Hi @fhamborg 👋

Thanks for reporting. That's an interesting one; unfortunately I'm not able to reproduce it on my machine. It's also odd that there's no stderr output from the shard (Shard complete standard error output: is empty).

Do you @OlivierDehaene have any hunch what could cause this?

fhamborg commented 1 month ago

Kindly let me know in case there's any other information that I could provide :)

ErikKaum commented 1 month ago

Thank you! :)

Any information on what's going on on the GPU side while this crashes would be helpful. Something along the lines of this:

nvidia-smi --query-gpu=name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.gpucurrent,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,reset_status.reset_required,reset_status.drain_and_reset_recommended,compute_cap,ecc.errors.corrected.volatile.total,mig.mode.current,power.draw.instant,power.limit --format=csv
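
If the crash happens quickly, it can help to sample that query in a loop while reproducing the issue, for example (a sketch; adjust the fields and interval as needed):

while true; do
  nvidia-smi --query-gpu=name,memory.used,memory.free,utilization.gpu,power.draw.instant \
             --format=csv,noheader >> gpu_samples.csv   # roughly 5 samples per second
  sleep 0.2
done
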
fhamborg commented 1 month ago

Sure, here's the output of your command, sampled roughly five times per second from just before starting the example script from the readme.md until it exits. As you can see, the GPU is never actually used, which is in line with what I see in nvtop. Note that I verified that CUDA works on the machine, both when running, for example, Hugging Face transformers with PyTorch directly on the machine and in other Docker containers.

NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.51 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.60 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.48 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.47 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.34 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.41 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.11 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.04 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48665 MiB, 3 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.37 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48665 MiB, 3 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.26 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48665 MiB, 3 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.21 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.49 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.27 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48665 MiB, 3 MiB, No, [N/A], 8.6, [N/A], [N/A], 25.20 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48665 MiB, 3 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.26 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48665 MiB, 3 MiB, No, [N/A], 8.6, [N/A], [N/A], 20.98 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.25 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 26.20 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.21 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.30 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.33 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.28 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.13 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.15 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.17 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.27 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.36 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.14 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.28 W, 300.00 W
NVIDIA RTX A6000, 00000000:01:00.0, 550.54.15, P8, 4, 1, 44, 0 %, 0 %, 49140 MiB, 48668 MiB, 1 MiB, No, [N/A], 8.6, [N/A], [N/A], 18.55 W, 300.00 W
ErikKaum commented 1 month ago

Thanks for the added info 👍

This is definitely not an easy one to debug. I don't have an RTX A6000 at my disposal at the moment, but I think replicating your environment would be the best move. Unfortunately, bandwidth-wise I won't be able to look at this for a few weeks. My apologies.

If, however, you find anything that might point us in the right direction, feel free to update here.

fhamborg commented 1 month ago

Sure! Do you have any idea how I could get more logging information or other helpful information out of the docker container?

Hugoch commented 1 month ago

@fhamborg, you can try running with more verbose logging:

model=HuggingFaceH4/zephyr-7b-beta
# share a volume with the Docker container to avoid downloading weights every run
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data -e LOG_LEVEL=debug,text_generation_router=debug ghcr.io/huggingface/text-generation-inference:2.0.4 --model-id $model
fhamborg commented 1 month ago

Thanks! I don't think there's really any new relevant information, though:

2024-07-22T08:04:11.562343Z  INFO text_generation_launcher: Args {
    model_id: "VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: None,
    quantize: Some(
        BitsandbytesNF4,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: None,
    max_total_tokens: None,
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: None,
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "808aa2c597ef",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
2024-07-22T08:04:11.562407Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
2024-07-22T08:04:11.595793Z DEBUG ureq::stream: connecting to huggingface.co:443 at 18.154.63.62:443    
2024-07-22T08:04:11.621153Z DEBUG rustls::client::hs: No cached session for DnsName("huggingface.co")    
2024-07-22T08:04:11.621233Z DEBUG rustls::client::hs: Not resuming any session    
2024-07-22T08:04:11.636117Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256    
2024-07-22T08:04:11.636135Z DEBUG rustls::client::tls13: Not resuming    
2024-07-22T08:04:11.636198Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: [ServerNameAck]    
2024-07-22T08:04:11.636204Z DEBUG rustls::client::hs: ALPN protocol is None    
2024-07-22T08:04:11.637402Z DEBUG ureq::stream: created stream: Stream(RustlsStream)    
2024-07-22T08:04:11.637410Z DEBUG ureq::unit: sending request GET https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct/resolve/main/config.json    
2024-07-22T08:04:11.637415Z DEBUG ureq::unit: writing prelude: GET /VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct/resolve/main/config.json HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; hf-hub/0.3.2; rust/unknown
Range: bytes=0-0    
2024-07-22T08:04:11.946450Z DEBUG ureq::response: Body entirely buffered (length: 1)    
2024-07-22T08:04:11.946468Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)    
2024-07-22T08:04:11.946474Z DEBUG ureq::unit: response 206 to GET https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct/resolve/main/config.json    
2024-07-22T08:04:11.947251Z DEBUG ureq::stream: connecting to huggingface.co:443 at 18.154.63.57:443    
2024-07-22T08:04:11.959248Z DEBUG rustls::client::hs: Resuming session    
2024-07-22T08:04:11.972000Z DEBUG rustls::client::hs: Using ciphersuite TLS13_AES_128_GCM_SHA256    
2024-07-22T08:04:11.972011Z DEBUG rustls::client::tls13: Resuming using PSK    
2024-07-22T08:04:11.972069Z DEBUG rustls::client::tls13: TLS1.3 encrypted extensions: []    
2024-07-22T08:04:11.972076Z DEBUG rustls::client::hs: ALPN protocol is None    
2024-07-22T08:04:11.972112Z DEBUG ureq::stream: created stream: Stream(RustlsStream)    
2024-07-22T08:04:11.972119Z DEBUG ureq::unit: sending request GET https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct/resolve/main/config.json    
2024-07-22T08:04:11.972123Z DEBUG ureq::unit: writing prelude: GET /VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct/resolve/main/config.json HTTP/1.1
Host: huggingface.co
Accept: */*
User-Agent: unkown/None; hf-hub/0.3.2; rust/unknown
accept-encoding: gzip    
2024-07-22T08:04:12.278568Z DEBUG ureq::response: Body entirely buffered (length: 719)    
2024-07-22T08:04:12.278581Z DEBUG ureq::pool: adding stream to pool: https|huggingface.co|443 -> Stream(RustlsStream)    
2024-07-22T08:04:12.278585Z DEBUG ureq::unit: response 200 to GET https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct/resolve/main/config.json    
2024-07-22T08:04:12.278762Z DEBUG ureq::stream: dropping stream: Stream(RustlsStream)    
2024-07-22T08:04:12.278795Z DEBUG ureq::stream: dropping stream: Stream(RustlsStream)    
2024-07-22T08:04:12.278826Z  INFO text_generation_launcher: Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`.
2024-07-22T08:04:12.278829Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-07-22T08:04:12.278831Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-07-22T08:04:12.278832Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-07-22T08:04:12.278834Z  INFO text_generation_launcher: Bitsandbytes doesn't work with cuda graphs, deactivating them
2024-07-22T08:04:12.278918Z  INFO download: text_generation_launcher: Starting check and download process for VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct
2024-07-22T08:04:13.298423Z  INFO text_generation_launcher: Detected system cuda
2024-07-22T08:04:14.525622Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-07-22T08:04:14.983473Z  INFO download: text_generation_launcher: Successfully downloaded weights for VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct
2024-07-22T08:04:14.983636Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-07-22T08:04:16.122739Z  INFO text_generation_launcher: Detected system cuda
2024-07-22T08:04:17.216548Z DEBUG text_generation_launcher: WARNING 07-22 08:04:17 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.
2024-07-22T08:04:17.686637Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
 rank=0
2024-07-22T08:04:17.686666Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 4 rank=0
Error: ShardCannotStart
2024-07-22T08:04:17.786038Z ERROR text_generation_launcher: Shard 0 failed to start
2024-07-22T08:04:17.786050Z  INFO text_generation_launcher: Shutting down shards
fhamborg commented 1 month ago

The only thing that comes to my mind is

2024-07-22T08:04:17.216548Z DEBUG text_generation_launcher: WARNING 07-22 08:04:17 ray_utils.py:46] Failed to import Ray with ModuleNotFoundError("No module named 'ray'"). For distributed inference, please install Ray with `pip install ray`.

but I guess that shouldn't be an issue here since I don't want to perform distributed inference.

Hugoch commented 1 month ago

Signal 4 (SIGILL) means TGI tried to execute an illegal instruction. Are you running in a VM? Can you check your Docker NVIDIA install and that AVX instructions are enabled, as in https://github.com/huggingface/text-generation-inference/issues/908?
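
For example, from inside the VM (assuming a Linux guest), something along these lines should confirm both what signal 4 is and whether AVX is visible to the guest at all:

kill -l 4                                           # prints "ILL": signal 4 is SIGILL (illegal instruction)
grep -wo -e avx -e avx2 /proc/cpuinfo | sort -u     # empty output means the vCPU exposes no AVX/AVX2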

fhamborg commented 1 month ago

Thanks for your reply. Yes, I'm running this in a VM. Note, though, that I can run other Hugging Face transformers code directly from Python (e.g., in a conda environment) on that VM.

Here's the output of the Docker container when executed with --env:

(base) felix@heavy-gpu-02:~/dev/anychat/helpers$ docker run --gpus all ghcr.io/huggingface/text-generation-inference:latest --env
2024-07-22T12:49:51.199087Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.79.0
Commit sha: da82c63a4ff9c6b8f3d0901cb955c8db04c9a492
Docker label: sha-da82c63
nvidia-smi:
Mon Jul 22 12:49:51 2024       
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA RTX A6000               On  |   00000000:01:00.0 Off |                  Off |
   | 30%   42C    P8             19W /  300W |       1MiB /  49140MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   |  No running processes found                                                             |
   +-----------------------------------------------------------------------------------------+
xpu-smi:
N/A

And here's the output of nvcc:

(base) felix@heavy-gpu-02:~/dev/anychat/helpers$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
fhamborg commented 1 month ago

Okay, apparently the missing AVX instructions were the issue here. Before changing anything in the VM manager, I ran:

cat /proc/cpuinfo | grep -i avx

which yielded nothing, whereas after enabling AVX instructions I got this:

flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip rdpid arch_capabilities

And now the Docker image is running nicely :-) I'm sorry for the confusion; I had wrongly assumed it wouldn't be an issue related to the VM, because I was able to run other torch/transformers code in the VM both directly and in Docker.
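
For reference, in case it helps others: how the AVX flags reach the guest depends on the hypervisor. On a QEMU/KVM/libvirt setup, for example, switching the guest to host CPU passthrough is one way to do it (a sketch; your VM manager may call this something else):

virsh edit <vm-name>            # libvirt: set the CPU element to <cpu mode='host-passthrough'/>
# or, when launching QEMU directly, pass the host CPU model:
# qemu-system-x86_64 -cpu host ...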

Just out of curiosity, do you have an idea why this particular Docker image requires AVX instructions to be available?

Anyway, thanks a lot for the help!

ErikKaum commented 1 month ago

No worries, glad that the issue got resolved @fhamborg 👍

I can't say directly, but we're also doing some rework on separating the backend from the frontend, so in the future there would be, e.g., different Docker images depending on which backend you use (CPU, CUDA, etc.). The AVX requirement might come from something unexpected that isn't even necessarily used in your case.