huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Using a model of type RefinedWeb to instantiate a model of type . #450

Closed egeucak closed 3 months ago

egeucak commented 1 year ago

System Info

Running huggingface/text-generation-inference:0.8.2 on a kubernetes cluster.

2023-06-13T15:28:49.039767Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Tue Jun 13 15:28:48 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100-PCI...  Off  | 00000000:3B:00.0 Off |                    0 |
   | N/A   36C    P0    37W / 250W |      0MiB / 40960MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
   |   1  NVIDIA A100-PCI...  Off  | 00000000:D8:00.0 Off |                    0 |
   | N/A   39C    P0    37W / 250W |      0MiB / 40960MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+

   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
2023-06-13T15:28:49.039864Z  INFO text_generation_launcher: Args { model_id: "/data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }

Reproduction

Steps to reproduce:

  1. Run HF_HUB_ENABLE_HF_TRANSFER=1 text-generation-server download-weights tiiuae/falcon-40b locally
  2. Move the downloaded cache to a PVC on a tightly sealed (network-restricted) Kubernetes cluster
  3. Move the rest of the tiiuae/falcon-40b repository contents (everything apart from the weights) to that PVC as well
  4. Contents of the folder: (screenshot attached in the original issue)
  5. Create the Kubernetes resources for running the image, mount the volume mentioned above at /data, and set the following environment variables (a rough docker-run equivalent is sketched after this list):

     - name: MODEL_ID
       value: >-
         /data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/
     - name: QUANTIZE
       value: bitsandbytes
     - name: DISABLE_CUSTOM_KERNELS
       value: 'true'
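
For reference, here is a minimal sketch of an equivalent docker run for sanity-checking the same configuration outside the cluster. It follows the standard TGI Docker invocation; the host path is a placeholder, and none of this command comes from the original report:

    # Rough docker-run equivalent of the Kubernetes setup above;
    # /path/to/pvc-contents is a placeholder for wherever the downloaded snapshot lives on the host.
    docker run --gpus all --shm-size 1g -p 8080:80 \
      -v /path/to/pvc-contents:/data \
      -e DISABLE_CUSTOM_KERNELS=true \
      ghcr.io/huggingface/text-generation-inference:0.8.2 \
      --model-id /data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/ \
      --quantize bitsandbytes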

After this, I get the following logs from the pod:

{"timestamp":"2023-06-13T14:39:19.320908Z","level":"INFO","fields":{"message":"Args { model_id: \"/data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/\", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: true, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:19.320981Z","level":"INFO","fields":{"message":"Sharding model on 2 processes"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:19.321270Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:21.469258Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-06-13T14:39:22.125552Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:22.126276Z","level":"INFO","fields":{"message":"Starting shard 1"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:22.126295Z","level":"INFO","fields":{"message":"Starting shard 0"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:32.145306Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:32.153372Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:42.168074Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:42.169130Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:52.193505Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:52.217776Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:02.266260Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:02.276230Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:12.326939Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:12.349062Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:22.337891Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:32.348584Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:42.359697Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:52.370871Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:02.381893Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:12.392017Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:22.126054Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nYou are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:22.126114Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:22.331289Z","level":"INFO","fields":{"message":"Shard 1 terminated"},"target":"text_generation_launcher"}
Error: ShardCannotStart

Expected behavior

I expect the pod to start successfully.

egeucak commented 1 year ago

@OlivierDehaene

#448

egeucak commented 1 year ago

I followed the exact same approach with Falcon-7B, and it works fine on a single GPU.

advait-patel-17 commented 1 year ago

Hi, just wanted to follow up on this because I believe I'm experiencing a similar issue: same error, running inference on 2x 80GB A100s on RunPod, following this tutorial: https://www.youtube.com/watch?v=FhY8rx_X97k

cmann50 commented 1 year ago

I got this error trying to run tiiuae/falcon-40b-instruct on two A100 40GB GPUs. I ran it with these options:

singularity run --nv -B $volume:/data ./text-generation-inference_0.8.sif --model-id $model --sharded $sharded --port 8080 &

(Singularity is similar to Docker, but we don't have Docker installed.)

It turns out it needed more GPUs. When I increased from 2 to 4 A100 40GB GPUs, it started fine and the test curl command returned a result. It might work with 3 GPUs; I haven't tested that yet.

I'm not using Kubernetes. I'm on RHEL 8.
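
For readers who do use Docker, a hedged equivalent of the working four-GPU launch might look like the following. The image tag and flags are assumptions based on the rest of the thread, not copied from this comment:

    # Hedged Docker equivalent of the Singularity command above,
    # explicitly sharding across the four A100 40GB GPUs that made falcon-40b-instruct start.
    docker run --gpus all --shm-size 1g -p 8080:80 \
      -v $volume:/data \
      ghcr.io/huggingface/text-generation-inference:0.8.2 \
      --model-id tiiuae/falcon-40b-instruct \
      --num-shard 4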

rahuldshetty commented 1 year ago

I was getting a similar issue; then I rolled back the Docker image to an older version and the model started working.

Image where it's working: ghcr.io/huggingface/text-generation-inference@sha256:f4e09f01c1dd38bc2e9c9a66e9de1c2e3dc9912c2781440f7ac1eb70f6b1479e

Model: tiiuae/falcon-7b-instruct NUM_SHARD: 1

No quantization. Hardware: 1xA100 20Gi
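
If that workaround helps, a sketch of pinning TGI to this digest might look like the following. The digest is taken from the comment above; the rest of the command is an assumed, standard TGI invocation:

    # Pin TGI to the older image digest that worked for this commenter: single shard, no quantization.
    docker run --gpus all --shm-size 1g -p 8080:80 \
      -v $volume:/data \
      ghcr.io/huggingface/text-generation-inference@sha256:f4e09f01c1dd38bc2e9c9a66e9de1c2e3dc9912c2781440f7ac1eb70f6b1479e \
      --model-id tiiuae/falcon-7b-instruct \
      --num-shard 1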

pdeubel commented 1 year ago

I have a similar error since v1.1.0 with one A100 80GB GPU when I start TGI with the following environment variables:

- name: MODEL_ID
  value: tiiuae/falcon-7b-instruct
- name: QUANTIZE
  value: eetq

If I set the quantization to bitsandbytes, it works fine. This also happens with the larger tiiuae/falcon-40b-instruct.
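
For completeness, a hedged sketch of the configuration that does work on v1.1.0, swapping eetq for bitsandbytes. Only the model and quantization values come from this comment; the docker invocation itself is assumed:

    # Working configuration on v1.1.0: same model, but bitsandbytes instead of eetq.
    docker run --gpus all --shm-size 1g -p 8080:80 \
      -v $volume:/data \
      ghcr.io/huggingface/text-generation-inference:1.1.0 \
      --model-id tiiuae/falcon-7b-instruct \
      --quantize bitsandbytes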

Error log:

{"timestamp":"2023-09-28T15:26:38.040585Z","level":"ERROR","fields":{"message":"Shard complete
standard error output:\n\nYou are using a model of type RefinedWebModel to instantiate a model
of type . This is not supported for all configurations of models and can yield errors.\nTraceback [...]
OlivierDehaene commented 1 year ago

@pdeubel, this should only be a warning. Can you provide the whole stacktrace?

pdeubel commented 1 year ago

Ah yes, sorry, I actually did not look at the whole stack trace; it seems like eetq is not installed. I run TGI on Kubernetes, i.e. I am using your Docker image. Perhaps something is missing regarding the installation of eetq?

Whole stacktrace:

{"timestamp":"2023-09-28T15:26:24.924474Z","level":"INFO","fields":{"message":"Args { model_id: \"tiiuae/falcon-7b-instruct\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Eetq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "...", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:24.924584Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-09-28T15:26:27.656490Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:28.028217Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-09-28T15:26:28.028503Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-09-28T15:26:37.383475Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n    return get_command(self)(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1157, in __call__\n    return self.main(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n    return _main(\n  File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n    rv = self.invoke(ctx)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1688, in invoke\n    return _process_result(sub_ctx.command.invoke(sub_ctx))\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1434, in invoke\n    return ctx.invoke(self.callback, **ctx.params)\n  File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 783, in invoke\n    return __callback(*args, **kwargs)\n  File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n    return callback(**use_params)  # type: ignore\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n    server.serve(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n    asyncio.run(\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n    self.run_forever()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n    self._run_once()\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n    handle._run()\n  File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n    self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n    model = get_model(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 234, in get_model\n    return FlashRWSharded(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 67, in __init__\n    model = FlashRWForCausalLM(config, weights)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 625, in __init__\n    self.transformer = FlashRWModel(config, weights)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 567, in __init__\n    [\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 568, in <listcomp>\n    FlashRWLayer(layer_id, config, weights)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 396, in __init__\n    self.self_attention = FlashRWAttention(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 147, in __init__\n    self.query_key_value = TensorParallelColumnLinear.load(\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 436, in load\n    
return cls.load_multi(config, [prefix], weights, bias, dim=0)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 449, in load_multi\n    linear = get_linear(weight, bias, config.quantize)\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in get_linear\n    raise ImportError(\nImportError: Please install EETQ from https://github.com/NetEase-FuXi/EETQ\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:38.040585Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n  File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n    sys.exit(app())\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n    server.serve(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n    asyncio.run(\n\n  File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n    return loop.run_until_complete(main)\n\n  File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n    return future.result()\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n    model = get_model(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 234, in get_model\n    return FlashRWSharded(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 67, in __init__\n    model = FlashRWForCausalLM(config, weights)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 625, in __init__\n    self.transformer = FlashRWModel(config, weights)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 567, in __init__\n    [\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 568, in <listcomp>\n    FlashRWLayer(layer_id, config, weights)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 396, in __init__\n    self.self_attention = FlashRWAttention(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 147, in __init__\n    self.query_key_value = TensorParallelColumnLinear.load(\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 436, in load\n    return cls.load_multi(config, [prefix], weights, bias, dim=0)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 449, in load_multi\n    linear = get_linear(weight, bias, config.quantize)\n\n  File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in get_linear\n    raise ImportError(\n\nImportError: Please install EETQ from https://github.com/NetEase-FuXi/EETQ\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2023-09-28T15:26:38.137802Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:38.137840Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Narsil commented 1 year ago

EETQ is missing from the docker image, my bad on this: https://github.com/huggingface/text-generation-inference/pull/1081

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.