egeucak closed this issue 3 months ago.
@OlivierDehaene
I followed the exact approach with Falcon7B, and it works fine on a single GPU
Hi, just wanted to follow up on this because I believe I'm experiencing a similar issue: same error, running inference on 2x A100 80GB on RunPod, following this tutorial: https://www.youtube.com/watch?v=FhY8rx_X97k
I got this error trying to run tiiuae/falcon-40b-instruct on two A100 40GB GPUs. I ran it with these options:
singularity run --nv -B $volume:/data ./text-generation-inference_0.8.sif --model-id $model --sharded $sharded --port 8080 &
(singularity is similar to docker, but we don't have docker installed)
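For anyone running the Docker image directly, the equivalent invocation should look roughly like this (a sketch based on the standard TGI launch command; the image tag, --shm-size, and port mapping may need adjusting for your setup):
docker run --gpus all --shm-size 1g -p 8080:8080 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:0.8 \
    --model-id $model --sharded $sharded --port 8080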
It turns out it needed more GPUs: when I went from 2 to 4 A100 40GB cards it started fine, and the curl command I used to test it returned a result. It might work with 3 GPUs; I haven't tested that yet.
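For rough intuition on why two 40 GB cards are not enough (approximate numbers, assuming float16 weights and ignoring KV cache and CUDA overhead):
40B parameters x 2 bytes (fp16) ≈ 80 GB of weights alone
2 x 40 GB = 80 GB total (no headroom)  vs  4 x 40 GB = 160 GB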
I'm not using Kubernetes. I'm on RHEL 8.
I was getting a similar issue; after I rolled back the Docker image to an older version, the model started working.
Image where it's working: ghcr.io/huggingface/text-generation-inference@sha256:f4e09f01c1dd38bc2e9c9a66e9de1c2e3dc9912c2781440f7ac1eb70f6b1479e
Model: tiiuae/falcon-7b-instruct NUM_SHARD: 1
No quantization. Hardware: 1xA100 20Gi
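In case it helps someone, pinning that digest looks roughly like this (a sketch; $volume, ports, and GPU flags depend on your setup):
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference@sha256:f4e09f01c1dd38bc2e9c9a66e9de1c2e3dc9912c2781440f7ac1eb70f6b1479e \
    --model-id tiiuae/falcon-7b-instruct --num-shard 1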
I have a similar error since v1.1.0 with one A100 80GB GPU when I start TGI with the following environment variables:
- name: MODEL_ID
  value: tiiuae/falcon-7b-instruct
- name: QUANTIZE
  value: eetq
If I set the quantization to bitsandbytes instead, it works fine. The same thing happens with the larger tiiuae/falcon-40b-instruct.
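For reference, the working configuration is the same env block with only the quantization value swapped (sketch of the relevant container env entries):
- name: MODEL_ID
  value: tiiuae/falcon-7b-instruct
- name: QUANTIZE
  value: bitsandbytes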
Error log:
{"timestamp":"2023-09-28T15:26:38.040585Z","level":"ERROR","fields":{"message":"Shard complete
standard error output:\n\nYou are using a model of type RefinedWebModel to instantiate a model
of type . This is not supported for all configurations of models and can yield errors.\nTraceback [...]
@pdeubel, this should only be a warning. Can you provide the whole stacktrace?
Ah yes, sorry, I actually did not look at the whole stacktrace; it seems like eetq is not installed. I run TGI on Kubernetes, i.e. I am using your Docker image. Perhaps something is missing regarding the installation of eetq?
Whole stacktrace:
{"timestamp":"2023-09-28T15:26:24.924474Z","level":"INFO","fields":{"message":"Args { model_id: \"tiiuae/falcon-7b-instruct\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Eetq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "...", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:24.924584Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-09-28T15:26:27.656490Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:28.028217Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-09-28T15:26:28.028503Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-09-28T15:26:37.383475Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n server.serve(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 234, in get_model\n return FlashRWSharded(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 567, in __init__\n [\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 568, in <listcomp>\n FlashRWLayer(layer_id, config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 396, in __init__\n self.self_attention = FlashRWAttention(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 147, in __init__\n self.query_key_value = TensorParallelColumnLinear.load(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 436, in load\n return cls.load_multi(config, [prefix], weights, bias, dim=0)\n File 
\"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 449, in load_multi\n linear = get_linear(weight, bias, config.quantize)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in get_linear\n raise ImportError(\nImportError: Please install EETQ from https://github.com/NetEase-FuXi/EETQ\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:38.040585Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 234, in get_model\n return FlashRWSharded(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 567, in __init__\n [\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 568, in <listcomp>\n FlashRWLayer(layer_id, config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 396, in __init__\n self.self_attention = FlashRWAttention(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 147, in __init__\n self.query_key_value = TensorParallelColumnLinear.load(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 436, in load\n return cls.load_multi(config, [prefix], weights, bias, dim=0)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 449, in load_multi\n linear = get_linear(weight, bias, config.quantize)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in get_linear\n raise ImportError(\n\nImportError: Please install EETQ from https://github.com/NetEase-FuXi/EETQ\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2023-09-28T15:26:38.137802Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:38.137840Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
EETQ is missing from the docker image, my bad on this: https://github.com/huggingface/text-generation-inference/pull/1081
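Until an image with the fix is published, a rough workaround sketch is to extend the image and build EETQ from source yourself (assumptions: pip can build EETQ from its Git repository, and the runtime image ships the CUDA toolchain needed to compile the extension, which may not be the case):
FROM ghcr.io/huggingface/text-generation-inference:1.1.0
RUN pip install git+https://github.com/NetEase-FuXi/EETQ.git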
System Info
Running huggingface/text-generation-inference:0.8.2 on a kubernetes cluster.
Reproduction
Steps to reproduce:
Download the weights locally to /data:
HF_HUB_ENABLE_HF_TRANSFER=1 text-generation-server download-weights tiiuae/falcon-40b
I set the following environment variables. After this, I get the following logs from the pod:
Expected behavior
I expect the pod to start successfully.
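For completeness, a basic request against the /generate endpoint (the kind of curl smoke test mentioned earlier in the thread) looks roughly like this once the pod is up (a sketch; host and port depend on how the Service exposes the pod, and the prompt is arbitrary):
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'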