huggingface/text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Error loading Qwen2-72B-Instruct with EETQ #2126

Closed: michaelthreet closed this issue 1 month ago

michaelthreet commented 3 months ago

System Info

OS version: Ubuntu 22.04
Model being used: Qwen/Qwen2-72B-Instruct
Hardware being used: 4x 40GB A100
Deployment specificities: running via Docker using the latest tag as of 2024-06-26


Reproduction

Command was:

volume=/data/hf_cache
tgi_version="latest"
model_id="Qwen/Qwen2-72B-Instruct"
num_shard="4"
max_input_length=32767
max_batch_prefill_tokens=$max_input_length
max_total_tokens=32768

docker run \
    -d \
    --name tgi \
    --gpus all \
    --shm-size 1g \
    -p 3000:3001 \
    -v $volume:/data \
    --restart=always harbor.aip.mitre.org/huggingface/text-generation-inference:$tgi_version \
    --model-id $model_id \
    --huggingface-hub-cache /data \
    --num-shard $num_shard \
    --max-input-length $max_input_length \
    --max-batch-prefill-tokens $max_batch_prefill_tokens \
    --max-total-tokens $max_total_tokens \
    --quantize eetq
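
As a side note for anyone reproducing this: once the server is up, a minimal smoke test could look like the sketch below (standard-library Python only; the /generate request shape follows TGI's documented API). One caveat worth flagging: the Args dump further down shows the container listening on port 80, while the command above maps -p 3000:3001, so the mapping would likely need to be -p 3000:80 for such a request to get through.

import json
import urllib.request

# Hypothetical request once the server reports ready; host port 3000 comes
# from the docker run command above (assuming the mapping targets port 80).
req = urllib.request.Request(
    "http://localhost:3000/generate",
    data=json.dumps(
        {"inputs": "Hello", "parameters": {"max_new_tokens": 32}}
    ).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))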

Expected behavior

A running LLM server. Instead, the output I get is below:

2024-06-26T19:46:48.038583Z  INFO text_generation_launcher: Args {
    model_id: "Qwen/Qwen2-72B-Instruct",
    revision: None,
    validation_workers: 2,
    sharded: None,
    num_shard: Some(
        4,
    ),
    quantize: Some(
        Eetq,
    ),
    speculate: None,
    dtype: None,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: None,
    max_input_length: Some(
        32767,
    ),
    max_total_tokens: Some(
        32768,
    ),
    waiting_served_ratio: 0.3,
    max_batch_prefill_tokens: Some(
        32767,
    ),
    max_batch_total_tokens: None,
    max_waiting_tokens: 20,
    max_batch_size: None,
    cuda_graphs: None,
    hostname: "290a3e43304e",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some(
        "/data",
    ),
    weights_cache_override: None,
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    rope_scaling: None,
    rope_factor: None,
    json_output: false,
    otlp_endpoint: None,
    otlp_service_name: "text-generation-inference.router",
    cors_allow_origin: [],
    watermark_gamma: None,
    watermark_delta: None,
    ngrok: false,
    ngrok_authtoken: None,
    ngrok_edge: None,
    tokenizer_config_path: None,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
    lora_adapters: None,
}
2024-06-26T19:46:48.038727Z  INFO hf_hub: Token file not found "/root/.cache/huggingface/token"    
2024-06-26T19:46:48.067446Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-26T19:46:48.067460Z  INFO text_generation_launcher: Sharding model on 4 processes
2024-06-26T19:46:48.067564Z  INFO download: text_generation_launcher: Starting download process.
2024-06-26T19:46:49.913089Z  INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:51.763885Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-26T19:46:52.576390Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-26T19:46:52.576816Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-26T19:46:52.576827Z  INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-06-26T19:46:52.576829Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-06-26T19:46:52.576870Z  INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-06-26T19:46:54.847550Z  INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:54.864114Z  INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:54.877120Z  INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:54.883280Z  INFO text_generation_launcher: Detected system cuda
2024-06-26T19:47:02.590571Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-26T19:47:02.591274Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-26T19:47:02.592424Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-06-26T19:47:02.593881Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
...
...
...
2024-06-26T19:54:43.498217Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-26T19:54:45.986445Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-06-26T19:54:45.998914Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-26T19:54:46.063442Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-06-26T19:54:51.284297Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-06-26T19:54:51.294666Z  INFO shard-manager: text_generation_launcher: Shard ready in 478.716316913s rank=2
2024-06-26T19:54:51.860967Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-26T19:54:51.861355Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-06-26T19:54:51.862223Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-06-26T19:54:51.872910Z  INFO shard-manager: text_generation_launcher: Shard ready in 479.294074083s rank=3
2024-06-26T19:54:51.908038Z  INFO shard-manager: text_generation_launcher: Shard ready in 479.32951286s rank=0
2024-06-26T19:54:51.909521Z  INFO shard-manager: text_generation_launcher: Shard ready in 479.330793679s rank=1
2024-06-26T19:54:51.978363Z  INFO text_generation_launcher: Starting Webserver
2024-06-26T19:54:52.051402Z  INFO text_generation_router: router/src/main.rs:199: Using the Hugging Face API
2024-06-26T19:54:52.051440Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"    
2024-06-26T19:54:52.333418Z  INFO text_generation_router: router/src/main.rs:453: Serving revision 1af63c698f59c4235668ec9c1395468cb7cd7e79 of model Qwen/Qwen2-72B-Instruct
2024-06-26T19:54:52.568054Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '151643' but was given ID 'None'    
2024-06-26T19:54:52.568081Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_start|>' was expected to have ID '151644' but was given ID 'None'    
2024-06-26T19:54:52.568084Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_end|>' was expected to have ID '151645' but was given ID 'None'    
2024-06-26T19:54:52.573774Z  INFO text_generation_router: router/src/main.rs:307: Using config Some(Qwen2)
2024-06-26T19:54:52.573792Z  WARN text_generation_router: router/src/main.rs:334: Invalid hostname, defaulting to 0.0.0.0
2024-06-26T19:54:52.578282Z  INFO text_generation_router::server: router/src/server.rs:1554: Warming up model
2024-06-26T19:54:54.144694Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 961, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1227, in generate_token
    out, speculative_logits = self.forward(batch, adapter_data)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1152, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in forward
    mlp_output = self.mlp(normed_attn_res_output)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 202, in forward
    return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 218, in forward
    out = super().forward(input)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 37, in forward
    return self.linear.forward(x)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/eetq.py", line 23, in forward
    output = w8_a16_gemm(input, self.weight, self.scale)
RuntimeError: [FT Error] Heurisitc failed to find a valid config.
2024-06-26T19:54:54.144776Z ERROR text_generation_launcher: Method Warmup encountered an error.
...
(identical Warmup traceback repeated for each of the three remaining shards)
2024-06-26T19:54:54.358640Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-06-26T19:54:54.379404Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-06-26T19:54:54.382335Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-06-26T19:54:54.388277Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-06-26T19:54:54.536419Z ERROR text_generation_launcher: Webserver Crashed
2024-06-26T19:54:54.536440Z  INFO text_generation_launcher: Shutting down shards
2024-06-26T19:54:54.577779Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-06-26T19:54:54.579074Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-06-26T19:54:54.600131Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-06-26T19:54:54.601631Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-06-26T19:54:54.612581Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-06-26T19:54:54.614073Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-06-26T19:54:54.615939Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-06-26T19:54:54.617687Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-06-26T19:54:55.380226Z  INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-06-26T19:54:55.503083Z  INFO shard-manager: text_generation_launcher: shard terminated rank=2
2024-06-26T19:54:55.515224Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
2024-06-26T19:54:55.619102Z  INFO shard-manager: text_generation_launcher: shard terminated rank=1
LysandreJik commented 3 months ago

Thanks for your issue @michaelthreet! Interestingly enough, the error you get, RuntimeError: [FT Error] Heurisitc failed to find a valid config., seems to come from TRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp#L372
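
For anyone trying to isolate this, here is a minimal sketch that exercises the same two EETQ calls the traceback ends in (quant_weights and w8_a16_gemm, as used in TGI's layers/eetq.py), outside of TGI. The EETQ module name mirrors TGI's import, and the tensor shapes are assumptions based on Qwen2-72B's config (hidden_size 8192, intermediate_size 29568): with 4-way tensor parallelism, the row-parallel down_proj gets a per-shard reduction dimension of 29568 / 4 = 7392, which is not a multiple of 64 and might be what the CUTLASS heuristic rejects, though that is speculation.

import torch
from EETQ import quant_weights, w8_a16_gemm  # same imports as TGI's layers/eetq.py

# Assumed per-shard down_proj shape for Qwen2-72B on 4 GPUs: the row-parallel
# layer splits intermediate_size (29568) across shards, so k = 7392, n = 8192.
k, n = 29568 // 4, 8192

weight = torch.randn(n, k, dtype=torch.float16)
# Mirror TGI's EETQLinear constructor: transpose, make contiguous on CPU,
# quantize to int8 with per-channel scales, then move the results to the GPU.
qweight, scale = quant_weights(torch.t(weight).contiguous().cpu(), torch.int8, False)
qweight, scale = qweight.cuda(), scale.cuda()

x = torch.randn(1, k, dtype=torch.float16, device="cuda")
# This is the call that raises "[FT Error] Heurisitc failed to find a valid
# config." in the logs above; if the shape theory holds, padding k up to a
# multiple of 64 should make it succeed.
out = w8_a16_gemm(x, qweight, scale)
print(out.shape)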

Maybe cc @Narsil or @mfuntowicz, in case you've seen this error before.

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or add a comment, or this will be closed in 5 days.