huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

LLaVA-NeXT 34B does not start: RuntimeError: shape mismatch: value tensor of shape #2195

Closed: ktrapeznikov closed this issue 3 months ago

ktrapeznikov commented 4 months ago

System Info

TGI version 2.1.1

tgi-llava-1  | 2024-07-05T20:49:53.276458Z  INFO text_generation_launcher: Runtime environment:
tgi-llava-1  | Target: x86_64-unknown-linux-gnu
tgi-llava-1  | Cargo version: 1.79.0
tgi-llava-1  | Commit sha: 4dfdb481fb1f9cf31561c056061d693f38ba4168
tgi-llava-1  | Docker label: sha-4dfdb48
tgi-llava-1  | nvidia-smi:
tgi-llava-1  | Fri Jul  5 20:49:53 2024       
tgi-llava-1  |    +---------------------------------------------------------------------------------------+
tgi-llava-1  |    | NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
tgi-llava-1  |    |-----------------------------------------+----------------------+----------------------+
tgi-llava-1  |    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
tgi-llava-1  |    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
tgi-llava-1  |    |                                         |                      |               MIG M. |
tgi-llava-1  |    |=========================================+======================+======================|
tgi-llava-1  |    |   0  NVIDIA RTX A6000               On  | 00000000:03:00.0 Off |                  Off |
tgi-llava-1  |    | 30%   32C    P8              20W / 300W |      1MiB / 49140MiB |      0%      Default |
tgi-llava-1  |    |                                         |                      |                  N/A |
tgi-llava-1  |    +-----------------------------------------+----------------------+----------------------+
tgi-llava-1  |    |   1  NVIDIA RTX A6000               On  | 00000000:04:00.0 Off |                  Off |
tgi-llava-1  |    | 30%   32C    P8              23W / 300W |      1MiB / 49140MiB |      0%      Default |
tgi-llava-1  |    |                                         |                      |                  N/A |
tgi-llava-1  |    +-----------------------------------------+----------------------+----------------------+
tgi-llava-1  |                                                                                             
tgi-llava-1  |    +---------------------------------------------------------------------------------------+
tgi-llava-1  |    | Processes:                                                                            |
tgi-llava-1  |    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
tgi-llava-1  |    |        ID   ID                                                             Usage      |
tgi-llava-1  |    |=======================================================================================|
tgi-llava-1  |    |  No running processes found                                                           |
tgi-llava-1  |    +---------------------------------------------------------------------------------------+

Reproduction

Starting the server with llava-hf/llava-v1.6-34b-hf on 2 GPUs:

tgi-llava-1  | 2024-07-05T20:54:28.872900Z  INFO text_generation_launcher: Args {
tgi-llava-1  |     model_id: "/data/models--llava-hf--llava-v1.6-34b-hf/snapshots/5400ac92f6e1595288302ba9ab20db8542c0b8e5",
tgi-llava-1  |     revision: None,
tgi-llava-1  |     validation_workers: 2,
tgi-llava-1  |     sharded: None,
tgi-llava-1  |     num_shard: None,
tgi-llava-1  |     quantize: None,
tgi-llava-1  |     speculate: None,
tgi-llava-1  |     dtype: None,
tgi-llava-1  |     trust_remote_code: false,
tgi-llava-1  |     max_concurrent_requests: 128,
tgi-llava-1  |     max_best_of: 2,
tgi-llava-1  |     max_stop_sequences: 4,
tgi-llava-1  |     max_top_n_tokens: 5,
tgi-llava-1  |     max_input_tokens: None,
tgi-llava-1  |     max_input_length: None,
tgi-llava-1  |     max_total_tokens: None,
tgi-llava-1  |     waiting_served_ratio: 0.3,
tgi-llava-1  |     max_batch_prefill_tokens: None,
tgi-llava-1  |     max_batch_total_tokens: None,
tgi-llava-1  |     max_waiting_tokens: 20,
tgi-llava-1  |     max_batch_size: None,
tgi-llava-1  |     cuda_graphs: None,
tgi-llava-1  |     hostname: "0.0.0.0",
tgi-llava-1  |     port: 80,
tgi-llava-1  |     shard_uds_path: "/tmp/text-generation-server",
tgi-llava-1  |     master_addr: "localhost",
tgi-llava-1  |     master_port: 29500,
tgi-llava-1  |     huggingface_hub_cache: Some(
tgi-llava-1  |         "/data",
tgi-llava-1  |     ),
tgi-llava-1  |     weights_cache_override: None,
tgi-llava-1  |     disable_custom_kernels: false,
tgi-llava-1  |     cuda_memory_fraction: 1.0,
tgi-llava-1  |     rope_scaling: None,
tgi-llava-1  |     rope_factor: None,
tgi-llava-1  |     json_output: false,
tgi-llava-1  |     otlp_endpoint: None,
tgi-llava-1  |     otlp_service_name: "text-generation-inference.router",
tgi-llava-1  |     cors_allow_origin: [],
tgi-llava-1  |     watermark_gamma: None,
tgi-llava-1  |     watermark_delta: None,
tgi-llava-1  |     ngrok: false,
tgi-llava-1  |     ngrok_authtoken: None,
tgi-llava-1  |     ngrok_edge: None,
tgi-llava-1  |     tokenizer_config_path: None,
tgi-llava-1  |     disable_grammar_support: false,
tgi-llava-1  |     env: false,
tgi-llava-1  |     max_client_batch_size: 4,
tgi-llava-1  |     lora_adapters: None,
tgi-llava-1  | }
tgi-llava-1  | 2024-07-05T20:54:28.873010Z  INFO text_generation_launcher: Default `max_input_tokens` to 4095
tgi-llava-1  | 2024-07-05T20:54:28.873016Z  INFO text_generation_launcher: Default `max_total_tokens` to 4096
tgi-llava-1  | 2024-07-05T20:54:28.873020Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
tgi-llava-1  | 2024-07-05T20:54:28.873024Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
tgi-llava-1  | 2024-07-05T20:54:28.873038Z  INFO text_generation_launcher: Sharding model on 2 processes
tgi-llava-1  | 2024-07-05T20:54:28.873229Z  INFO download: text_generation_launcher: Starting check and download process for /data/models--llava-hf--llava-v1.6-34b-hf/snapshots/5400ac92f6e1595288302ba9ab20db8542c0b8e5
tgi-llava-1  | 2024-07-05T20:54:30.327241Z  INFO text_generation_launcher: Detected system cuda
tgi-llava-1  | 2024-07-05T20:54:31.894234Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
tgi-llava-1  | 2024-07-05T20:54:32.577786Z  INFO download: text_generation_launcher: Successfully downloaded weights for /data/models--llava-hf--llava-v1.6-34b-hf/snapshots/5400ac92f6e1595288302ba9ab20db8542c0b8e5
tgi-llava-1  | 2024-07-05T20:54:32.578123Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
tgi-llava-1  | 2024-07-05T20:54:32.578292Z  INFO shard-manager: text_generation_launcher: Starting shard rank=1
tgi-llava-1  | 2024-07-05T20:54:34.205108Z  INFO text_generation_launcher: Detected system cuda
tgi-llava-1  | 2024-07-05T20:54:34.206359Z  INFO text_generation_launcher: Detected system cuda
tgi-llava-1  | 2024-07-05T20:54:42.589249Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
tgi-llava-1  | 2024-07-05T20:54:42.589930Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
tgi-llava-1  | 2024-07-05T20:54:51.337199Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
tgi-llava-1  | 2024-07-05T20:54:51.396869Z  INFO shard-manager: text_generation_launcher: Shard ready in 18.817230309s rank=0
tgi-llava-1  | 2024-07-05T20:54:51.435013Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
tgi-llava-1  | 2024-07-05T20:54:51.496975Z  INFO shard-manager: text_generation_launcher: Shard ready in 18.916861875s rank=1
tgi-llava-1  | 2024-07-05T20:54:51.593025Z  INFO text_generation_launcher: Starting Webserver
tgi-llava-1  | 2024-07-05T20:54:51.761758Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved000|>' was expected to have ID '64000' but was given ID 'None'    
tgi-llava-1  | 2024-07-05T20:54:51.761789Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved001|>' was expected to have ID '64001' but was given ID 'None'    
tgi-llava-1  | 2024-07-05T20:54:51.761793Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|reserved002|>' was expected to have ID '64002' but was given ID 'None'    
tgi-llava-1  | 2024-07-05T20:54:51.761795Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<image>' was expected to have ID '64003' but was given ID 'None'    
tgi-llava-1  | 2024-07-05T20:54:51.763979Z  INFO text_generation_router: router/src/main.rs:330: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
tgi-llava-1  | 2024-07-05T20:54:51.764068Z  INFO text_generation_router: router/src/main.rs:345: Using config Some(LlavaNext(LlavaNext { text_config: TextConfig, vision_config: VisionConfig { image_size: 336, patch_size: 14 }, image_grid_pinpoints: [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)] }))
tgi-llava-1  | 2024-07-05T20:54:51.764147Z  WARN text_generation_router: router/src/main.rs:354: no pipeline tag found for model /data/models--llava-hf--llava-v1.6-34b-hf/snapshots/5400ac92f6e1595288302ba9ab20db8542c0b8e5
tgi-llava-1  | 2024-07-05T20:54:51.768672Z  INFO text_generation_router::server: router/src/server.rs:1567: Warming up model
tgi-llava-1  | 2024-07-05T20:54:51.833869Z  INFO text_generation_launcher: Found 1176 features in image of resolution 20x20
tgi-llava-1  | 2024-07-05T20:54:51.862948Z  INFO text_generation_launcher: Found 1176 features in image of resolution 20x20
tgi-llava-1  | 2024-07-05T20:54:52.595889Z ERROR text_generation_launcher: Method Warmup encountered an error.
tgi-llava-1  | Traceback (most recent call last):
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 158, in _merge_input_ids_with_image_features
tgi-llava-1  |     inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1  | RuntimeError: shape mismatch: value tensor of shape [1176, 7168] cannot be broadcast to indexing result of shape [0, 7168]
tgi-llava-1  | 
tgi-llava-1  | During handling of the above exception, another exception occurred:
tgi-llava-1  | 
tgi-llava-1  | Traceback (most recent call last):
tgi-llava-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llava-1  |     sys.exit(app())
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1  |     return get_command(self)(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1  |     return self.main(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1  |     return _main(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1  |     rv = self.invoke(ctx)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1  |     return __callback(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1  |     return callback(**use_params)  # type: ignore
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
tgi-llava-1  |     server.serve(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
tgi-llava-1  |     asyncio.run(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1  |     return loop.run_until_complete(main)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1  |     self.run_forever()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1  |     self._run_once()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1  |     handle._run()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1  |     self._context.run(self._callback, *self._args)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1  |     return await self.intercept(
tgi-llava-1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1  |     return await response
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
tgi-llava-1  |     raise error
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
tgi-llava-1  |     return await behavior(request_or_iterator, context)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
tgi-llava-1  |     max_supported_total_tokens = self.model.warmup(batch)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 985, in warmup
tgi-llava-1  |     _, batch, _ = self.generate_token(batch)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1  |     return func(*args, **kwds)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1253, in generate_token
tgi-llava-1  |     out, speculative_logits = self.forward(batch, adapter_data)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 319, in forward
tgi-llava-1  |     logits, speculative_logits = self.model.forward(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 268, in forward
tgi-llava-1  |     inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 160, in _merge_input_ids_with_image_features
tgi-llava-1  |     raise RuntimeError(
tgi-llava-1  | RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens`  to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [1176, 7168] cannot be broadcast to indexing result of shape [0, 7168]
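For context on the failure: in `_merge_input_ids_with_image_features`, TGI builds a boolean mask over the input ids at the `<image>` placeholder positions and scatters the vision-tower features into those embedding rows. The tokenizer warnings above ("Token '<image>' was expected to have ID '64003' but was given ID 'None'") suggest the placeholder may never be tokenized to its id, so the mask could match zero positions while the vision tower still produces 1176 feature rows. A minimal sketch that reproduces the shape mismatch (the placeholder id of 64003 and the dummy input ids are assumptions taken from the warnings above, not values confirmed by the log):

import torch

hidden_size = 7168                      # embedding width from the error message
image_token_id = 64003                  # <image> id per the tokenizer warnings (assumption)
input_ids = torch.tensor([10, 11, 12])  # dummy prompt containing no <image> ids
inputs_embeds = torch.randn(len(input_ids), hidden_size)
image_features = torch.randn(1176, hidden_size)  # "Found 1176 features in image of resolution 20x20"

mask = input_ids == image_token_id      # zero True entries if <image> never appears
# Selecting with an all-False mask yields a [0, 7168] view, so assigning the
# [1176, 7168] value tensor raises exactly the reported RuntimeError:
inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])

The 1176 figure itself is plausible for LLaVA-NeXT's anyres scheme (a 24x24 base grid of 576 patch features, plus a 24x24 unpadded grid crop and 24 newline features, gives 576 + 600 = 1176), though that derivation is an inference rather than something the log states.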

Expected behavior

It should start without errors.

LysandreJik commented 4 months ago

Thanks for your issue @ktrapeznikov! We're taking a look.

Maybe cc @Narsil, who contributed that code.

ktrapeznikov commented 4 months ago

It works with the smaller model llava-hf/llava-v1.6-mistral-7b-hf.
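For anyone comparing the two models, a minimal request that exercises the image path against a running TGI instance looks roughly like this (hypothetical host, port, and image URL; TGI's vision-language models accept images embedded in the prompt as markdown):

import requests

# The image is passed inline in the prompt using markdown image syntax.
prompt = "![](https://example.com/some-image.png)Describe this image."
resp = requests.post(
    "http://localhost:8080/generate",  # adjust to your deployment
    json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
    timeout=120,
)
print(resp.json())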

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.