huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Llava Next crashes on certain image sizes #1777

Open ktrapeznikov opened 1 week ago

ktrapeznikov commented 1 week ago

System Info

Running in docker

Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 00f365353ea5cf29438ba1d51baadaab79ae4674
Docker label: sha-00f3653
nvidia-smi:
Sat Apr 20 00:19:12 2024       
   +---------------------------------------------------------------------------------------+
   | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
   |-----------------------------------------+----------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
   |                                         |                      |               MIG M. |
   |=========================================+======================+======================|
   |   0  NVIDIA RTX A6000               On  | 00000000:44:00.0 Off |                  Off |
   | 30%   33C    P8              21W / 300W |  45076MiB / 49140MiB |      0%      Default |
   |                                         |                      |                  N/A |
   +-----------------------------------------+----------------------+----------------------+

   +---------------------------------------------------------------------------------------+
   | Processes:                                                                            |
   |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
   |        ID   ID                                                             Usage      |
   |=======================================================================================|
   +---------------------------------------------------------------------------------------+

CLI Arguments

 model_id: "llava-hf/llava-v1.6-34b-hf", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(BitsandbytesNF4), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(4095), max_total_tokens: Some(4096), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(4096), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true

Info

{
  "model_id": "llava-hf/llava-v1.6-34b-hf",
  "model_sha": "5400ac92f6e1595288302ba9ab20db8542c0b8e5",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "image-text-to-text",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 4095,
  "max_total_tokens": 4096,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 108112,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "version": "2.0.0",
  "sha": "00f365353ea5cf29438ba1d51baadaab79ae4674",
  "docker_label": "sha-00f3653"
}

Information

Tasks

Reproduction

Here is a script that I run on the image below with the prompt "Describe the image?". Note that the image is 286 × 524 pixels. The request returns an error and the service crashes.

(attached image: test2.jpeg, 286 × 524)

from PIL import Image
import requests
import base64
from io import BytesIO

# load the local test image from disk
image = Image.open("test2.jpeg")

# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="JPEG")  # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')

# format image string 
image_string = f"data:image/jpeg;base64,{base64_image}"
query = "Describe the image?"
prompt=f"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n![]({image_string})\n{query}<|im_end|><|im_start|>assistant\n"

headers = {
    "Accept" : "application/json",
    "Content-Type": "application/json" 
}

payload = {"inputs":prompt}
response = requests.post("endpoint/generate", headers=headers, json=payload)
response.json()
{'error': 'Request failed during generation: Server error: CANCELLED',
 'error_type': 'generation'}

Logs from the tgi service

tgi-llava-1  | 2024-04-20T00:13:55.522584Z ERROR text_generation_launcher: Method Prefill encountered an error.
tgi-llava-1  | Traceback (most recent call last):
tgi-llava-1  |   File "/opt/conda/bin/text-generation-server", line 8, in <module>
tgi-llava-1  |     sys.exit(app())
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1  |     return get_command(self)(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1  |     return self.main(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1  |     return _main(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1  |     rv = self.invoke(ctx)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1  |     return __callback(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1  |     return callback(**use_params)  # type: ignore
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-llava-1  |     server.serve(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
tgi-llava-1  |     asyncio.run(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1  |     return loop.run_until_complete(main)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1  |     self.run_forever()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1  |     self._run_once()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1  |     handle._run()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1  |     self._context.run(self._callback, *self._args)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1  |     return await self.intercept(
tgi-llava-1  | > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1  |     return await response
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
tgi-llava-1  |     raise error
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
tgi-llava-1  |     return await behavior(request_or_iterator, context)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Prefill
tgi-llava-1  |     generations, next_batch, timings = self.model.generate_token(batch)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1  |     return func(*args, **kwds)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
tgi-llava-1  |     raise e
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
tgi-llava-1  |     out, speculative_logits = self.forward(batch)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 285, in forward
tgi-llava-1  |     logits, speculative_logits = self.model.forward(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 283, in forward
tgi-llava-1  |     inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 183, in _merge_input_ids_with_image_features
tgi-llava-1  |     inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1  | RuntimeError: shape mismatch: value tensor of shape [1676, 7168] cannot be broadcast to indexing result of shape [2781, 7168]
tgi-llava-1  | 
tgi-llava-1  | 2024-04-20T00:13:55.879011Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
tgi-llava-1  | 2024-04-20T00:13:56.509685Z ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
tgi-llava-1  | 2024-04-20T00:13:56.509723Z ERROR generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.9), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.6), typical_p: None, do_sample: true, max_new_tokens: Some(3704), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }}:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
tgi-llava-1  | 2024-04-20T00:13:56.601488Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
tgi-llava-1  | 
tgi-llava-1  | You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
tgi-llava-1  | Exception ignored in: <function Server.__del__ at 0x7f73512317e0>
tgi-llava-1  | Traceback (most recent call last):
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc/aio/_server.py", line 185, in __del__
tgi-llava-1  |     cygrpc.schedule_coro_threadsafe(
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 120, in grpc._cython.cygrpc.schedule_coro_threadsafe
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/common.pyx.pxi", line 112, in grpc._cython.cygrpc.schedule_coro_threadsafe
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 436, in create_task
tgi-llava-1  |     self._check_closed()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
tgi-llava-1  |     raise RuntimeError('Event loop is closed')
tgi-llava-1  | RuntimeError: Event loop is closed
tgi-llava-1  | sys:1: RuntimeWarning: coroutine 'AioServer.shutdown' was never awaited
tgi-llava-1  | Task exception was never retrieved
tgi-llava-1  | future: <Task finished name='Task-12' coro=<<coroutine without __name__>()> exception=SystemExit(1)>
tgi-llava-1  | Traceback (most recent call last):
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
tgi-llava-1  |     return await response
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
tgi-llava-1  |     raise error
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
tgi-llava-1  |     return await behavior(request_or_iterator, context)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 123, in Prefill
tgi-llava-1  |     generations, next_batch, timings = self.model.generate_token(batch)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
tgi-llava-1  |     return func(*args, **kwds)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 945, in generate_token
tgi-llava-1  |     raise e
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 942, in generate_token
tgi-llava-1  |     out, speculative_logits = self.forward(batch)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 285, in forward
tgi-llava-1  |     logits, speculative_logits = self.model.forward(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 283, in forward
tgi-llava-1  |     inputs_embeds = self._merge_input_ids_with_image_features(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/llava_next.py", line 183, in _merge_input_ids_with_image_features
tgi-llava-1  |     inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
tgi-llava-1  | RuntimeError: shape mismatch: value tensor of shape [1676, 7168] cannot be broadcast to indexing result of shape [2781, 7168]
tgi-llava-1  | 
tgi-llava-1  | During handling of the above exception, another exception occurred:
tgi-llava-1  | 
tgi-llava-1  | Traceback (most recent call last):
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
tgi-llava-1  |     return get_command(self)(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
tgi-llava-1  |     return self.main(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
tgi-llava-1  |     return _main(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
tgi-llava-1  |     rv = self.invoke(ctx)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
tgi-llava-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
tgi-llava-1  |     return ctx.invoke(self.callback, **ctx.params)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
tgi-llava-1  |     return __callback(*args, **kwargs)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
tgi-llava-1  |     return callback(**use_params)  # type: ignore
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
tgi-llava-1  |     server.serve(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve
tgi-llava-1  |     asyncio.run(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
tgi-llava-1  |     return loop.run_until_complete(main)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
tgi-llava-1  |     self.run_forever()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
tgi-llava-1  |     self._run_once()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
tgi-llava-1  |     handle._run()
tgi-llava-1  |   File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
tgi-llava-1  |     self._context.run(self._callback, *self._args)
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 702, in _handle_exceptions
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
tgi-llava-1  |   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 408, in _finish_handler_with_unary_response
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
tgi-llava-1  |     return await self.intercept(
tgi-llava-1  |   File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 28, in intercept
tgi-llava-1  |     exit(1)
tgi-llava-1  |   File "/opt/conda/lib/python3.10/_sitebuiltins.py", line 26, in __call__
tgi-llava-1  |     raise SystemExit(code)
tgi-llava-1  | SystemExit: 1 rank=0
tgi-llava-1  | 2024-04-20T00:13:56.632158Z ERROR text_generation_launcher: Shard 0 crashed
tgi-llava-1  | 2024-04-20T00:13:56.632178Z  INFO text_generation_launcher: Terminating webserver
tgi-llava-1  | 2024-04-20T00:13:56.632196Z  INFO text_generation_launcher: Waiting for webserver to gracefully shutdown
tgi-llava-1  | 2024-04-20T00:13:56.632405Z  INFO text_generation_router::server: router/src/server.rs:1504: signal received, starting graceful shutdown
tgi-llava-1  | 2024-04-20T00:13:56.732331Z  INFO text_generation_launcher: webserver terminated
tgi-llava-1  | 2024-04-20T00:13:56.732350Z  INFO text_generation_launcher: Shutting down shards
tgi-llava-1  | Error: ShardFailed
tgi-llava-1 exited with code 1

Expected behavior

When I run the same script on an image that's square (554x554), it behaves as expected.

(attached image: test, 554 × 554)

Response

{'generated_text': "The image shows a young dog with a mix of black and brown fur. It has a curious expression, with wide, dark eyes that are turned towards the camera and a slightly tilted head, suggesting attentiveness. The dog's fur appears soft and shiny, and it has a white area on its muzzle and underbelly, which is common in many dog breeds. The background is a plain light color, providing a stark contrast to the dog's dark fur and highlighting its features. The"}

Logs from the tgi service

tgi-llava-1  | 2024-04-20T01:42:05.198186Z  INFO generate{parameters=GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: true, max_new_tokens: Some(100), return_full_text: None, stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None } total_time="7.207770204s" validation_time="74.92µs" queue_time="39.85µs" inference_time="7.207655694s" time_per_token="72.076556ms" seed="Some(12891300100484859231)"}: text_generation_router::server: router/src/server.rs:310: Success
ktrapeznikov commented 1 week ago

Sometimes it works with landscape images of certain sizes; sometimes it also crashes. Do image sizes have to be multiples of 336?

shuaills commented 1 week ago

Same problem: Method Prefill encountered an error.

shuaills commented 1 week ago

It seems that the current implementation counts the tokens generated from the encoded image as part of the prompt length. It might be better to extract the image features first and then calculate the prompt token length separately. I'm not sure if TGI has support for this approach, as it could be quite involved.
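
As a rough illustration of why the count depends on image size (this is not TGI's actual code): LLaVA-NeXT's "anyres" scheme fits each image into one of a few candidate resolutions and tiles it into 336 × 336 crops, so the number of image features jumps with small changes in size and aspect ratio. A self-contained sketch, where the grid pinpoints, crop size, and 576-features-per-crop figure are assumptions taken from the public llava-v1.6 configs:

# Rough sketch -- NOT TGI's code. Shows why the number of image features
# depends on image size under LLaVA-NeXT's "anyres" tiling.
# Assumptions: grid pinpoints from the public llava-v1.6 configs, 336 px crops,
# and 576 = (336 / 14) ** 2 features per crop; the (width, height) vs
# (height, width) ordering convention varies between implementations.
GRID_PINPOINTS = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]
CROP_SIZE = 336
FEATURES_PER_CROP = 576


def select_best_resolution(width, height):
    """Pick the candidate resolution that keeps the most image area while
    wasting the least padding (the standard LLaVA-NeXT heuristic)."""
    best, best_effective, best_waste = None, -1, float("inf")
    for cand_w, cand_h in GRID_PINPOINTS:
        scale = min(cand_w / width, cand_h / height)
        effective = min(int(width * scale) * int(height * scale), width * height)
        waste = cand_w * cand_h - effective
        if effective > best_effective or (effective == best_effective and waste < best_waste):
            best, best_effective, best_waste = (cand_w, cand_h), effective, waste
    return best


def approx_image_features(width, height):
    """Upper bound: one base crop plus one crop per tile of the best-fit
    resolution. The real model also unpads and adds newline features, so
    the exact count is lower and aspect-ratio dependent."""
    best_w, best_h = select_best_resolution(width, height)
    tiles = (best_w // CROP_SIZE) * (best_h // CROP_SIZE)
    return (1 + tiles) * FEATURES_PER_CROP


for size in [(286, 524), (554, 554), (450, 299), (800, 531)]:
    print(size, select_best_resolution(*size), approx_image_features(*size))

The real model additionally unpads the tiled features and inserts newline features, so the exact count is lower than this upper bound; if the prompt-side placeholder count and the model-side feature count disagree, you get exactly the broadcast error shown above.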

MrToy commented 1 week ago

Same issue; only images with width == height work.
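
Until this is fixed, a possible client-side workaround (a sketch only, based on the observation that square inputs work) is to pad the image onto a square canvas before encoding it:

# Sketch of a possible client-side workaround, not a fix: pad the image
# onto a square canvas before encoding, since square inputs appear to work.
from io import BytesIO
import base64

from PIL import Image


def to_square_data_uri(path, fill=(255, 255, 255)):
    """Pad the image to a square canvas and return a base64 data URI in
    the same format as the reproduction script above."""
    image = Image.open(path).convert("RGB")
    side = max(image.size)
    canvas = Image.new("RGB", (side, side), fill)
    # paste centered so the original content sits in the middle of the square
    canvas.paste(image, ((side - image.width) // 2, (side - image.height) // 2))
    buffer = BytesIO()
    canvas.save(buffer, format="JPEG")
    return "data:image/jpeg;base64," + base64.b64encode(buffer.getvalue()).decode("utf-8")


image_string = to_square_data_uri("test2.jpeg")

Padding does change what the model sees around the edges of the image, so treat this as a stopgap rather than a fix.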

alexgravx commented 16 hours ago

I have the same issue; it seems to be linked to image size. I found that some sizes work in TGI v2.0.1 but not in TGI v2.0.2, and vice versa.

Here is a recap of the image sizes I tested. Note that image 2-bis is image 2 cropped, to confirm that the dimensions are what cause the issue.

Image   Dimensions           Ratio L/W   Works in v2.0.1   Works in v2.0.2
1       450 x 299            1.505       No                Yes
2       800 x 531            1.506       Yes               No
2-bis   450 x 299            1.505       No                Yes
3       300 x 168            1.785       No                Yes
4       640 x 480            1.333       Yes               Yes
5       934 x 934 (square)   1           Yes               Yes

When the image doesn't have the right dimensions, the server encounters an error and crashes. Here are the logs I get:

v2.0.1 (image 1 crash)

ERROR text_generation_launcher: Method Prefill encountered an error.
...
RuntimeError: shape mismatch: value tensor of shape [1464, 4096] cannot be broadcast to indexing result of shape [1376, 4096]
...
ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
...
ERROR text_generation_launcher: Shard 0 crashed

v2.0.2 (image 2 crash, not happening at warmup)

INFO text_generation_launcher: Found 2095 in image of resolution 531x800
ERROR text_generation_launcher: Method Prefill encountered an error.
...
RuntimeError: shape mismatch: value tensor of shape [2144, 4096] cannot be broadcast to indexing result of shape [2095, 4096]
...
RuntimeError: Cannot fill images right now. If error happens at warmup, make sure you have enough `--max-input-tokens`  to handle images. If error happens at regular runtime, please fill in an issue: shape mismatch: value tensor of shape [2144, 4096] cannot be broadcast to indexing result of shape [2095, 4096]
...
ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
ERROR batch{batch_size=1}:prefill:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
ERROR chat_completions:generate:generate_stream:infer:send_error: text_generation_router::infer: router/src/infer.rs:866: Request failed during generation: Server error: CANCELLED
...
ERROR text_generation_launcher: Shard 0 crashed

My model info

{
    model_id: "llava-hf/llava-v1.6-mistral-7b-hf",
    validation_workers: 2,
    trust_remote_code: false,
    max_concurrent_requests: 128,
    max_best_of: 2,
    max_stop_sequences: 4,
    max_top_n_tokens: 5,
    max_input_tokens: Some(4000),
    max_total_tokens: Some(5000),
    waiting_served_ratio: 0.3,
    max_waiting_tokens: 20,
    hostname: "0.0.0.0",
    port: 80,
    shard_uds_path: "/tmp/text-generation-server",
    master_addr: "localhost",
    master_port: 29500,
    huggingface_hub_cache: Some("/data"),
    disable_custom_kernels: false,
    cuda_memory_fraction: 1.0,
    json_output: false,
    cors_allow_origin: [],
    ngrok: false,
    disable_grammar_support: false,
    env: false,
    max_client_batch_size: 4,
}
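
For reference, here is a minimal sketch of the kind of size sweep that produces the table above. The endpoint URL, the synthetic grey test images, and the generation parameters are placeholders; the actual tests used real photos and the chat_completions route.

# Sketch of a size sweep against a running TGI instance. The endpoint URL,
# the synthetic grey test images, and max_new_tokens are placeholders.
import base64
from io import BytesIO

import requests
from PIL import Image

ENDPOINT = "http://localhost:8080/generate"  # placeholder: point at your TGI server
SIZES = [(450, 299), (800, 531), (300, 168), (640, 480), (934, 934)]

for width, height in SIZES:
    # build a solid-grey JPEG of the requested size and embed it as a data URI
    buffer = BytesIO()
    Image.new("RGB", (width, height), (128, 128, 128)).save(buffer, format="JPEG")
    image_string = "data:image/jpeg;base64," + base64.b64encode(buffer.getvalue()).decode("utf-8")
    prompt = f"![]({image_string})\nDescribe the image?"
    try:
        response = requests.post(
            ENDPOINT,
            json={"inputs": prompt, "parameters": {"max_new_tokens": 8}},
            timeout=120,
        )
        status = "ok" if response.ok else f"HTTP {response.status_code}: {response.text[:80]}"
    except requests.RequestException as exc:
        status = f"request failed: {exc}"
    print(f"{width} x {height}: {status}")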