akowalsk opened this issue 6 months ago
Likely related to #1777.
I've also encountered this problem, and the length-limit error also happens with the idefics-9b-instruct model. That model works with images of varying dimensions, but it still fails when the image is large (over 1MB).
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I will revalidate on the latest TGI version shortly.
I tried this again with the latest version, using the idefics2-8b-chatty model instead of the llava model, and the issue persists.
Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: f426a3398d12808f20c101487329e563d32bfbaf
Docker label: sha-f426a33
nvidia-smi:
Fri Jun 21 20:35:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 0% 30C P8 18W / 350W | 15380MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:21:00.0 Off | N/A |
| 0% 30C P8 22W / 350W | 15380MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
model info
{
  "model_id": "/opt/ml/checkpoint/idefics2-8b-chatty",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 24576,
  "max_total_tokens": 32768,
  "waiting_served_ratio": 0.3,
  "max_batch_total_tokens": 192080,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "router": "text-generation-router",
  "version": "2.0.4",
  "sha": "f426a3398d12808f20c101487329e563d32bfbaf",
  "docker_label": "sha-f426a33"
}
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I tried to replicate this on the latest TGI version (2.2) and ended up with a different error:
{"timestamp":"2024-07-25T17:50:30.156102Z","level":"ERROR","message":"Server error: 'Tensor' object has no attribute 'input_lengths'","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":46,"span":{"size":1,"name":"decode"},"spans":[{"batch_size":1,"name":"batch"},{"name":"decode"},{"size":1,"name":"decode"},{"size":1,"name":"decode"}]}
{"timestamp":"2024-07-25T17:50:30.149213Z","level":"ERROR","fields":{"message":"Method Decode encountered an error.\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 309, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 723, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 193, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 692, in wrapper\n return callback(**use_params)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 118, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 297, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n return await response\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 120, in _unary_interceptor\n raise error\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 111, in _unary_interceptor\n return await behavior(request_or_iterator, context)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 183, in Decode\n generations, next_batch, timings = self.model.generate_token(batch)\n File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n return func(*args, **kwds)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1376, in generate_token\n out, speculative_logits = self.forward(batch, adapter_data)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py\", line 351, in forward\n logits, speculative_logits = self.model.forward(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py\", line 824, in forward\n hidden_states = self.text_model.model(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in 
_wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 447, in forward\n hidden_states, residual = layer(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 372, in forward\n attn_output = self.self_attn(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 235, in forward\n attn_output = paged_attention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py\", line 116, in paged_attention\n input_lengths = seqlen.input_lengths\nAttributeError: 'Tensor' object has no attribute 'input_lengths'"},"target":"text_generation_launcher"}
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Still experiencing the issue.
Also experiencing this issue when running with this model.
System Info
text-generation-launcher --env
model info
Reproduction
Use an image larger than 1MB and set IMAGE_PATH and API_ENDPOINT appropriately:
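The exact request script from the original report isn't preserved here; the following is a minimal sketch of a request of this shape, assuming TGI's `/generate` endpoint and the inline markdown image syntax (`![](data:image/...;base64,...)`) accepted for vision models, with IMAGE_PATH and API_ENDPOINT read from the environment:

```python
import base64
import os

import requests

# Hypothetical reproduction sketch: inline the image as a base64 data URI
# inside the prompt and POST it to the TGI /generate endpoint.
image_path = os.environ["IMAGE_PATH"]
api_endpoint = os.environ["API_ENDPOINT"]

with open(image_path, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "inputs": f"![](data:image/jpeg;base64,{image_b64})Describe this image.",
    "parameters": {"max_new_tokens": 128},
}

response = requests.post(f"{api_endpoint}/generate", json=payload)
print(response.status_code)
print(response.text)
```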
This will print:
Failed to buffer the request body: length limit exceeded
With an image smaller than 1MB, it generates correctly.
Expected behavior
It should generate text for the image as long as the request fits within the model's context. Given the error text and its similarity to https://github.com/tokio-rs/axum/issues/1652, this looks like it is related to Axum's default request body size limit.
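One quick way to check whether the limit is hit by the encoded request rather than the raw file (a diagnostic sketch, not part of the original report) is to compare the image size on disk with the size of the JSON body that actually goes over the wire, since base64 encoding inflates the image by roughly a third:

```python
import base64
import json
import os

# Diagnostic sketch: compare the raw image size with the JSON request body
# that would be sent to the server. Base64 encoding grows the payload by
# roughly 33%, so a file just over 1MB produces a noticeably larger body.
image_path = os.environ["IMAGE_PATH"]

with open(image_path, "rb") as f:
    raw = f.read()

body = json.dumps(
    {
        "inputs": "![](data:image/jpeg;base64,"
        + base64.b64encode(raw).decode("utf-8")
        + ")Describe this image.",
        "parameters": {"max_new_tokens": 128},
    }
).encode("utf-8")

print(f"raw image:    {len(raw) / 1_000_000:.2f} MB")
print(f"request body: {len(body) / 1_000_000:.2f} MB")
```

If the encoded body comfortably exceeds the router's body limit while the raw file does not, raising that limit on the router side or downscaling/re-encoding the image before sending it would be the obvious workarounds.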