huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Error "Failed to buffer the request body: length limit exceeded" when supplying base64 encoded images greater than 1MB in prompt #1802

akowalsk opened this issue 6 months ago (status: Open)

akowalsk commented 6 months ago

System Info

text-generation-launcher --env

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 2d0a7173d4891e7cd5f9b77f8e0987b82a339e51
Docker label: sha-2d0a717
nvidia-smi:
Wed Apr 24 19:58:49 2024
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
   |  0%   23C    P8             16W /  350W |    8450MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
   |  0%   25C    P8             20W /  350W |    8418MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

model info

{
  "model_id": "/opt/ml/checkpoint/llava-v1.6-mistral-7b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 24576,
  "max_total_tokens": 32768,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 65536,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "max_client_batch_size": 4,
  "version": "2.0.1",
  "sha": "2d0a7173d4891e7cd5f9b77f8e0987b82a339e51",
  "docker_label": "sha-2d0a717"
}

Reproduction

Use an image that is larger than 1MB, and set IMAGE_PATH and API_ENDPOINT appropriately:

from PIL import Image
import requests
import base64
from io import BytesIO

# load the image from disk (IMAGE_PATH points to a local image file)
image = Image.open(IMAGE_PATH)

# Convert the image to a base64 string
buffer = BytesIO()
image.save(buffer, format="PNG")  # Use the appropriate format (e.g., JPEG, PNG)
base64_image = base64.b64encode(buffer.getvalue()).decode('utf-8')

# format image string 
image_string = f"data:image/png;base64,{base64_image}"
query = "Describe the image?"
prompt = f"[INST] ![]({image_string})\n{query} [/INST]"

headers = {
    "Accept" : "application/json",
    "Content-Type": "application/json" 
}

payload = {"inputs":prompt}
response = requests.post(f"{API_ENDPOINT}/generate", headers=headers, json=payload)
try:
    print(response.json())
except ValueError:
    # the error response is plain text rather than JSON
    print(response.text)

This will print: Failed to buffer the request body: length limit exceeded

If the image is smaller than 1MB, it generates correctly.
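
Until the router's body limit is configurable, one client-side workaround is to re-encode the image so the base64 string stays under the observed threshold. The sketch below is illustrative only: the 1MB target is taken from the behavior described above, not from a documented limit, and encode_image_under_limit is a hypothetical helper, not part of TGI.

from io import BytesIO
import base64

from PIL import Image

def encode_image_under_limit(path, max_bytes=1_000_000):
    # Re-encode as JPEG, lowering quality and then resolution until the
    # base64 string fits under max_bytes.
    image = Image.open(path).convert("RGB")
    while True:
        for quality in (95, 85, 75, 65, 50):
            buffer = BytesIO()
            image.save(buffer, format="JPEG", quality=quality)
            encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
            if len(encoded) <= max_bytes:
                return f"data:image/jpeg;base64,{encoded}"
        # still too large at the lowest quality: halve the resolution
        image = image.resize((max(1, image.width // 2), max(1, image.height // 2)))

# usage: replaces the manual PNG encoding in the reproduction script
image_string = encode_image_under_limit(IMAGE_PATH)

JPEG is used here because it compresses photographic content far better than PNG; for screenshots or images that need transparency, the quality loss may matter.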

Expected behavior

It should generate text for the image as long as it fits within the model's context. Given the wording of the error and its similarity to https://github.com/tokio-rs/axum/issues/1652, this appears to be related to Axum's default request-body size limit.
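
For what it's worth, base64 encoding inflates the payload by roughly a third, so an image that looks safely under a limit on disk can still exceed it on the wire. A quick check (a sketch reusing IMAGE_PATH and payload from the reproduction script; the 2MB figure is axum's documented default body limit, which may or may not be what the router actually applies):

import json
import os

body = json.dumps(payload).encode("utf-8")
print(f"image on disk:        {os.path.getsize(IMAGE_PATH):,} bytes")
print(f"serialized JSON body: {len(body):,} bytes")
# axum's DefaultBodyLimit defaults to 2MB; a body above that would be
# rejected with a length-limit error if that default is in effect
print("exceeds 2MiB:", len(body) > 2 * 1024 * 1024)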

ktrapeznikov commented 6 months ago

Likely related to #1777.

akowalsk commented 6 months ago

I've also encountered that problem, but the "length limit exceeded" error also occurs with the idefics-9b-instruct model. That model works with images of varying dimensions, but it still fails when the image is large (over 1MB).

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

akowalsk commented 5 months ago

I will revalidate on the latest TGI version shortly.

akowalsk commented 4 months ago

I tried this again with the latest version, using the idefics2-8b-chatty model instead of the llava model, and the issue persists.

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: f426a3398d12808f20c101487329e563d32bfbaf
Docker label: sha-f426a33
nvidia-smi:
Fri Jun 21 20:35:18 2024
   +-----------------------------------------------------------------------------------------+
   | NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
   |-----------------------------------------+------------------------+----------------------+
   | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
   |                                         |                        |               MIG M. |
   |=========================================+========================+======================|
   |   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
   |  0%   30C    P8             18W /  350W |   15380MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+
   |   1  NVIDIA GeForce RTX 3090        On  |   00000000:21:00.0 Off |                  N/A |
   |  0%   30C    P8             22W /  350W |   15380MiB /  24576MiB |      0%      Default |
   |                                         |                        |                  N/A |
   +-----------------------------------------+------------------------+----------------------+

   +-----------------------------------------------------------------------------------------+
   | Processes:                                                                              |
   |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
   |        ID   ID                                                               Usage      |
   |=========================================================================================|
   +-----------------------------------------------------------------------------------------+

model info

{
    "model_id": "/opt/ml/checkpoint/idefics2-8b-chatty",
    "model_sha": null,
    "model_dtype": "torch.float16",
    "model_device_type": "cuda",
    "model_pipeline_tag": null,
    "max_concurrent_requests": 128,
    "max_best_of": 2,
    "max_stop_sequences": 4,
    "max_input_length": 24576,
    "max_total_tokens": 32768,
    "waiting_served_ratio": 0.3,
    "max_batch_total_tokens": 192080,
    "max_waiting_tokens": 20,
    "max_batch_size": null,
    "validation_workers": 2,
    "max_client_batch_size": 4,
    "router": "text-generation-router",
    "version": "2.0.4",
    "sha": "f426a3398d12808f20c101487329e563d32bfbaf",
    "docker_label": "sha-f426a33"
}

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

akowalsk commented 3 months ago

I tried to replicate this on the latest TGI version (2.2) and ended up with a different error:

{"timestamp":"2024-07-25T17:50:30.156102Z","level":"ERROR","message":"Server error: 'Tensor' object has no attribute 'input_lengths'","target":"text_generation_client","filename":"router/client/src/lib.rs","line_number":46,"span":{"size":1,"name":"decode"},"spans":[{"batch_size":1,"name":"batch"},{"name":"decode"},{"size":1,"name":"decode"},{"size":1,"name":"decode"}]}
{"timestamp":"2024-07-25T17:50:30.149213Z","level":"ERROR","fields":{"message":"Method Decode encountered an error.\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 309, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 723, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 193, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 692, in wrapper\n return callback(**use_params)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 118, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 297, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n File \"/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py\", line 165, in invoke_intercept_method\n return await self.intercept(\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py\", line 21, in intercept\n return await response\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 120, in _unary_interceptor\n raise error\n File \"/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 111, in _unary_interceptor\n return await behavior(request_or_iterator, context)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 183, in Decode\n generations, next_batch, timings = self.model.generate_token(batch)\n File \"/opt/conda/lib/python3.10/contextlib.py\", line 79, in inner\n return func(*args, **kwds)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py\", line 1376, in generate_token\n out, speculative_logits = self.forward(batch, adapter_data)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py\", line 351, in forward\n logits, speculative_logits = self.model.forward(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py\", line 824, in forward\n hidden_states = self.text_model.model(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in 
_wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 447, in forward\n hidden_states, residual = layer(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 372, in forward\n attn_output = self.self_attn(\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1532, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py\", line 1541, in _call_impl\n return forward_call(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py\", line 235, in forward\n attn_output = paged_attention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/attention/cuda.py\", line 116, in paged_attention\n input_lengths = seqlen.input_lengths\nAttributeError: 'Tensor' object has no attribute 'input_lengths'"},"target":"text_generation_launcher"}

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

akowalsk commented 2 months ago

Still experiencing the issue.

giladd123 commented 1 month ago

Also experiencing this issue when running with this model.