huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Launching Idefics2 QLoRA failing on warmup - shape mismatch: value tensor of shape [64, 4096] cannot be broadcast to indexing result of shape [320, 4096] #1943

Closed: tfcoe closed this issue 4 months ago

tfcoe commented 4 months ago

### System Info

ghcr.io/huggingface/text-generation-inference:2.0.3

Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: 6073ece4fc2d7180c2057cb49b9ea436463fd52b
Docker label: sha-6073ece
nvidia-smi:

```
Thu May 23 23:05:40 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   29C    P8              16W / 300W |      0MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

xpu-smi: N/A

### Reproduction

1. Download the custom Idefics2 QLoRA merged model to the server.
2. Start TGI using the standard Docker command:

```shell
model=/data
volume=~/models/idefics2-8b

docker run --gpus all -p 9000:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0.3 \
  --model-id $model
```


The idefics2-qlora model files are:

```
added_tokens.json
config.json
generation_config.json
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
preprocessor_config.json
processor_config.json
special_tokens_map.json
tokenizer.json
tokenizer.model
tokenizer_config.json
version.txt
```
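For completeness: once the container reports the server is ready, I would smoke-test it like this (a minimal sketch; the prompt is a placeholder and the host port comes from the `-p 9000:80` mapping above). In this case the launch never gets that far.

```python
import requests

# Hypothetical smoke test against TGI's /generate endpoint.
response = requests.post(
    "http://localhost:9000/generate",
    json={
        "inputs": "What is Deep Learning?",  # placeholder prompt
        "parameters": {"max_new_tokens": 20},
    },
    timeout=60,
)
print(response.json())
```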


### Expected behavior

TGI should launch correctly with an Idefics2-8B model that has been fine-tuned with QLoRA and had the PEFT adapter weights merged into the base model.

- Launching the out-of-the-box Idefics2 works as expected, both from local files and from the HF Hub.
- Loading the QLoRA model and running inference with standard transformers works as expected:

Load model:

```python
import torch
from transformers import Idefics2ForConditionalGeneration

qlora_model = Idefics2ForConditionalGeneration.from_pretrained(
    config["qlora_model_path"],
    torch_dtype=torch.float16,
    device_map="auto",
)
qlora_model.eval()
```
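Running inference on top of that works too (a minimal sketch; the chat-template prompt construction is an assumption based on standard Idefics2 usage, and `image` is a placeholder PIL image):

```python
from transformers import AutoProcessor

# Assumes the processor was saved alongside the merged model.
processor = AutoProcessor.from_pretrained(config["qlora_model_path"])

# Build a prompt containing an image slot via the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(qlora_model.device)
generated_ids = qlora_model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```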


I suspect this could also be a result of how I've merged the QLoRA model, but I wanted to check whether it's a problem with TGI itself.

Observing a tensor shape mismatch on warmup: `shape mismatch: value tensor of shape [64, 4096] cannot be broadcast to indexing result of shape [320, 4096]`
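For context, the failing line in TGI (see the traceback below) scatters the image embeddings into the image-token positions of the input embeddings, and the error reproduces whenever the mask selects a different number of positions than there are image features. A minimal sketch with the sizes from the log (the sequence length here is a made-up placeholder):

```python
import torch

hidden_size = 4096
seq_len = 420  # placeholder: 320 image-token slots + 100 text tokens

inputs_embeds = torch.zeros(seq_len, hidden_size)
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[:320] = True  # the input side expects 320 image-token positions

image_features = torch.zeros(64, hidden_size)  # the model produced only 64 embeddings

# RuntimeError: shape mismatch: value tensor of shape [64, 4096] cannot be
# broadcast to indexing result of shape [320, 4096]
inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
```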

### Error 

```
2024-05-23T23:05:40.351753Z  INFO text_generation_launcher: Default max_input_tokens to 4095
2024-05-23T23:05:40.351756Z  INFO text_generation_launcher: Default max_total_tokens to 4096
2024-05-23T23:05:40.351758Z  INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-05-23T23:05:40.351761Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-05-23T23:05:40.351841Z  INFO download: text_generation_launcher: Starting download process.
2024-05-23T23:05:43.390788Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-05-23T23:05:44.056309Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-05-23T23:05:44.056478Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-05-23T23:05:51.174995Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-05-23T23:05:51.264243Z  INFO shard-manager: text_generation_launcher: Shard ready in 7.207144986s rank=0
2024-05-23T23:05:51.363988Z  INFO text_generation_launcher: Starting Webserver
2024-05-23T23:05:51.451386Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '' was expected to have ID '32000' but was given ID 'None'
2024-05-23T23:05:51.451435Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '' was expected to have ID '32001' but was given ID 'None'
2024-05-23T23:05:51.451438Z  WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '' was expected to have ID '32002' but was given ID 'None'
2024-05-23T23:05:51.451864Z  INFO text_generation_router: router/src/main.rs:289: Using config Some(Idefics2(Idefics2))
2024-05-23T23:05:51.451880Z  WARN text_generation_router: router/src/main.rs:298: no pipeline tag found for model /data
2024-05-23T23:05:51.455116Z  INFO text_generation_router: router/src/main.rs:317: Warming up model
2024-05-23T23:05:52.283020Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 253, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 114, in Warmup
    max_supported_total_tokens = self.model.warmup(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 776, in warmup
    _, batch, _ = self.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 966, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 963, in generate_token
    out, speculative_logits = self.forward(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 326, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 810, in forward
    inputs_embeds = self._merge_input_ids_with_image_features(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/idefics2.py", line 725, in _merge_input_ids_with_image_features
    inputs_embeds[mask] = image_features.view(-1, image_features.shape[-1])
RuntimeError: shape mismatch: value tensor of shape [64, 4096] cannot be broadcast to indexing result of shape [320, 4096]
2024-05-23T23:05:52.398327Z ERROR warmup{max_input_length=4095 max_prefill_tokens=4145 max_total_tokens=4096 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
Error: Warmup(Generation("CANCELLED"))
2024-05-23T23:05:52.500691Z ERROR text_generation_launcher: Webserver Crashed
2024-05-23T23:05:52.500724Z  INFO text_generation_launcher: Shutting down shards
2024-05-23T23:05:52.565565Z  INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-05-23T23:05:52.565742Z  INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-05-23T23:05:52.866152Z  INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
```

RMSML commented 4 months ago

@tfcoe Are you setting `do_image_splitting=False` on the processor? Try setting this flag to `True` before saving it and see if it works.
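For reference, the fix looks roughly like this (a minimal sketch; it assumes the processor was saved at the merged-model path from the snippet above, and that the flag lives on the processor's image processor, as in current transformers):

```python
from transformers import AutoProcessor

# Load the processor that was saved with the merged QLoRA model.
processor = AutoProcessor.from_pretrained(config["qlora_model_path"])

# Re-enable image splitting so the number of image tokens the processor
# emits matches the number of image embeddings the model produces.
processor.image_processor.do_image_splitting = True

# Overwrite preprocessor_config.json next to the model weights.
processor.save_pretrained(config["qlora_model_path"])
```

With splitting enabled, each image is processed as four crops plus the original image, which lines up with the 5x mismatch in the error (320 = 5 × 64).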

tfcoe commented 4 months ago

Yes! Rodrigo, that's solved it. In hindsight this was very obvious 🤦 Thanks so much!

RMSML commented 4 months ago

> Yes! Rodrigo, that's solved it. In hindsight this was very obvious 🤦 Thanks so much!

You're welcome. I also came across this error and it took me a while to figure it out.