RonanKMcGovern closed this issue 8 months ago.
That's not an OOM error though. I will look into why AWQ is throwing an indexing error. BTW, you do not need quantization if you have that much VRAM at your disposal.
Thanks!
So the KV cache is then stored on the GPU in bf16, not quantized?
Even so, shouldn't quantizing the weights leave more room for the KV cache, and therefore allow a longer context length? Or am I misunderstanding?
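To put rough numbers on that intuition (the model dimensions below are an assumption for illustration only, roughly a Llama-2-7B-style model with 32 layers, 32 KV heads, head dim 128, and the KV cache in bf16; the actual model behind the template may differ):

$$
\underbrace{2}_{K,\,V} \times 32 \times 32 \times 128 \times 2\,\text{bytes} \approx 0.5\,\text{MB/token}
\quad\Rightarrow\quad
32{,}000\,\text{tokens} \approx 16\,\text{GB of KV cache}
$$

Under those assumptions, 4-bit weight quantization (roughly 14 GB in bf16 vs. about 4 GB with AWQ for a 7B model) would indeed free on the order of 10 GB that could otherwise hold KV cache, which is the basis of the question above.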
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Is this still being investigated?
Unfortunately we are not able to reproduce this, and we don't really have an A6000 to test it on.
Launching with CUDA_LAUNCH_BLOCKING=1 should help diagnose this a bit better (in all likelihood it's AWQ that's causing the issue).
It's probably linked to compute_cap < 7.5 tbh, which is going to be hard to fix. Using a different quantization (GPTQ, EETQ, BNB), or running unquantized, should help there. Without a reproduction it's hard to fix, though.
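As a concrete illustration of both suggestions, here is a minimal sketch of a local launch; the image tag matches the 1.3.0 version mentioned in this issue, but the model IDs are placeholders, not taken from the RunPod template:

```bash
# Debug run: CUDA_LAUNCH_BLOCKING=1 makes CUDA kernels synchronous, so the
# failing op is reported at its call site instead of asynchronously later.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e CUDA_LAUNCH_BLOCKING=1 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id <awq-model-id> --quantize awq

# To rule AWQ out, swap the quantization backend (or drop --quantize entirely):
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id <gptq-model-id> --quantize gptq
```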
System Info
Docker image: 1.3.0
Public RunPod template: https://runpod.io/gsc?template=3uvdgyo0yy&ref=jmfkcdio
Reproduction
Run the runpod template (which uses a docker image) on an A6000 (48 GB).
Expected behavior
The same model, on an A6000, can run with 32,000 tokens of input on vLLM. So, the GPU is capable.
Possibly TGI is over-reserving memory (or maybe paged attention is implemented differently)?
It would be great if this could be addressed, because TGI is faster than vLLM for longer contexts (possibly because of flash decoding?). But that benefit can't be brought to bear if setting a long context causes an OOM.
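If over-reservation is the issue, the launcher's memory knobs are the first thing to try. A sketch follows; the values are guesses for a 48 GB card and the model ID is a placeholder, not tested settings from the template:

```bash
# Sketch: cap how much VRAM TGI claims and how large the warmup prefill is,
# while still requesting a 32k context.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:1.3.0 \
  --model-id <awq-model-id> --quantize awq \
  --max-input-length 32000 \
  --max-total-tokens 32768 \
  --max-batch-prefill-tokens 32000 \
  --cuda-memory-fraction 0.95
```

For comparison, vLLM's closest knob is --gpu-memory-utilization (default 0.9), which may be part of why the same 32,000-token setting fits there.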