RuntimeError: CUDA error: device-side assert triggered on A4000

Cyberes commented 1 year ago

System Info

Version: ghcr.io/huggingface/text-generation-inference:latest GPU: Nvidia A4000 Ubuntu 22.04

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

Start the Docker container on a host with an Nvidia A4000.

Send this specific prompt:

{
"inputs": "What is Deep Learning?",
"parameters": {
"max_new_tokens": 1024,
"repetition_penalty": 1.1,
"seed": null,
"stop": ["\nUser:", "\n### Instruction:", "\n### 2 to 3 Paragraph Response (engaging, natural, authentic, descriptive, creative, dialog):", "\nundefined"],
"temperature": 0.7,
"top_k": 100,
"top_p": 0.95,
"typical_p": 0.999,
"watermark": false,
"do_sample": true,
"return_full_text": false,
"details": true
}
}

I've noticed that if it doesn't crash on the first specific request, it will crash on the second.

I spun up an A4000 with the TGI Docker container on Vast and got the same error, so it isn't just my machine.

Watch it crash.

Logs

Here's the CUDA_LAUNCH_BLOCKING=1 debug:

Command Ouput

``` 2023-08-30T17:15:17.762274Z ERROR text_generation_launcher: Method Decode encountered an error. Traceback (most recent call last): File "/opt/conda/bin/text-generation-server", line 8, in sys.exit(app()) File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main return _main( File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve server.serve( File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve asyncio.run( File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete self.run_forever() File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever self._run_once() File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once handle._run() File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method return await self.intercept( > File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept return await response File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor raise error File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 109, in Decode generations, next_batch = self.model.generate_token(batch) File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner return func(*args, **kwds) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_causal_lm.py", line 834, in generate_token next_input_ids, next_token_logprobs = batch.next_token_chooser( File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/tokens.py", line 229, in __call__ scores = warper(input_ids, scores) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/logits_process.py", line 352, in __call__ sorted_indices_to_remove = sorted_scores > sorted_scores.gather( RuntimeError: CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 2023-08-30T17:15:17.764541Z ERROR batch{batch_size=1}:decode:decode{size=1}:decode{size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: Unexpected : CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 2023-08-30T17:15:17.770436Z ERROR text_generation_launcher: Method ClearCache encountered an error. Traceback (most recent call last): File "/opt/conda/bin/text-generation-server", line 8, in sys.exit(app()) File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__ return get_command(self)(*args, **kwargs) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main return _main( File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main rv = self.invoke(ctx) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper return callback(**use_params) # type: ignore File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve server.serve( File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 195, in serve asyncio.run( File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete self.run_forever() File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever self._run_once() File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once handle._run() File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run self._context.run(self._callback, *self._args) File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method return await self.intercept( > File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 21, in intercept return await response File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor raise error File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor return await behavior(request_or_iterator, context) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 43, in ClearCache self.cache.delete(request.id) File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cache.py", line 26, in delete torch.cuda.empty_cache() File "/opt/conda/lib/python3.9/site-packages/torch/cuda/memory.py", line 133, in empty_cache torch._C._cuda_emptyCache() RuntimeError: CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 2023-08-30T17:15:17.772397Z ERROR batch{batch_size=1}:decode:clear_cache{batch_id=Some(0)}:clear_cache{batch_id=Some(0)}: text_generation_client: router/client/src/lib.rs:33: Server error: Unexpected : CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 2023-08-30T17:15:17.772506Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=172.0.2.140:6000 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=curl/7.81.0 otel.kind=server trace_id=0efa0fdf358e997cf7c1e3f4bb1c1d0f}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(100), top_p: Some(0.95), typical_p: Some(0.999), do_sample: true, max_new_tokens: 1024, return_full_text: Some(false), stop: ["\nUser:", "\n### Instruction:", "\n### 2 to 3 Paragraph Response (engaging, natural, authentic, descriptive, creative, dialog):", "\nundefined"], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None }}:generate{request=GenerateRequest { inputs: "What is Deep Learning?", parameters: GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(100), top_p: Some(0.95), typical_p: Some(0.999), do_sample: true, max_new_tokens: 1024, return_full_text: Some(false), stop: ["\nUser:", "\n### Instruction:", "\n### 2 to 3 Paragraph Response (engaging, natural, authentic, descriptive, creative, dialog):", "\nundefined"], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "What is Deep Learning?", parameters: GenerateParameters { best_of: None, temperature: Some(0.7), repetition_penalty: Some(1.1), top_k: Some(100), top_p: Some(0.95), typical_p: Some(0.999), do_sample: true, max_new_tokens: 1024, return_full_text: Some(false), stop: ["\nUser:", "\n### Instruction:", "\n### 2 to 3 Paragraph Response (engaging, natural, authentic, descriptive, creative, dialog):", "\nundefined"], truncate: None, watermark: false, details: true, decoder_input_details: false, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:554: Request failed during generation: Server error: Unexpected : CUDA error: device-side assert triggered Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```

Cyberes commented 1 year ago

This is a continuation of https://github.com/huggingface/text-generation-inference/issues/929.

Cyberes commented 1 year ago

After looking at the HeterogeneousTypicalLogitsWarper class, I set typical_p to a lower value (0.998 instead of 0.999) and the server did not crash. My thoughts is that when typical_p is very close to 1, the warper is trying to keep only the most typical tokens but there might not be enough tokens left to generate a new token, especially if max_new_tokens is large. Thus, the tensor gather() fails.

So, I guess keep typical_p below 0.998 and everything works?

Cyberes commented 1 year ago

The main concern is that once the server hits this runtime error, all subsequent requests also trigger this error (even if the parameters are acceptable). Could the server maybe gracefully handle this so that it doesn't need to be restarted whenever a bad parameter is sent?

Narsil commented 1 year ago

This would be super nice indeed.

However device-side assert will always spoil the current process rendering the process unrecoverable. The best thing we could do is prevent that from happening altogether.

That will likely involve doing some check on the tensor. As long as the check stays on CPU (like counting numel() ) then it would be very easy to do. Anything that would hit the GPU would slow everything down for when things DO work, which we desperately try to avoid.

Unfortunately, I don't have access to A4000 and cannot reproduce on the cards I do have access to. It could also be a bug in triton here (since this models uses triton and not exllama).

Cyberes commented 1 year ago

Vast.ai has good prices, I think $5 will get you over a day of A4000 runtime. Maybe HF could sponsor?