triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

python backend: cuDNN error: CUDNN_STATUS_MAPPING_ERROR and following CUDA error: an illegal memory access was encountered #5779

Open taoye114 opened 1 year ago

taoye114 commented 1 year ago

**Description**
We use tritonserver with the Python backend to deploy a customized Stable Diffusion model running on PyTorch.

We have discovered that our tritonserver has a non-deterministic bug:

  1. First it throws a cuDNN error:

         in _conv_forward
             return F.conv2d(input, weight, bias, self.stride,
         RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

  2. All following requests then fail with a CUDA illegal memory access error:

         return torch._C._cuda_synchronize()
         RuntimeError: CUDA error: an illegal memory access was encountered
         CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.



**Triton Information**
We are using the tritonserver:21.07-py3 container, but with a self-installed (pip) PyTorch version: 2.0.0+cu117.

**To Reproduce**
This seems to be a non-deterministic bug; we do not have a case that reproduces it.

Jack47 commented 1 year ago

Currently we have not found a way to fix it. We would like to know how to mitigate this error:

  1. How can the readiness probe detect this situation, so that Kubernetes stops sending requests to the affected pod? (See the probe sketch below.)
  2. Any other suggestions?
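
One way to approach question 1 is an exec-style probe that issues a tiny real inference instead of only polling the ready endpoint, since a corrupted CUDA context does not necessarily flip Triton's ready flag while subsequent requests keep failing. This is only a sketch, not an official Triton feature; the model name, input name, shape, and datatype below are placeholders for the actual deployment.

```python
#!/usr/bin/env python3
"""Exec-style readiness probe sketch: fail once the model can no longer serve.

Because the sticky CUDA error makes every subsequent request fail, a tiny
real inference is enough to detect it. MODEL_NAME/INPUT_* are placeholders.
"""
import sys

import numpy as np
import tritonclient.http as httpclient

MODEL_NAME = "pipeline"                    # placeholder: the deployed model's name
INPUT_NAME = "INPUT0"                      # placeholder: one of its input tensors
INPUT_SHAPE, INPUT_DTYPE = (1, 4), "FP32"  # placeholder shape/datatype


def main() -> int:
    try:
        client = httpclient.InferenceServerClient(url="localhost:8000")
        if not client.is_model_ready(MODEL_NAME):
            return 1
        probe_input = httpclient.InferInput(INPUT_NAME, list(INPUT_SHAPE), INPUT_DTYPE)
        probe_input.set_data_from_numpy(np.zeros(INPUT_SHAPE, dtype=np.float32))
        client.infer(MODEL_NAME, inputs=[probe_input])
        return 0
    except Exception as exc:  # any failure means "not ready"
        print(f"readiness probe failed: {exc}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    sys.exit(main())
```

Wiring a script like this into a Kubernetes exec readinessProbe would stop traffic to the pod once the sticky error appears; pairing it with a livenessProbe would also get the pod restarted, which is the only real recovery.
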
tanmayv25 commented 1 year ago

This issue looks similar to #5765. Frameworks like PyTorch do not validate tensor values before sending them for execution. For a model that does not handle unexpected values (out-of-bounds indices, for example), this can lead to corruption of the CUDA context and hence a sticky error that can fail other inference executions as well. There is a single CUDA context per Triton process. The solution is to ensure the model properly handles tensor data whose indices might overflow.
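
As an illustration of that kind of pre-validation, here is a minimal sketch. It assumes, hypothetically, that the unexpected values are out-of-range token ids feeding an embedding lookup; the helper and argument names are placeholders, not part of the reported model.

```python
import torch


def checked_embedding_lookup(embedding: torch.nn.Embedding, token_ids: torch.Tensor) -> torch.Tensor:
    """Reject out-of-range ids before they reach a CUDA kernel.

    An out-of-bounds index triggers a device-side assert; after that the CUDA
    context is poisoned and later, unrelated kernels in the same process can
    fail with errors such as CUDNN_STATUS_MAPPING_ERROR or
    "an illegal memory access was encountered".
    """
    vocab_size = embedding.num_embeddings
    if torch.any(token_ids < 0) or torch.any(token_ids >= vocab_size):
        # Fail this single request with a normal Python exception instead of
        # corrupting the context for every request that follows.
        raise ValueError(
            f"token ids must lie in [0, {vocab_size}), got "
            f"min={int(token_ids.min())}, max={int(token_ids.max())}"
        )
    return embedding(token_ids)
```

Raising a regular exception keeps the failure scoped to the offending request, whereas letting the bad index reach the GPU takes down the context shared by all executions in the process.
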

We can investigate whether we can detect a corrupted CUDA context and mark the server down.

Jack47 commented 1 year ago

Any updates on this? We really need this.

tanmayv25 commented 1 year ago

> We have discovered that our tritonserver has a non-deterministic bug.

@taoye114 Can you help us get a minimal reproducer? Most likely the issue appears when the model is fed invalid data. Can you check whether the bug can be made deterministic by feeding the model the exact same set of data?

@Jack47 If my hunch is correct, then the model should be modified to handle the invalid data so that it maps to a proper index. This would allow all the PyTorch operations to run without issue. As a longer-term goal, Triton may add a CUDA context health check to update the model readiness.

taoye114 commented 1 year ago

@tanmayv25 Hi, it seems there is a long way to go before Triton can officially handle this CUDA context issue.

Could you give us some hints, or perhaps some beta code for us to try?

  1. A best practice for checking invalid data? Our online model is rather complex, so it is impossible to check the input line by line.
  2. Any way to detect a corrupted CUDA context, and perhaps replace it?

tanmayv25 commented 1 year ago

> 1. A best practice for checking invalid data? Our online model is rather complex, so it is impossible to check the input line by line.

You can catch the cuDNN exception and dump the request input data to a file within the model.py implementation itself. Then you can try running the same scenario outside Triton in a separate Python script. Why the data is considered invalid depends highly on the model architecture.
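
A minimal sketch of what that could look like in a Python backend model.py; the input names ("INPUT0", "INPUT1"), the output name, the dump directory, and the self._run_model helper are placeholders standing in for the existing inference code.

```python
import os
import time

import numpy as np
import triton_python_backend_utils as pb_utils

DUMP_DIR = "/tmp/failed_requests"  # hypothetical dump location


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # "INPUT0"/"INPUT1" are placeholders for the model's real input names.
            inputs = {
                name: pb_utils.get_input_tensor_by_name(request, name).as_numpy()
                for name in ("INPUT0", "INPUT1")
            }
            try:
                # self._run_model stands in for the existing PyTorch inference code.
                output = self._run_model(inputs)
                responses.append(pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("OUTPUT0", output)]))
            except RuntimeError as exc:
                # Persist the offending inputs so the failure can be replayed
                # outside Triton in a standalone PyTorch script.
                os.makedirs(DUMP_DIR, exist_ok=True)
                dump_path = os.path.join(DUMP_DIR, f"request_{int(time.time() * 1000)}.npz")
                np.savez(dump_path, **inputs)
                responses.append(pb_utils.InferenceResponse(
                    error=pb_utils.TritonError(f"inference failed, inputs dumped to {dump_path}: {exc}")))
        return responses
```

A separate script can then np.load the saved .npz file and call the model directly in plain PyTorch to check whether the same inputs reproduce the cuDNN error outside Triton.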

> 2. Any way to detect a corrupted CUDA context, and perhaps replace it?

The only way to recover from such a sticky CUDA context error is to restart the application (which is Triton in this case).
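
For detection (as opposed to recovery), a small smoke test is usually enough, because once the context is poisoned even trivial kernel launches fail. A sketch, with a helper name of my own choosing:

```python
import torch


def cuda_context_is_healthy(device: str = "cuda:0") -> bool:
    """Best-effort check for a sticky/corrupted CUDA context.

    This only detects the condition; the context itself cannot be repaired
    from within the same process.
    """
    try:
        x = torch.ones(8, device=device)
        _ = (x * 2.0).sum().item()     # forces a kernel launch and a copy back to host
        torch.cuda.synchronize(device)
        return True
    except RuntimeError:
        return False
```

A readiness or liveness check could call something like this periodically and take the instance out of rotation, or restart it, once it returns False.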

Additionally, I found a similar issue on the PyTorch side: https://github.com/pytorch/pytorch/issues/27588. It might be resolved by using matching versions of PyTorch and CUDA.

MatthieuToulemont commented 1 year ago

I have recently hit the same issue. Proper handling of sticky CUDA errors would be tremendous.

troycheng commented 2 weeks ago

We still see this problem in version 24.06: the server does not exit when a backend hits an unrecoverable error.