dboshardy opened this issue 2 years ago
Thanks for reporting. However, we need more information to start investigating. Could you please give more details?
I'm happy to help, but you'll have to be more specific. What details would you need?
InferenceSession supports concurrent Run calls as the operator kernels used to execute the model are stateless. There's no internal waiting for existing requests to complete before starting the next one.
FWIW the execution should be deterministic. Are you able to capture the input used in a failed request to see if it fails every time?
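As a rough sketch of what that capture could look like (the file name and model path are hypothetical, not from this issue):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical session, just to illustrate the idea.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

def run_and_capture_on_failure(inputs: dict):
    try:
        return sess.run(None, inputs)
    except Exception:
        # Persist the exact failing inputs so they can be replayed offline.
        np.savez("failed_request_inputs.npz", **inputs)
        raise

# Later, in an idle process, replay the captured inputs to check determinism:
# data = np.load("failed_request_inputs.npz")
# sess.run(None, {name: data[name] for name in data.files})
```

Replaying the saved arrays in an idle process should show whether the failure follows the input or the load.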
According to the error it's coming from an InstanceNormalization node and not a Conv node. There's not much use of SafeInt in the InstanceNormalization kernel; the only places I could see were around checking the total size of the input tensor. Without seeing the model it's hard to say what else could be off.
Any chance the InstanceNormalization node is early in the model and is consuming a model input that is perhaps being overwritten outside of ORT? We treat model inputs as constant and do NOT copy them. Due to that, if you modify the buffer externally whilst the request is running you could affect nodes that consume model inputs.
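To rule that out, one hedged option is to hand ORT private copies of the request arrays so nothing outside the session can mutate the buffers while Run is in flight (a minimal sketch; the wrapper name is illustrative):

```python
import numpy as np

def run_with_copied_inputs(sess, inputs: dict):
    # ORT treats model inputs as constant and does not copy them, so give it
    # private copies in case the caller reuses or overwrites the buffers.
    safe_inputs = {name: np.array(arr, copy=True) for name, arr in inputs.items()}
    return sess.run(None, safe_inputs)
```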
@skottmckay Apologies for the late reply. Yes, I have captured inputs that error, and the errors are not deterministic: the same image put through the same model will error under heavy load, but under light load there is no error.
I've also noticed that, once the model errors in this manner, reloading the model by deleting the inference session (del in Python, in this instance) and creating a new one doesn't help: all subsequent requests, as far as I can tell, error in the same manner.
The error happens in different nodes each time it initially pops up. It typically occurs in one of the many Conv nodes. This instance was a bit unusual.
It's definitely possible this is a weird interaction under heavy load between the inference session and how uvicorn or FastAPI handle concurrency in the Docker container.
I have a similar problem.

System Information

File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 373, in
File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 75, in
File "/usr/local/lib/python3.8/dist-
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running FusedMatMul node. Name:'/bert/encoder/layer.0/attention/self/MatMul_FusedMatMulAndScale' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
I'm facing a similar issue when multiple (2 is usually enough to reproduce) processes of onnxruntime run in the same Docker container.
Curiously this does not happen with all the models, only some -- is there some way to debug this reliably? Could it be connected to the way the model is converted to onnx or how it is being optimised?
I faced a very similar issue; in my case, it happened only when the GPU memory was full.
@zhanghuanrong Faced the same issue on v1.14.1 (upgraded from v1.10.0) with ~30 models loaded (some are constantly loaded and unloaded to/from the GPU, other models persist on the GPU).
onnx runtime error 6: Non-zero status code returned while running Conv node. Name:'/model.8/cv1/conv/Conv' Status Message: /workspace/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
This error occurs on a model I don't unload from the GPU. And when it happens, inference on this specific model will ALWAYS fail afterwards, but other models are not affected. However, if I restart everything, this model works fine again (and most of the time it works well). It's not easy to reproduce this bug, but I saw several other people facing the same issue above. Might it be due to some rare race condition?
+cc @skottmckay
It may be related to GPU memory. I am using a stack of Ubuntu 20.04, gunicorn, FastAPI, and onnxruntime-gpu with 8 workers within a container to deploy a model inference service. When workers=8 fills the GPU memory to capacity, the same error occurs; when using workers=7 and leaving some spare GPU memory, the issue disappears.
I finally realized this problem is related to the workload.
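If the root cause here is several workers exhausting the device, one thing that might help (a sketch only, assuming onnxruntime-gpu with the CUDA execution provider; the 2 GiB figure is an arbitrary example) is capping each session's arena via the CUDA provider options:

```python
import onnxruntime as ort

# Cap the CUDA memory arena per session so multiple gunicorn workers
# sharing one GPU leave some headroom.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,  # 2 GiB, example value
    "arena_extend_strategy": "kSameAsRequested",
}
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```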
Hello everyone, has anyone found a solution to this problem? Any updates, please! I am facing the same problem. I am using the stack:
- Ubuntu 20.04
- triton inference server 23.04
23.06 is still failing.
[StatusCode.INTERNAL] onnx runtime error 6: Non-zero status code returned while running Conv node. Name:'/model.8/cv1/conv/Conv' Status Message: /workspace/onnxruntime/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
Thank you for your response, I will follow the issue you have mentioned.
What is tricky about this bug is that the model works fine on some images but fails on others.
I ran into this error when I tried to run inference on too many samples, but that may be a distinct issue.
I tried to repro using a couple of inference sessions with a relatively large model, multiple threads running requests concurrently, and each request cycling through different batch sizes so the state in the Conv implementation is constantly changing due to the different input shape. This was based on Conv being the most commonly reported operator involved, along with GPU memory pressure and concurrency.
I didn't see any SafeIntOnOverflow errors. Eventually it would cause OOM errors with CUDA, but the expected handling kicked in there (CUDNN_STATUS_ALLOC_FAILED or BFCArena::AllocateRawInternal depending on what the memory allocation was using).
It's possible there's some specific attributes of the Conv node required to trigger this, but I need someone to contribute a model that can repro the issue to investigate further.
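For reference, the harness looked roughly like the sketch below (model path, input name, and shapes are placeholders; it assumes a CUDA model with a dynamic batch dimension):

```python
import itertools
import threading

import numpy as np
import onnxruntime as ort

# Placeholder model: any CUDA model with a dynamic batch dimension will do.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name  # assumes a single NCHW image input

def worker(thread_id: int) -> None:
    # Cycle through different batch sizes so the Conv state keeps changing shape.
    for batch in itertools.islice(itertools.cycle([1, 2, 4, 8, 16]), 1000):
        x = np.random.rand(batch, 3, 224, 224).astype(np.float32)
        try:
            sess.run(None, {input_name: x})
        except Exception as exc:  # look for SafeIntOnOverflow here
            print(f"thread {thread_id}: {exc}")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```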
Not sure if related.
I set 'memory.enable_memory_arena_shrinkage' to 'cpu:0;gpu:0' for all inference runs.
This bug triggers 100% of the time within 48 hours (with maybe 12 hours of high workload). I think I tried at least 5 times.
I added the memory shrinkage call but still can't reproduce.
BTW that setting is not intended for usage on every single request (hence it's a RunOption not SessionOption). There is a perf cost to doing it on every request as there is locking involved to do the shrink. You're also causing overhead by constantly freeing/allocating memory as CUDA memory allocation/free is slow.
Recommendation would be to shrink after loading the model and doing a warmup query. If you have concurrent traffic it may also be beneficial to shrink at times as the arena size will be a high-water mark (i.e. size allocated will be for the maximum number of concurrent requests seen).
Note that the shrink behavior also differs based on the extend strategy.
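In other words, the recommended pattern would look roughly like this (model path and input shape are placeholders):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name

# Warmup query with a representative input so the arena grows to a realistic size.
warmup = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder shape
sess.run(None, {input_name: warmup})

# Shrink the arenas once after warmup instead of on every request.
shrink_opts = ort.RunOptions()
shrink_opts.add_run_config_entry("memory.enable_memory_arena_shrinkage", "cpu:0;gpu:0")
sess.run(None, {input_name: warmup}, run_options=shrink_opts)

# Subsequent requests use plain Run calls with no shrink option.
```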
I encountered the same problem:
Caused by: java.lang.Exception: ai.onnxruntime.OrtException: Error code - ORT_RUNTIME_EXCEPTION - message: Non-zero status code returned while running Softmax node. Name:'/xxxxxxxxx/xxxxxxxxxx/xxxxxxxxxxxxx/Softmax' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
The error is occasional, and it starts to appear when the available memory decreases. Other models work fine in service.
I encountered this error recently:
Exception: D:\a\_work\1\s\onnxruntime\core/common/safeint.h:17 SafeIntExceptionHandler<class onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow Integer overflow
The error happens when I request the output data from the session.
session.Run(Ort::RunOptions{ nullptr }, input_names.data(), input.data(), input_names.size(), output_names_raw_ptr.data(), outputTensors.data(), output_names_raw_ptr.size());
If I don't ask for the output data, then it works fine.
session.Run(Ort::RunOptions{ nullptr }, input_names.data(), input.data(), input_names.size(), output_names_raw_ptr.data(), output_names_raw_ptr.size());
I have solved the problem that produces
RUNTIME_EXCEPTION : Non-zero status code returned while running InstanceNormalization node. Name:'InstanceNormalization_15' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
In my case it happened because the pth-to-onnx conversion and the onnx quantization were done with different versions of onnxruntime, i.e. using 1.14 to convert pth to onnx and 1.18 to convert onnx to quantized onnx.
Describe the bug
When running a Docker container running uvicorn + FastAPI + an ORT inference session with a single model on a single uvicorn worker, handling at most 3 requests at a time, we regularly see errors in the ORT session, exclusively:
RUNTIME_EXCEPTION : Non-zero status code returned while running InstanceNormalization node. Name:'InstanceNormalization_15' Status Message: /onnxruntime_src/onnxruntime/core/common/safeint.h:17 static void SafeIntExceptionHandler<onnxruntime::OnnxRuntimeException>::SafeIntOnOverflow() Integer overflow
The node might not always be the same; sometimes it's one of the many Conv nodes in the network.
Urgency
Prevents us from going to production.
System information
Expected behavior
No errors thrown. I was under the impression the model inference session would wait for a given prediction to finish before accepting another.
Additional context
I'm wondering if this is arising due to too many inference runs happening concurrently. I assumed, based on reading the documentation, that the InferenceSession would queue up runs, but I think I am mistaken. Looking for clarification on that if possible.
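If the goal is to guarantee at most N predictions in flight per process, one sketch (the endpoint, model path, and dummy input are illustrative, and it assumes the FastAPI setup described above) is to guard the blocking Run call with a semaphore:

```python
import asyncio

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = sess.get_inputs()[0].name

limiter = asyncio.Semaphore(3)  # allow at most 3 predictions in flight

@app.post("/predict")
async def predict():
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input
    async with limiter:
        # Run the blocking call in a worker thread so the event loop stays responsive.
        outputs = await run_in_threadpool(sess.run, None, {input_name: x})
    return {"output_shape": list(outputs[0].shape)}
```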