Would it be possible for you to provide a repro test case so we can look into it more deeply from our end? For 1.16.3 there shouldn't be any known concurrency issues; we have tests for multi-threaded inference with the TensorRT EP.
I can't provide a test case immediately. Our C++ code is wrapped by Go and I can't provide source to everything. However I could provide an onnx model with a fitting image if that helps?
Do you know whether the whole model can be run by the TRT EP, or whether there are multiple partitions with some of them assigned to the CUDA EP or CPU?
yes, please provide the onnx model.
Yes, the model can be run by TRT. We first thought that this was a TensorRT problem and tested it against the TensorRT backend without onnxruntime code.
I uploaded the onnx model and a simple sample image for forwarding. This is a segmentation task and the output must not be all zeros. I check the output matrix with a countNonZero function and fail the test if it contains only zero values. Switch between the CPU and TensorRT runtime providers and you will see the different behavior. Important: this only occurs with concurrent forwarding. (See the preprocessing sketch after the spec below.)
```
input:
  name: input
  type: float32
  shape: [1, 3, 80, 128]
  shape-format: NCHW
  sub-pixel-layout: BGR
  mean: {r: 127.5, g: 127.5, b: 127.5}
  std: {r: 127.5, g: 127.5, b: 127.5}
output:
  name: output
  type: int64
  shape: [1, 1, 80, 128]
```
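In case it helps, here is a minimal preprocessing sketch matching the spec above. It is illustrative only (not our production code); the helper name is made up and it assumes the planes returned by cv::split are continuous.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Illustrative sketch: resize to 128x80 (width x height), normalize with
// mean = std = 127.5, and repack the BGR HWC image into a [1, 3, 80, 128]
// NCHW float buffer ready to be wrapped as an onnxruntime tensor.
std::vector<float> preprocess(const cv::Mat& bgr) {
    cv::Mat resized;
    cv::resize(bgr, resized, cv::Size(128, 80), 0, 0, cv::INTER_AREA);

    cv::Mat f32;
    resized.convertTo(f32, CV_32FC3);
    f32 = (f32 - cv::Scalar::all(127.5)) / 127.5;  // per-channel mean/std

    // Split into single-channel planes and copy them back to back (BGR order).
    std::vector<cv::Mat> planes(3);
    cv::split(f32, planes);

    std::vector<float> nchw;
    nchw.reserve(3 * 80 * 128);
    for (const cv::Mat& p : planes) {
        // Planes from cv::split are continuous, so row 0 points at all pixels.
        nchw.insert(nchw.end(), p.ptr<float>(0), p.ptr<float>(0) + 80 * 128);
    }
    return nchw;
}
```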
Just tested this against the onnxruntime master code. The issue is still present. Please let me know if you need help reproducing it.
If you can provide a simple C++ program for us to repro, that would be great.
I'm using the "-c" option (number of concurrent runs) of onnx_test_runner (which will be in the release folder when you build from source) to try to repro the issue.
BTW, could you try the TRT EP provider option "trt_cuda_graph_enable" to enable cuda graph and see whether the concurrency issue still exists?
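For reference, here is a sketch of how that option can be set through OrtTensorRTProviderOptionsV2 in C++; the other keys/values are just examples and should be adapted to your setup.

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch: build SessionOptions with the TensorRT EP and cuda graph enabled.
Ort::SessionOptions makeTrtSessionOptions() {
    const OrtApi& api = Ort::GetApi();

    OrtTensorRTProviderOptionsV2* trt_options = nullptr;
    Ort::ThrowOnError(api.CreateTensorRTProviderOptions(&trt_options));

    // Example option values; "trt_cuda_graph_enable" is the one to test here.
    const char* keys[]   = {"device_id", "trt_fp16_enable", "trt_cuda_graph_enable"};
    const char* values[] = {"0",         "1",               "1"};
    Ort::ThrowOnError(api.UpdateTensorRTProviderOptions(trt_options, keys, values, 3));

    Ort::SessionOptions session_options;
    session_options.AppendExecutionProvider_TensorRT_V2(*trt_options);
    api.ReleaseTensorRTProviderOptions(trt_options);
    return session_options;
}
```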
To repro, can we feed all zeros of shape [1, 3, 80, 128] into the input, and if executed concurrently the output matrix will be zero (when it should be nonzero)? Is that the correct understanding? (Then we don't need the .png image you provided?) Some other clarifying questions: do you bind input/output to GPU memory, and do you use cuda streams?
The input image (input.png) is required. Load it, resize it to 128x80, and apply normalization with the mean & std. If all input values are zero, the output will also be zero. Cuda streams are not used, and the input matrices are bound to the CPU (host memory).
Here is a part of our forward code.
```cpp
void ONNXRuntime::forward(const vector<ModelMat>& inputs, vector<ModelMat>& outputs) {
    const int inputSize = inputs.size();
    const int outputSize = outputs.size();

    // Check arguments.
    if (inputSize != inputNames_.size()) {
        throw std::runtime_error(
            "ONNXRuntime::forward: expected " + to_string(inputNames_.size()) + " inputs, got " + to_string(inputSize)
        );
    } else if (outputSize != outputNames_.size()) {
        throw std::runtime_error(
            "ONNXRuntime::forward: expected " + to_string(outputNames_.size()) + " outputs, got " + to_string(outputSize)
        );
    }

    // Create the tensors.
    const vector<Ort::Value> inputTensors = createTensors_(inputs, inputTypes_);
    vector<Ort::Value> outputTensors = createTensors_(outputs, outputTypes_);

    // TODO: remove once tensorrt issue is fixed: https://github.com/microsoft/onnxruntime/issues/19275
    if (runMxRequired_) {
        unique_lock<std::mutex> lock(runMx_);
        session_->Run(runOpts_, inputNamesCstr_.data(), inputTensors.data(), inputSize, outputNamesCstr_.data(), outputTensors.data(), outputSize);
        return;
    }

    // Forward pass.
    // Mutex is not needed: https://github.com/microsoft/onnxruntime/issues/114
    session_->Run(runOpts_, inputNamesCstr_.data(), inputTensors.data(), inputSize, outputNamesCstr_.data(), outputTensors.data(), outputSize);
}
```
createTensors_ creates onnxruntime tensors (CPU) from OpenCV matrices (shared ref), but that's nothing special and just a lot of code.
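For context, a rough sketch of what such a helper boils down to. This is illustrative only, not our actual createTensors_; it assumes continuous CV_32F matrices and that the Mat outlives the tensor.

```cpp
#include <onnxruntime_cxx_api.h>
#include <opencv2/core.hpp>
#include <vector>

// Illustrative sketch: wrap a continuous CV_32F cv::Mat as a CPU tensor of the
// given shape. The Ort::Value only references the Mat's memory (no copy), so
// the Mat must stay alive until Run() has finished.
Ort::Value wrapMatAsTensor(cv::Mat& mat, const std::vector<int64_t>& shape) {
    Ort::MemoryInfo memInfo = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    return Ort::Value::CreateTensor<float>(
        memInfo,
        reinterpret_cast<float*>(mat.data),
        mat.total() * mat.channels(),
        shape.data(),
        shape.size());
}
```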
Here is our crash testing code in Go. It starts NumCPUs routines and feeds the same image to the model/session (normalization happens during the forward).
```go
func init() {
	App.AddCommand(&grumble.Command{
		Name: "crash",
		Help: "crash",
		Run:  runCrash,
		Args: func(a *grumble.Args) {
			a.String("src", "source image")
			a.String("npack", "model")
			a.String("cache", "cache")
		},
	})
}

func runCrash(ctx *grumble.Context) (err error) {
	cl := ctx.App.CloserOneWay()
	defer func() {
		cErr := cl.Close()
		if cErr != nil && err == nil {
			err = cErr
		}
	}()
	go cos.ListenForInterrupts(cl)

	// Our model options.
	lo := nlib.DefaultSemanticSegmentationOptions()
	lo.Backend = nlib.ModelBackend_ONNXRuntime_TensorRT
	lo.GPUID = 0
	lo.ConfThresh = 0.5
	lo.Workers = runtime.NumCPU()
	lo.MatPoolSize = 3 * runtime.NumCPU()
	lo.Interpolation = nlib.InterArea
	lo.ResizeMaskToInput = false
	lo.TRTEnableFP16 = true
	lo.TRTCacheDir = ctx.Args.String("cache")

	// Load the locate model.
	lp, err := npack.OpenFile(cl.CloserTwoWay(), ctx.Args.String("npack"))
	if err != nil {
		return
	}
	model, err := nlib.NewSemanticSegmentationModel(cl.CloserTwoWay(), lp, lo)
	if err != nil {
		return
	}

	src := nlib.NewMat()
	defer src.Free()
	err = src.ReadFromFile(ctx.Args.String("src"))
	if err != nil {
		return
	}
	src.SetReadOnly(true)

	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mask := nlib.NewMat()
			defer mask.Free()
			for i := 0; i < 20; i++ {
				err := model.Segment(src, mask)
				if err != nil {
					panic(err)
				}
				c, err := mask.CountNonZero()
				if err != nil {
					panic(err)
				} else if c == 0 {
					panic(fmt.Errorf("zero"))
				}
			}
		}()
	}
	wg.Wait()
	fmt.Println("took:", time.Since(start))
	return
}
```
It'd help save us time if you could just provide us the input as either an ONNX TensorProto or a numpy array, thanks! Also, you can try enabling the cuda graph suggestion that @chilo-ms mentioned.
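If it helps, here is a hedged sketch of dumping the preprocessed float buffer as an ONNX TensorProto, which is the format onnx_test_runner typically reads from test_data_set_*/input_*.pb. It assumes the onnx protobuf headers are available; the tensor name and output path are placeholders.

```cpp
#include <onnx/onnx_pb.h>
#include <fstream>
#include <string>
#include <vector>

// Sketch: serialize a [1, 3, 80, 128] float buffer as an ONNX TensorProto file.
void saveTensorProto(const std::vector<float>& data, const std::string& path) {
    onnx::TensorProto tensor;
    tensor.set_name("input");
    tensor.set_data_type(onnx::TensorProto::FLOAT);
    for (int64_t d : {1, 3, 80, 128}) tensor.add_dims(d);
    tensor.set_raw_data(data.data(), data.size() * sizeof(float));

    std::string bytes;
    tensor.SerializeToString(&bytes);
    std::ofstream out(path, std::ios::binary);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```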
Sorry I missed @chilo-ms reply. Will be back with more info soon.
We think we've identified the root cause. It's a bit complicated: we had previously fixed it but had to back out that fix due to another issue, and unfortunately we weren't able to resume that work until now. We will now work on implementing a full solution and provide you an ETA. You can help us confirm our hypothesis by building from source at commit c969237 and testing your application again.
@r0l1 Even though the doc says "Multi-threaded usage is currently not supported in cuda graph", we can still try it with the TRT EP. First, it requires a warm-up run to capture the cuda graph, and this should be run by one thread (the main thread). After that, we can run session.Run() concurrently.
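Roughly, the run pattern would look like the sketch below. This is a sketch only: session and tensor setup are omitted, and the same read-only inputs are shared across threads just for brevity.

```cpp
#include <onnxruntime_cxx_api.h>
#include <thread>
#include <vector>

// Sketch: one warm-up Run() on the main thread (graph capture), then
// concurrent Run() calls from worker threads against the same session.
void runConcurrently(Ort::Session& session, Ort::RunOptions& runOpts,
                     const char* const* inputNames, const Ort::Value* inputs, size_t numInputs,
                     const char* const* outputNames, size_t numOutputs,
                     int numThreads) {
    // Warm-up / graph-capture run, single threaded.
    auto warmup = session.Run(runOpts, inputNames, inputs, numInputs, outputNames, numOutputs);
    (void)warmup;

    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < 20; ++i) {
                auto outputs = session.Run(runOpts, inputNames, inputs, numInputs,
                                           outputNames, numOutputs);
                // Inspect `outputs` here, e.g. fail if the mask is all zeros.
            }
        });
    }
    for (auto& w : workers) w.join();
}
```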
@jywu-msft That commit fixes the issue. Thanks for working on a final fix :)
@chilo-ms Thank you for the info. We always did a synchronous warmup straight after loading the model. It would be great if the thread-safety of the session Run were documented in the docs; last time I checked, I only found some issues talking about it.
I started to prepare a small C++ file to reproduce this issue. Do you still need it, or is there a way to add this to the onnxruntime tests?
@r0l1 We have a fix for this concurrency issue and plan to make it into ORT 1.17. Could you also help verify it? Thank you.
@chilo-ms that was fast. Just checked out the latest main branch and the issue is fixed now. Thank you very much!
@r0l1 thanks for reporting the issue and helping us repro/test/validate!
Hi everyone, quick question: which version of onnxruntime did this patch go into? I am working on sherpa-onnx using onnxruntime 1.17.1 and observing some mismatched output with TensorRT enabled, but it holds good with CPU and CUDA. If this patch is in a later version I will have a look and verify!
It should have been in the 1.17 release. Check your TensorRT version; the newest version from NVIDIA has some problems.
Hi @r0l1,
Thanks for your suggestion.
Since I am using TensorRT with onnxruntime-gpu 1.17.1 (per the doc, TensorRT 8.6 is compatible), apart from that I have just installed the below in my Dockerfile:
```
Docker image: FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
libcudnn8=8.7.0.84-1+cuda11.8
RUN apt-get -y install libnvinfer8=8.6.1.6-1+cuda11.8 \
    libnvinfer-plugin8=8.6.1.6-1+cuda11.8 \
    libnvonnxparsers8=8.6.1.6-1+cuda11.8
```
My understanding was that onnxruntime depends on these TensorRT libs, therefore I am not installing TensorRT separately; I think this should hold good. Let me know if this is otherwise.
Since 1.17 had this patch, 1.17.1 should be holding good. Any thoughts on which TensorRT version this fix was made for would be of great help!
Describe the issue
Doing concurrent session->Run calls without a mutex lock causes corrupt output matrices. The CPU and CUDA runtime providers work as expected; using the TensorRT provider causes this issue.
To reproduce
Forward matrices concurrently to one session without a mutex lock, using the TensorRT runtime provider. The output matrices are corrupt and invalid.
Urgency
This issue is a release blocker and we must downgrade to ONNXRuntime 1.12.1 with TensorRT 8.4.3.
Platform
Linux
OS Version
linux 6.7.0 kernel
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.16.3
ONNX Runtime API
C++
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 11.8.0 TensorRT 8.6.1.6