mg-yolo-enterprises opened this issue 1 year ago
If your model has operators that need accumulation (like Softmax, LayerNormalization, etc.), the CUDA result could differ slightly when the partition changes. Even without multi-threading, you can observe this by running the same inputs multiple times and measuring the variance of the outputs.
I suspect multithreaded GPU prediction might cause the GPU to change its partition more frequently. For example, when some cores are in use by another thread, the GPU might schedule fewer cores for new requests, which could cause a minor change in accuracy.
Another possible cause is convolution algorithm tuning, which might depend on free GPU memory. With multi-threading, each thread might have less GPU memory available since some is consumed by the other threads, so the selected convolution algorithm might change, because some algorithms need more memory to run. Unlike PyTorch, ORT currently has no option to force a deterministic algorithm, so nondeterministic algorithms might be selected.
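The accumulation point above can be illustrated with plain C#: floating-point addition is not associative, so a different partitioning of the same reduction (as happens when the GPU reschedules cores) can legitimately produce slightly different sums. A minimal, self-contained sketch:

```csharp
using System;

class AccumulationOrderDemo
{
    static void Main()
    {
        float a = 1e8f, b = -1e8f, c = 1f;

        // Same three values, two different accumulation orders.
        float leftToRight = (a + b) + c; // (1e8 - 1e8) + 1 = 1
        float rightFirst  = a + (b + c); // -1e8 + 1 rounds back to -1e8, so sum = 0

        Console.WriteLine(leftToRight); // 1
        Console.WriteLine(rightFirst);  // 0
    }
}
```

This is an extreme toy case; in a real Softmax or LayerNormalization reduction the divergence is tiny, which is why partition changes normally cause only slight output variance rather than flipped predictions.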
I appreciate your response! Unfortunately I'm not sure it gets to the root of this issue, because what I'm experiencing are not slight differences.
Here's an experiment I set up this morning:
In the cases where an incorrect prediction has been given, if I re-run the exact same tensor a second time, the result is correct. Here's an example...
For the following block of code, with a breakpoint set as shown:
...the first call to session.Run() produces a completely different result than the second. The first is incorrect, the second is correct:
In the screenshot above, the first call to session.Run() results in a 96% score for class 1 of 2, which is wrong. Calling session.Run() a second time with the same List of inputs produces the correct result.
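A hedged reconstruction of that check (variable names are assumed, since the original code block is only shown in the screenshot; `inputs` is the identical collection passed both times):

```csharp
// Run the identical inputs twice and compare. In the failing cases the
// first Run() disagreed with the second (which was correct).
using var first = session.Run(inputs);
float[] firstScores = first.First().AsEnumerable<float>().ToArray();

using var second = session.Run(inputs);
float[] secondScores = second.First().AsEnumerable<float>().ToArray();

// On a mispredicted frame, firstScores assigned ~96% to the wrong class
// while secondScores matched the expected (~95% correct-class) result.
```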
I'm only able to catch this behavior when running thousands of inputs on the GPU using Parallel.ForEach.
Note that C#'s Parallel.ForEach accepts a MaxDegreeOfParallelism option, which caps the number of concurrent threads. If it is not set, the loop runs as fast as possible and the problems described above appear. With MaxDegreeOfParallelism set to 1 or 2, I never encountered an incorrect prediction; any value of 3 or greater (or no value set) produces some incorrect predictions, and their number grows as MaxDegreeOfParallelism increases.
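As a minimal sketch of the throttled loop (the `RunInference` helper and `imagePaths` collection are placeholders, not code from this issue):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

class ThrottledLoop
{
    static void Main()
    {
        var imagePaths = new List<string> { /* ... dataset paths ... */ };

        // Cap concurrency at 2: the setting that avoided incorrect predictions.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        Parallel.ForEach(imagePaths, options, path =>
        {
            RunInference(path); // placeholder for preprocessing + session.Run()
        });
    }

    static void RunInference(string path)
    {
        // image load, resize, tensor construction, session.Run() would go here
    }
}
```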
It looks like there's plenty of GPU free memory while running:
Are there any reasons why a tensor passed to session.Run(), which results in an incorrect prediction, could produce a very different (correct) prediction when passed a second time? Keep in mind that the incorrect-prediction behavior disappears when running on the CPU, or when MaxDegreeOfParallelism is limited to 1 or 2.
It's desirable to solve this problem: with GPU and 2 concurrent threads the frame rate is around 44 fps, while allowing unlimited parallelism reaches 189 fps, but with about 1 incorrect prediction per 500 frames.
@mg-yolo-enterprises, could you try the following: create multiple inference sessions of the model and run inference on these sessions in parallel, with no parallelism within each session (sequential inference of images within a session). If that reproduces the accuracy loss, the root cause is what I described previously.
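The suggested experiment could be sketched like this (the session count, `imagePaths`, and the `RunInference` helper are illustrative, not from the thread). Parallelism is across sessions only; each session processes its shard strictly sequentially:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;

int sessionCount = 4; // illustrative choice

// One independent session per worker, each with its own CUDA provider options.
var sessions = Enumerable.Range(0, sessionCount)
    .Select(_ => new InferenceSession("model.onnx",
        SessionOptions.MakeSessionOptionWithCudaProvider()))
    .ToArray();

// Round-robin split of the dataset: one shard per session.
var shards = imagePaths
    .Select((path, i) => (path, slot: i % sessionCount))
    .GroupBy(x => x.slot, x => x.path);

Parallel.ForEach(shards, shard =>
{
    var session = sessions[shard.Key];
    foreach (var path in shard)       // sequential within this session
        RunInference(session, path);  // placeholder helper
});
```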
I ended up putting a simple lock around the call to session.Run(), which eliminated the accuracy reduction I was experiencing during parallel inference without sacrificing any performance, probably because the preprocessing steps are my main bottleneck.
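The workaround described above might look like this (a minimal sketch; only the GPU call is serialized, so preprocessing in the parallel loop is unaffected):

```csharp
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

class SerializedRunner
{
    private static readonly object _runLock = new object();

    // Image load, resize, and tensor construction happen outside the lock;
    // only session.Run() is forced to execute one call at a time.
    public static IDisposableReadOnlyCollection<DisposableNamedOnnxValue> RunSerialized(
        InferenceSession session, IReadOnlyCollection<NamedOnnxValue> inputs)
    {
        lock (_runLock)
        {
            return session.Run(inputs);
        }
    }
}
```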
Describe the issue
A dataset of 20k images was used to perform transfer learning on a MobileNetV2 TF image classifier using https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier, which was then converted to ONNX format using https://github.com/onnx/tensorflow-onnx
The resulting model is being consumed using code provided in https://onnxruntime.ai/docs/get-started/with-csharp.html
The model performs tremendously well, achieving 100% accurate predictions over the entire dataset. Individual prediction scores average 95% for all images.
To improve the inference speed, the following changes were made:
Based on the answer provided to https://github.com/microsoft/onnxruntime/issues/114 I assumed the InferenceSession was threadsafe and thus didn't worry about locking it or creating a session pool.
The resulting speed increase is significant, as shown below:
Times listed above on Intel i7-12850HX, NVIDIA RTX A2000 Laptop GPU. Times include loading image from file, Bitmap resize operation, construction of Tensor, and call to Session.Run().
Surprisingly, it was discovered that only the first 3 scenarios listed above result in 100% accuracy of all model predictions. In the fourth case (GPU and Parallel.ForEach), a fairly random number of predictions will be false negatives or false positives. The number is generally in the single digits (out of 20,000 total predictions), but not consistent from one run to the next. The score given to an incorrect prediction is always around 50%, whereas the average score for accurate predictions is in the mid 90s.
Is there any reason why running many predictions in parallel while using the GPU could produce a prediction every so often that is wrong?
To reproduce
Model: model.onnx.zip
Code provided below:
Urgency
No response
Platform
Windows
OS Version
Windows 11 22H2
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
C#
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 11.6, cuDNN 8.5.0.96