mg-yolo-enterprises opened this issue 1 year ago
If your model has operators that need accumulation (like Softmax, LayerNormalization, etc.), the CUDA result could differ slightly when the partition changes. Even without multi-threading, you can observe this by running the same inputs multiple times and measuring the variance of the outputs.
I suspect multithreaded GPU prediction might cause the GPU to change its partition more frequently. For example, when some cores are in use by another thread, the GPU might schedule fewer cores for new requests, which could cause a minor change in accuracy.
Another possible cause is convolution algorithm tuning, which might depend on free GPU memory. With multi-threading, each thread might have less GPU memory available since some is consumed by the other threads, so the selected convolution algorithm might change, because some algorithms need more memory to run. Unlike PyTorch, ORT currently has no option to force a deterministic algorithm, so nondeterministic algorithms might be selected.
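The accumulation point above can be illustrated with plain C#: floating-point addition is not associative, so a different partitioning of the same reduction (as happens when the GPU reschedules cores) can legitimately produce slightly different sums. A minimal, self-contained sketch:

```csharp
using System;

class AccumulationOrderDemo
{
    static void Main()
    {
        float a = 1e8f, b = -1e8f, c = 1f;

        // Same three values, two different accumulation orders.
        float leftToRight = (a + b) + c; // (1e8 - 1e8) + 1 = 1
        float rightFirst  = a + (b + c); // -1e8 + 1 rounds back to -1e8, so sum = 0

        Console.WriteLine(leftToRight); // 1
        Console.WriteLine(rightFirst);  // 0
    }
}
```

This is an extreme toy case; in a real Softmax or LayerNormalization reduction the divergence is tiny, which is why partition changes normally cause only slight output variance rather than flipped predictions.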
I appreciate your response! Unfortunately I'm not sure it gets to the root of this issue, because what I'm experiencing are not slight differences.
Here's an experiment I set up this morning:
In the cases where an incorrect prediction has been given, if I re-run the exact same tensor a second time, the result is correct. Here's an example...
For the following block of code, with a breakpoint set as shown:
...the first call to session.Run() produces a completely different result than the second. The first is incorrect, the second is correct:
In the screenshot above, the first call to session.Run() results in a 96% score for class 1 of 2, which is wrong. Calling session.Run() a second time with the same List of inputs produces the correct result.
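A hedged reconstruction of that check (variable names are assumed, since the original code block is only shown in the screenshot; `inputs` is the identical collection passed both times):

```csharp
// Run the identical inputs twice and compare. In the failing cases the
// first Run() disagreed with the second (which was correct).
using var first = session.Run(inputs);
float[] firstScores = first.First().AsEnumerable<float>().ToArray();

using var second = session.Run(inputs);
float[] secondScores = second.First().AsEnumerable<float>().ToArray();

// On a mispredicted frame, firstScores assigned ~96% to the wrong class
// while secondScores matched the expected (~95% correct-class) result.
```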
I'm only able to catch this behavior when running thousands of inputs on the GPU using Parallel.ForEach.
Note that C#'s Parallel.ForEach accepts a MaxDegreeOfParallelism option, which caps the number of concurrent threads. If it is not set, the loop runs as fast as possible and the problems described above appear. With MaxDegreeOfParallelism set to 1 or 2, I never encountered an incorrect prediction; any value of 3 or greater (or no value set) produces some incorrect predictions, and their number grows as MaxDegreeOfParallelism increases.
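As a minimal sketch of the throttled loop (the `RunInference` helper and `imagePaths` collection are placeholders, not code from this issue):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

class ThrottledLoop
{
    static void Main()
    {
        var imagePaths = new List<string> { /* ... dataset paths ... */ };

        // Cap concurrency at 2: the setting that avoided incorrect predictions.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };

        Parallel.ForEach(imagePaths, options, path =>
        {
            RunInference(path); // placeholder for preprocessing + session.Run()
        });
    }

    static void RunInference(string path)
    {
        // image load, resize, tensor construction, session.Run() would go here
    }
}
```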
It looks like there's plenty of GPU free memory while running:
Are there any reasons why a tensor passed to session.Run(), which results in an incorrect prediction, could produce a very different (correct) prediction when passed a second time? Keep in mind that the incorrect-prediction behavior disappears when running on the CPU, or when MaxDegreeOfParallelism is limited to 1 or 2.
It's desirable to solve this problem: with GPU and 2 concurrent threads the frame rate is around 44 fps, while allowing unlimited parallelism reaches 189 fps, but with about 1 incorrect prediction per 500 frames.
@mg-yolo-enterprises, could you try the following: create multiple inference sessions of the model and run inference on these sessions in parallel, with no parallelism within each session (sequential inference of images within a session). If that reproduces the accuracy loss, the root cause is what I described previously.
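The suggested experiment could be sketched like this (the session count, `imagePaths`, and the `RunInference` helper are illustrative, not from the thread). Parallelism is across sessions only; each session processes its shard strictly sequentially:

```csharp
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;

int sessionCount = 4; // illustrative choice

// One independent session per worker, each with its own CUDA provider options.
var sessions = Enumerable.Range(0, sessionCount)
    .Select(_ => new InferenceSession("model.onnx",
        SessionOptions.MakeSessionOptionWithCudaProvider()))
    .ToArray();

// Round-robin split of the dataset: one shard per session.
var shards = imagePaths
    .Select((path, i) => (path, slot: i % sessionCount))
    .GroupBy(x => x.slot, x => x.path);

Parallel.ForEach(shards, shard =>
{
    var session = sessions[shard.Key];
    foreach (var path in shard)       // sequential within this session
        RunInference(session, path);  // placeholder helper
});
```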
I ended up putting a simple lock around the call to session.Run(), which eliminated the accuracy reduction I was experiencing during parallel inference without sacrificing any performance, probably because the preprocessing steps are my main bottleneck.
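The workaround described above might look like this (a minimal sketch; only the GPU call is serialized, so preprocessing in the parallel loop is unaffected):

```csharp
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;

class SerializedRunner
{
    private static readonly object _runLock = new object();

    // Image load, resize, and tensor construction happen outside the lock;
    // only session.Run() is forced to execute one call at a time.
    public static IDisposableReadOnlyCollection<DisposableNamedOnnxValue> RunSerialized(
        InferenceSession session, IReadOnlyCollection<NamedOnnxValue> inputs)
    {
        lock (_runLock)
        {
            return session.Run(inputs);
        }
    }
}
```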
Describe the issue
A dataset of 20k images was used to perform transfer learning on a MobileNetV2 TF image classifier using https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier, which was then converted to ONNX format using https://github.com/onnx/tensorflow-onnx
The resulting model is being consumed using code provided in https://onnxruntime.ai/docs/get-started/with-csharp.html
The model performs tremendously well, achieving 100% accurate predictions over the entire dataset. Individual prediction scores average 95% for all images.
To improve the inference speed, the following changes were made:
Based on the answer provided to https://github.com/microsoft/onnxruntime/issues/114 I assumed the InferenceSession was threadsafe and thus didn't worry about locking it or creating a session pool.
The resulting speed increase is significant, as shown below:
Times listed above on Intel i7-12850HX, NVIDIA RTX A2000 Laptop GPU. Times include loading image from file, Bitmap resize operation, construction of Tensor, and call to Session.Run().
Surprisingly, it was discovered that only the first 3 scenarios listed above result in 100% accuracy of all model predictions. In the fourth case (GPU and Parallel.ForEach), a fairly random number of predictions will be false negatives or false positives. The number is generally in the single digits (out of 20,000 total predictions), but not consistent from one run to the next. The score given to an incorrect prediction is always around 50%, whereas the average score for accurate predictions is in the mid 90s.
Is there any reason why running many predictions in parallel while using the GPU could produce a prediction every so often that is wrong?
To reproduce
Model: model.onnx.zip
Code provided below:
Urgency
No response
Platform
Windows
OS Version
Windows 11 22H2
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
C#
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 11.6, cuDNN 8.5.0.96