microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

No Performance Benefit from OnnxRuntime.GPU in ML.NET #10142

Open noumanqaiser opened 2 years ago

noumanqaiser commented 2 years ago

Describe the bug

I have an image classification model that was trained using Microsoft CustomVision and exported as an ONNX model. I am able to run inferencing with this model with an average inference time of around 45ms. My computer is equipped with an NVIDIA GPU and I have been trying to reduce the inference time.

My application is a .NET console application written in C#.

I tried the OnnxRuntime.GPU NuGet package version 1.10 and followed the steps at the link below to install the relevant CUDA Toolkit and cuDNN packages (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements). Despite this, I have not seen any performance improvement when using OnnxRuntime.GPU over OnnxRuntime. The average inference time is similar, varying between 45 and 60ms.

Urgency

I have been trying various options to improve inference performance, but none of them seem to be working. Urgent support would be appreciated.

System information

- OS: Windows 10 Home 21H1, Dell Inspiron 5406, Core i7 1165G7, 16GB RAM, Nvidia MX330 2GB GPU
- ONNX Runtime installed from: NuGet
- ONNX Runtime version: 1.10.0
- Program: written in C#, .NET 5, Console App, Visual Studio 2019 v16.10.3
- CUDA/cuDNN version: CUDA Toolkit 11.4.3, cuDNN 8.2.2.26 for CUDA 11.4
- GPU model and memory: Nvidia MX330 with 2GB memory

To Reproduce

I use the following class to set up an ONNX scoring engine:

```csharp
public class OnnxModelScorer
{

public class ImageInputData
{
    [ImageType(300, 300)]
    public Bitmap Image { get; set; }
}

public class ImagePrediction
{

    [ColumnName("model_output")]
    public float[] PredictedLabels;
}

PredictionEngine<ImageInputData, ImagePrediction> predictionEngine;
ModelMetadataPropertiesClass modelprops;
Dictionary<int, string> ModelLabels = new Dictionary<int, string>();

public void SetupPredictionEngine(string modelFolderPath, out string errors)
{
    errors = "";
    predictionEngine = null;
    try
    {
        var mlContext = new MLContext();

        modelprops = LoadProperties(modelFolderPath + "metadata_properties.json", out string error);

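        // Build the ML.NET pipeline: resize the bitmap to the size the model expects,
        // extract raw pixels into the "data" column, then score with the ONNX model.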
        var pipeline = mlContext.Transforms
                        .ResizeImages("image", modelprops.CustomVisionPreprocessTargetWidth, modelprops.CustomVisionPreprocessTargetHeight, nameof(ImageInputData.Image), ImageResizingEstimator.ResizingKind.Fill)
                        .Append(mlContext.Transforms.ExtractPixels("data", "image"))
                        .Append(mlContext.Transforms.ApplyOnnxModel("model_output", "data", modelFolderPath + @"model.onnx"));

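        // Fit on an empty enumerable just to materialize the transformer chain;
        // none of these transforms are trainable, so no training happens here.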
        var data = mlContext.Data.LoadFromEnumerable(new List<ImageInputData>());
        var model = pipeline.Fit(data);

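        // Note: PredictionEngine is not thread-safe; if predictions will run concurrently,
        // create one engine per thread (or use PredictionEnginePool in a service scenario).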
        predictionEngine = mlContext.Model.CreatePredictionEngine<ImageInputData, ImagePrediction>(model);

        string[] labels = File.ReadAllText(modelFolderPath + @"labels.txt").Split('\n');

        int i = 0;
        foreach (var label in labels)
        {
            ModelLabels.Add(i, label);
            i++;
        }
    }
    catch (Exception ex)
    {
        errors = "Model Loading Failed: " + ex.ToString();
    }

}

public PredictionResultClass GetModelPrediction(Bitmap sample, out string error)
{
    PredictionResultClass pr = new PredictionResultClass();
    error = "";
    if (predictionEngine != null)
    {
        var input = new ImageInputData { Image = sample };

        var prediction = predictionEngine.Predict(input);
        Dictionary<int, PredictionResultClass> predictionResults = new Dictionary<int, PredictionResultClass>();
        int indexofMaxProb = -1;
        float maxProbability = 0;
        for (int i = 0; i < prediction.PredictedLabels.Count(); i++)
        {
            predictionResults.Add(i,new PredictionResultClass() { Label = ModelLabels[i], probability = prediction.PredictedLabels[i] });

            if(prediction.PredictedLabels[i]>maxProbability)
            {
                maxProbability = prediction.PredictedLabels[i];
                indexofMaxProb = i;
            }
        }

        pr = predictionResults[indexofMaxProb];

    }
    else error = "Prediction Engine Not initialized";

    return pr;
}
public class PredictionResultClass
{
    public string Label = "";
    public float probability = 0;
}

public void ModelMassTest(string samplesfolder)
{

    string[] inputfiles = Directory.GetFiles(samplesfolder);
    List<double> analysistimes = new List<double>();
    foreach (var fl in inputfiles)
    {

        //Emgu.CV.Image<Emgu.CV.Structure.Bgr, byte> Img = new Emgu.CV.Image<Emgu.CV.Structure.Bgr, byte>(fl);
        // Img.ROI = JsonConvert.DeserializeObject<Rectangle>("\"450, 288, 420, 1478\"");
        // string savePath = @"C:\ImageMLProjects\Tresseme200Ml Soiling Experiment\Tresseme200MlImages\ROIApplied\Bad\" + Path.GetFileName(fl);
        // Img.Save(savePath);

        //Bitmap bitmap = Emgu.CV.BitmapExtension.ToBitmap(Img); // your source of a bitmap
        Bitmap bitmap = new Bitmap(fl);
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var res =  GetModelPrediction(bitmap, out string error);

        sw.Stop();
        PrintResultsonConsole(res, Path.GetFileName(fl));

        Console.WriteLine($"Analysis Time(ms): {sw.ElapsedMilliseconds}");
        analysistimes.Add(sw.ElapsedMilliseconds);

    }

    if(analysistimes.Count()>0)
        Console.WriteLine($"Average Analysis Time(ms): {analysistimes.Average()}");
}

public static ModelMetadataPropertiesClass LoadProperties(string MetadatePropertiesFilepath, out string error)
{
    string propertiesText = File.ReadAllText(MetadatePropertiesFilepath);
    error = "";
    ModelMetadataPropertiesClass mtp = new ModelMetadataPropertiesClass();

    try
    {
        mtp = JsonConvert.DeserializeObject<ModelMetadataPropertiesClass>(propertiesText);
    }
    catch (Exception ex)
    {
        error = ex.ToString();
        mtp = null;
    }

    return mtp;
}
public class ModelMetadataPropertiesClass
{
    [JsonProperty("CustomVision.Metadata.AdditionalModelInfo")]
    public string CustomVisionMetadataAdditionalModelInfo { get; set; }

    [JsonProperty("CustomVision.Metadata.Version")]
    public string CustomVisionMetadataVersion { get; set; }

    [JsonProperty("CustomVision.Postprocess.Method")]
    public string CustomVisionPostprocessMethod { get; set; }

    [JsonProperty("CustomVision.Postprocess.Yolo.Biases")]
    public string CustomVisionPostprocessYoloBiases { get; set; }

    [JsonProperty("CustomVision.Postprocess.Yolo.NmsThreshold")]
    public string CustomVisionPostprocessYoloNmsThreshold { get; set; }

    [JsonProperty("CustomVision.Preprocess.CropHeight")]
    public string CustomVisionPreprocessCropHeight { get; set; }

    [JsonProperty("CustomVision.Preprocess.CropMethod")]
    public string CustomVisionPreprocessCropMethod { get; set; }

    [JsonProperty("CustomVision.Preprocess.CropWidth")]
    public string CustomVisionPreprocessCropWidth { get; set; }

    [JsonProperty("CustomVision.Preprocess.MaxDimension")]
    public string CustomVisionPreprocessMaxDimension { get; set; }

    [JsonProperty("CustomVision.Preprocess.MaxScale")]
    public string CustomVisionPreprocessMaxScale { get; set; }

    [JsonProperty("CustomVision.Preprocess.MinDimension")]
    public string CustomVisionPreprocessMinDimension { get; set; }

    [JsonProperty("CustomVision.Preprocess.MinScale")]
    public string CustomVisionPreprocessMinScale { get; set; }

    [JsonProperty("CustomVision.Preprocess.NormalizeMean")]
    public string CustomVisionPreprocessNormalizeMean { get; set; }

    [JsonProperty("CustomVision.Preprocess.NormalizeStd")]
    public string CustomVisionPreprocessNormalizeStd { get; set; }

    [JsonProperty("CustomVision.Preprocess.ResizeMethod")]
    public string CustomVisionPreprocessResizeMethod { get; set; }

    [JsonProperty("CustomVision.Preprocess.TargetHeight")]
    public int CustomVisionPreprocessTargetHeight { get; set; }

    [JsonProperty("CustomVision.Preprocess.TargetWidth")]
    public int CustomVisionPreprocessTargetWidth { get; set; }

    [JsonProperty("Image.BitmapPixelFormat")]
    public string ImageBitmapPixelFormat { get; set; }

    [JsonProperty("Image.ColorSpaceGamma")]
    public string ImageColorSpaceGamma { get; set; }

    [JsonProperty("Image.NominalPixelRange")]
    public string ImageNominalPixelRange { get; set; }
}

public static void PrintResultsonConsole( PredictionResultClass pr,string  filePath)
{
    var defaultForeground = Console.ForegroundColor;
    var labelColor = ConsoleColor.Magenta;
    var probColor = ConsoleColor.Blue;
    var exactLabel = ConsoleColor.Green;
    var failLabel = ConsoleColor.Red;

    Console.Write("ImagePath: ");
    Console.ForegroundColor = labelColor;
    Console.Write($"{Path.GetFileName(filePath)}");
    Console.ForegroundColor = defaultForeground;

    Console.ForegroundColor = defaultForeground;
    Console.Write(" predicted as ");
    Console.ForegroundColor = exactLabel;
    Console.Write($"{pr.Label}");

    Console.ForegroundColor = defaultForeground;
    Console.Write(" with probability ");
    Console.ForegroundColor = probColor;
    Console.Write(pr.probability);
    Console.ForegroundColor = defaultForeground;
    Console.WriteLine("");
}

}
```

To execute inferencing, I then instantiate the model scorer and consume it:

```csharp
static void Main(string[] args)
{
        var onnxModelScorer = new OnnxModelScorer();

        onnxModelScorer.SetupPredictionEngine(@"..\..\..\OnnxModel\", out string error);
        onnxModelScorer.ModelMassTest(@"..\..\..\SampleImages\Bad\");
        ConsoleHelpers.ConsolePressAnyKey();
        onnxModelScorer.ModelMassTest(@"..\..\..\SampleImages\Good\");

        ConsoleHelpers.ConsolePressAnyKey();

ConsoleHelpers.ConsolePressAnyKey();

}
```

Expected behavior

When utilizing the OnnxRuntime package, the average inferencing time is ~40ms; with OnnxRuntime.GPU I expected it to be less than 10ms.

Screenshots

N/A

Additional context

This is a performance-oriented question about how well OnnxRuntime.GPU allows .NET developers to exploit the benefits of faster inferencing on Nvidia GPUs.

If having the full project with OnnxModel and sample images would help you investigate better, please access the following link and request access: https://drive.google.com/drive/folders/1DqnUvTaU9xp2QLuV_X9jFCjkratckMYL?usp=sharing

yuslepukhin commented 2 years ago

The ability to take advantage of the GPU depends on two factors: 1) the way ONNX Runtime is used via the ONNX Runtime API (is the GPU provider enabled?), and 2) whether the resulting model contains references to kernels that have GPU implementations.

The above code is a very high-level use of another Microsoft ML framework that invokes ONNX Runtime somewhere deep underneath. The ML model may refer to non-standard nodes that have no GPU implementation. You can visualize the model using https://netron.app/.

If the model is a mix of GPU-supported and unsupported nodes, inference incurs the cost of transferring data back and forth rather than staying on the GPU for the whole inference run.

Standard ONNX ops are documented here.

Supported ML extensions to facilitate MS ML execution are documented here. I think most of them, if not all, are CPU only.

Official docs here refer to ONNX models we know are fully supported by ONNX Runtime.

Filing an issue here might help.

skottmckay commented 2 years ago

If it's an image processing model, it's highly likely all operators are supported on GPU. However, the GPU execution provider (EP) is not registered by default - you have to do that explicitly, and I don't know what ML.NET does under the covers.

I believe there's also a requirement to have CUDA and cuDNN installed, otherwise the CUDA EP won't load. There's some older ML.NET documentation about this linked in this StackOverflow question. The CUDA/cuDNN versions ORT requires are listed here.

You could alternatively call ORT directly, but you would need to add code to do the pre/post-processing. As you just seem to have an image resize in the pre-processing, this example code may help. There are other libraries that can be used for the resize depending on your requirements - see here for a comparison of the most popular ones.
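For reference, a minimal sketch of what calling ORT directly with the CUDA EP can look like in C#. The input name "data" is taken from the ML.NET pipeline above; `BuildInputTensor` is a hypothetical placeholder for your own resize/tensor-conversion code:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Create a session with the CUDA execution provider and run a single inference.
using var options = SessionOptions.MakeSessionOptionWithCudaProvider(0);
using var session = new InferenceSession(@"model.onnx", options);

Tensor<float> input = BuildInputTensor();   // hypothetical preprocessing helper (resize + NCHW float tensor)
var inputs = new List<NamedOnnxValue> { NamedOnnxValue.CreateFromTensor("data", input) };

using var results = session.Run(inputs);
var scores = results.First().AsTensor<float>();   // class scores from the model output
```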

noumanqaiser commented 2 years ago

The onnx model can be accessed at the link below,

https://1drv.ms/u/s!AoCxHIRfqNffinv_ECpvMGDY4jbl?e=aLBuMm

My understanding is that the instructions/opset should not be an issue, but I have a very limited understanding of opsets and would appreciate it if you could take a look at the onnx file and confirm.

Referring to @skottmckay's comment, I have tried various combinations of CUDA and cuDNN. I couldn't find any document explaining which CUDA/cuDNN version combination allows GPU inferencing with ML.NET, but using ML.NET for model training requires CUDA 10.1 with cuDNN 7.6.4, as explained here (https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/install-gpu-model-builder). However, there is no mention of whether this affects ML.NET's inference or only training.

On the OnnxRuntime website, a table of compatible CUDA and cuDNN versions is given (https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements); based on this, CUDA 11.4 with cuDNN 8.2.2.26 should allow the GPU to be exploited when consumed via the Microsoft.ML.OnnxRuntime.Gpu NuGet package.

I have tried out both of these combinations, and in neither case does the inference performance show any improvement.

This leaves me with the final option of using external libraries for preprocessing and calling ORT directly, which more or less takes ML.NET entirely out of the equation. In Mike's article, I see that he hasn't tried it with the OnnxRuntime CUDA or DirectML execution providers, but this may be worth trying.

yuslepukhin commented 2 years ago

I have contacted @michaelsharp from ML.NET

noumanqaiser commented 2 years ago

@michaelgsharp @skottmckay @yuslepukhin

As Scott suggested, I created another class in the project that consumes ONNX Runtime directly, with image pre-processing (resizing and tensor loading) managed externally. Unfortunately, I see no performance gains.

For this benchmarking, I am using the following environment:

- Windows 10 Home 21H1, Dell Inspiron 5406, Core i7 1165G7, 16GB RAM, Nvidia MX330 2GB GPU
- Program written in C#, .NET 5, Console App, Visual Studio 2022 v17.0.4
- CUDA/cuDNN version: CUDA Toolkit 11.4.3, cuDNN 8.2.2.26 for CUDA 11.4
- GPU model and memory: Nvidia MX330 with 2GB memory

Type of model: Image classification
Model trained on: Microsoft CustomVision and exported as an ONNX model. Model details are below:

(screenshot)

In all experiments, I deleted the bin folder entirely and rebuilt the solution to ensure no mix-up of DLLs. The same data set of around 100 images was used for each comparison.

Benchmark 1: Image classification, using ML.NET and Onnxruntime 1.10 and Microsoft.ML version 1.7

Code shared earlier in the first post:

- Average analysis time: 71.6ms
- Average processor % during inferencing: ~50%
- Average memory usage: ~300MB

(screenshot)

Benchmark 2: Image classification, Onnxruntime 1.10 and resizing/tensor loading managed externally,

Here is the code I used for inferencing:

```csharp
    Dictionary<int, string> ModelLabels = new Dictionary<int, string>();
    List<double> inferenceTimes = new List<double>();
    List<double> ResizingTimes = new List<double>();
    List<double> ImagetoTensorConversionTimes = new List<double>();

    public void GetMasspredictions(string samplesfolder, string modelFolderPath)
    {
        inferenceTimes.Clear();
        ResizingTimes.Clear();
        ImagetoTensorConversionTimes.Clear();

        string[] inputfiles = Directory.GetFiles(samplesfolder);
        string modelPath = modelFolderPath + @"model.onnx";
        ModelLabels.Clear();
        string[] labels = File.ReadAllText(modelFolderPath + @"labels.txt").Split('\n');

        int i = 0;
        foreach (var label in labels)
        {
            ModelLabels.Add(i, label);
            i++;
        }

         i = 0;

        using (var session = new InferenceSession(modelPath))//, SessionOptions.MakeSessionOptionWithCudaProvider()))
        {
            foreach (var fl in inputfiles)
            {

                Bitmap bitmap = new Bitmap(fl);
                Stopwatch sw = new Stopwatch();

                var inputs = GetModelInput(bitmap);
                sw.Start();

                // Run the inference
                using (var results = session.Run(inputs))
                {
                    // Get the results
                    foreach (var r in results)
                    {

                        int prediction = MaxProbability(r.AsTensor<float>());
                        Console.WriteLine("Prediction: " + ModelLabels[prediction].ToString());

                    }
                }

                sw.Stop();

                Console.WriteLine($"Inference Time(ms): {sw.ElapsedMilliseconds}");
                i++;
                if (i > 2)  //excluding the initial samples from the stats as they take much longer (warm-up)
                    inferenceTimes.Add(sw.ElapsedMilliseconds);
            }

        }

        if (inferenceTimes.Count() > 0)
        {
            Console.WriteLine($"Average Inference Time(ms): {inferenceTimes.Average()}");
            Console.WriteLine($"Average Resizing Time(ms): {ResizingTimes.Average()}");
            Console.WriteLine($"Average TensorLoading Time(ms): {ImagetoTensorConversionTimes.Average()}");
            Console.WriteLine($"Average Total Time(ms): {ImagetoTensorConversionTimes.Average() + ResizingTimes.Average()+ inferenceTimes.Average() }");
        }
    }

    public List<NamedOnnxValue> GetModelInput(Bitmap FullImage)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        var inputImage = ResizeBitmap(FullImage, 300, 300);
        sw.Stop();
        ResizingTimes.Add(sw.ElapsedMilliseconds);

        sw.Reset();

        sw.Start();

        //Image to tensor conversion OPTION 1
        // An unsafe method to convert the image to a float tensor; saves around 50ms compared to the GetPixel approach below.
        Tensor<float> input =  ConvertImageToFloatTensorUnsafe(inputImage);

        //Image to Tensor Conversion OPTION 2
        //for a 300x300 image, this can take around 80ms.
        /*
        Tensor<float> input = new DenseTensor<float>(new[] { 1, 3, 300, 300 });
        var mean = new float[] { 0, 0, 0 };
        for (int y = 0; y < inputImage.Height; y++)
        {

            for (int x = 0; x < inputImage.Width; x++)
            {
                var pixel = inputImage.GetPixel(x,y);
                input[0, 0, y, x] = (pixel.R - mean[0]);
                input[0, 1, y, x] = (pixel.G - mean[1]);
                input[0, 2, y, x] = (pixel.B - mean[2]);
            }
        }
        */
        // Setup inputs and outputs
        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor<float>("data",input)
        };
        sw.Stop();
        ImagetoTensorConversionTimes.Add(sw.ElapsedMilliseconds);

        return inputs;

    }

    public Tensor<float> ConvertImageToFloatTensorUnsafe(Bitmap image)
    {
        // Create the tensor with NCHW dimensions { batch, channels, height, width }
        Tensor<float> data = new DenseTensor<float>(new[] { 1, 3, image.Height, image.Width });

        BitmapData bmd = image.LockBits(new System.Drawing.Rectangle(0, 0, image.Width, image.Height), System.Drawing.Imaging.ImageLockMode.ReadOnly, image.PixelFormat);
        int PixelSize = 3;
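        // PixelSize of 3 assumes a 24bpp BGR bitmap; bmd.Stride accounts for any row padding.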

        unsafe
        {
            for (int y = 0; y < bmd.Height; y++)
            {
                // row is a pointer to a full row of data with each of its colors
                byte* row = (byte*)bmd.Scan0 + (y * bmd.Stride);
                for (int x = 0; x < bmd.Width; x++)
                {
                    // note the order of colors is BGR
                    data[0, 0,y, x] = row[x * PixelSize + 0];// / (float)255.0;
                    data[0,1, y, x] = row[x * PixelSize + 1];// / (float)255.0;
                    data[0,2, y, x] = row[x * PixelSize + 2];// / (float)255.0;
                }
            }

            image.UnlockBits(bmd);
        }
        return data;
    }

    public Bitmap ResizeBitmap(Bitmap bmp, int width, int height)
    {
        Bitmap result = new Bitmap(width, height);
        using (Graphics g = Graphics.FromImage(result))
        {
            g.DrawImage(bmp, 0, 0, width, height);
        }

        return result;
    }

    static int MaxProbability(Tensor<float> probabilities)
    {
        float max = -9999.9F;
        int maxIndex = -1;
        for (int i = 0; i < probabilities.Length; ++i)
        {
            float prob = probabilities.GetValue(i);
            if (prob > max)
            {
                max = prob;
                maxIndex = i;
            }
        }
        return maxIndex;

    }

```

Results:

- Average Inference Time (ms): 59.6
- Average Resizing Time (ms): 10.629032258064516
- Average TensorLoading Time (ms): 55.70967741935484
- Average Total Time (ms): 125.93870967741935

- Average Memory Usage: 173 MB
- Average Processor %: ~50%

(screenshot)

Benchmark 3: Image classification, Onnxruntime.GPU 1.10 and resizing/tensor loading managed externally,

Just one change: when creating the session, I use the following line instead:

`using (var session = new InferenceSession(modelPath, SessionOptions.MakeSessionOptionWithCudaProvider()))`

- Average Inference Time (ms): 37.483333333333334
- Average Resizing Time (ms): 6.225806451612903
- Average TensorLoading Time (ms): 29.870967741935484
- Average Total Time (ms): 73.58010752688172

- Average Memory Usage: 3.2 GB (!)
- Average Processor Usage: 13%

(screenshot)

I did another benchmark with Onnxruntime.GPU but with the session created without the GPU: `using (var session = new InferenceSession(modelPath))`

In this case, the results are almost the same as benchmark 2, hence I believe the GPU doesn't even come into action.

Benchmark 4: Image classification, Onnxruntime.GPU 1.10 with ML.NET used for transformation

Pipeline created using this code:

```csharp

var pipeline = mlContext.Transforms
                                .ResizeImages("image", modelprops.CustomVisionPreprocessTargetWidth, modelprops.CustomVisionPreprocessTargetHeight, nameof(ImageInputData.Image), ImageResizingEstimator.ResizingKind.Fill)
                                .Append(mlContext.Transforms.ExtractPixels("data", "image"))
                                .Append(mlContext.Transforms.ApplyOnnxModel("model_output", "data", modelFolderPath + @"model.onnx"));

```

- Average analysis time: 70.8 ms
- Processor usage: ~50%
- Memory: 287 MB

Benchmark 5: Image classification, Onnxruntime.GPU 1.10 with ML.NET used for transformation, explicitly providing the device ID when creating the pipeline

Pipeline created using this code:

```csharp
var pipeline = mlContext.Transforms
                .ResizeImages("image", modelprops.CustomVisionPreprocessTargetWidth, modelprops.CustomVisionPreprocessTargetHeight, nameof(ImageInputData.Image), ImageResizingEstimator.ResizingKind.Fill)
                .Append(mlContext.Transforms.ExtractPixels("data", "image"))
                .Append(mlContext.Transforms.ApplyOnnxModel("model_output", "data", modelFolderPath + @"model.onnx", 0));
```

- Average analysis time: varies from 120 to 150ms
- Processor usage: ~13%
- Memory: 3.2GB

(screenshot)

Using any device ID other than 0 results in an exception when the pipeline is created:

```
Error initializing model: Microsoft.ML.OnnxRuntime.OnnxRuntimeException: [ErrorCode:Fail] D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:122 onnxruntime::CudaCall D:\a\_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:116 onnxruntime::CudaCall CUDA failure 101: invalid device ordinal ; GPU=0 ; hostname=DESKTOP-IC179BD ; expr=cudaSetDevice(info_.device_id);
   at Microsoft.ML.OnnxRuntime.NativeApiStatus.VerifySuccess(IntPtr nativeStatus)
   at Microsoft.ML.OnnxRuntime.InferenceSession.Init(String modelPath, SessionOptions options, PrePackedWeightsContainer prepackedWeightsContainer)
   at Microsoft.ML.OnnxRuntime.InferenceSession..ctor(String modelPath, SessionOptions options)
   at Microsoft.ML.Transforms.Onnx.OnnxModel..ctor(String modelFile, Nullable`1 gpuDeviceId, Boolean fallbackToCpu, Boolean ownModelFile, IDictionary`2 shapeDictionary, Int32 recursionLimit, Nullable`1 interOpNumThreads, Nullable`1 intraOpNumThreads)
   at Microsoft.ML.Transforms.Onnx.OnnxTransformer..ctor(IHostEnvironment env, Options options, Byte[] modelBytes)
```

Not providing the device ID at all leads to the same results as Benchmark 1, which means the GPU is not used.

Benchmark 6: Using Onnxruntime.DirectML v1.10 with ML.NET, no device ID given during pipeline initiation

Same results as benchmark 1

Benchmark 7: Using Onnxruntime.DirectML v1.10 with ML.NET, device ID 0 given during pipeline initiation

The code does not execute and leads to an exception during pipeline creation: `Unable to find an entry point named 'OrtSessionOptionsAppendExecutionProvider_CUDA' in DLL 'onnxruntime'.`

Conclusion:

The question is: what can be done to bring this inference time down drastically? So far, nothing I have tried brings the inference time below 15ms.

To take this forward and have you guys try out all scenarios, I am willing to share the full project with onnx model and sample images so further benchmarking can be done.

skottmckay commented 2 years ago

Can you please tighten up how you're measuring time? Including something like Console.WriteLine can greatly skew the numbers for a run. Things like the tensor loading time in benchmark 2 vs 3 are really suspicious - shouldn't that always be roughly equal, given it has nothing to do with how the model is executed?

For the discussion of whether ORT is using the GPU or not, it will be simpler to just focus on the Run call time - i.e. start/stop the stopwatch around the single line that calls Run.
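To illustrate, a minimal sketch of isolating just the Run call, assuming `session` and `inputs` are set up as in the benchmark code above, with a few warm-up iterations excluded:

```csharp
// Time only session.Run - no Console.WriteLine or pre/post-processing inside the timed region.
var runTimes = new List<double>();
for (int iter = 0; iter < 103; iter++)
{
    var sw = Stopwatch.StartNew();
    using (var results = session.Run(inputs)) { }
    sw.Stop();
    if (iter >= 3)                      // discard warm-up runs
        runTimes.Add(sw.Elapsed.TotalMilliseconds);
}
Console.WriteLine($"Run-only average (ms): {runTimes.Average()}");
```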

The CUDA operator kernels are rather large, so I think the much larger memory usage is a good signal that the CUDA execution provider is being loaded.

Benchmark 3 was 37.5ms vs 59.6 with CPU for benchmark 2. That seems significantly better to me and a good signal that the CUDA execution provider is being used.

There's also the device copy overhead of copying input/output between CPU and CUDA to run the model and retrieve the results. It's possible to avoid that, but doing so is only meaningful if your input starts on the GPU (which it won't, given you have pre-processing) or your output will be consumed on the GPU (generally not the case unless you're feeding the output into another model).

I'm not quite sure what your expectations are regarding performance. Where does the 15ms target come from?

skottmckay commented 2 years ago

One other thing you could do is use the NVIDIA Control Panel to see GPU utilization when running the model.

noumanqaiser commented 2 years ago

@skottmckay The entire discussion stems from an evaluation where my team is assessing whether Microsoft CustomVision could be used as a generic model training platform, with models eventually deployed in manufacturing to run high-speed inferencing on images obtained from machine vision cameras. The idea was that Microsoft CustomVision + OnnxRuntime + ML.NET could be a high-performance solution for defect analysis. On high-speed manufacturing lines, inference performance is key to deciding whether the product being analyzed should be rejected or the production line stopped.

Now, in the case above, my takeaway is that there is hardly any performance difference between Benchmark 1 (ML.NET + OnnxRuntime) and Benchmark 3 (OnnxRuntime.GPU + external preprocessing). The added value of CUDA hardware acceleration is being offset elsewhere. Do you think Benchmark 1 is the best possible solution in this case, or do we still have some options?

skottmckay commented 2 years ago

At this point I would have said the numbers weren't reliable enough and I'd be looking to re-measure without any extraneous code (especially Console.WriteLine calls within timed regions).

Something like the image pre/post-processing times should be consistent. Without achieving consistency, or understanding why you can't, I would have low confidence in the numbers in general. Using averages is also potentially flawed: one slow result (e.g. a Console.WriteLine that blocks) can overly affect the values, so understanding the distribution of the latency measurements may be important (depending on what percentile you measure production latency at, of course).
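As a small illustration of looking at the distribution rather than just the mean, a sketch that reports median and p95, assuming the per-run timings are collected into a `runTimes` list as in the sketch above:

```csharp
// Percentiles are less sensitive to a single slow outlier than the mean.
static double Percentile(List<double> values, double p)
{
    var sorted = values.OrderBy(v => v).ToList();
    int index = (int)Math.Ceiling(p * sorted.Count) - 1;
    return sorted[Math.Max(0, Math.Min(index, sorted.Count - 1))];
}

Console.WriteLine($"Median Run time (ms): {Percentile(runTimes, 0.50)}");
Console.WriteLine($"p95 Run time (ms):    {Percentile(runTimes, 0.95)}");
```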

Is it possible to take all the pre/post-processing out of the measurements and just measure the Run/Predict call? Hopefully there's a way to call ML.NET with your manually pre-processed image as input, to be able to do that for the ML.NET numbers.

Not sure if CustomVision has an option to move the pre-processing into the model post-training. We're currently implementing that sort of capability. It would be run as a one-off via a python script to update the model. That way the resize and channel/layout transpose would be handled within the InferenceSession.Run by optimized code that is parallelized where applicable (including running that processing on GPU if that's enabled). I would also guess that ML.NET has pretty optimized implementations of these common pre-processing steps, hence the overhead of doing that via their pipeline is likely a lot lower than the primitive C# code from our examples. e.g. instead of setting individual pixels you can block copy sections depending on what the transpose needs to do.
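For illustration, a hedged sketch of the block-copy idea: read each bitmap row with a single Marshal.Copy call and fan the bytes out into an NCHW float tensor, rather than calling GetPixel per pixel. It assumes a 24bpp BGR bitmap and is only a sketch, not the ML.NET implementation:

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;
using Microsoft.ML.OnnxRuntime.Tensors;

// Sketch: copy whole rows out of the bitmap with Marshal.Copy, then scatter the
// BGR bytes into an NCHW float tensor. Assumes PixelFormat.Format24bppRgb.
static Tensor<float> BitmapToTensorBlockCopy(Bitmap image)
{
    var tensor = new DenseTensor<float>(new[] { 1, 3, image.Height, image.Width });
    var rect = new Rectangle(0, 0, image.Width, image.Height);
    BitmapData bmd = image.LockBits(rect, ImageLockMode.ReadOnly, PixelFormat.Format24bppRgb);
    try
    {
        var row = new byte[bmd.Stride];
        for (int y = 0; y < bmd.Height; y++)
        {
            // One managed copy per row instead of one GetPixel call per pixel.
            Marshal.Copy(bmd.Scan0 + y * bmd.Stride, row, 0, bmd.Stride);
            for (int x = 0; x < bmd.Width; x++)
            {
                // 24bpp rows are laid out B, G, R per pixel.
                tensor[0, 0, y, x] = row[x * 3 + 0];
                tensor[0, 1, y, x] = row[x * 3 + 1];
                tensor[0, 2, y, x] = row[x * 3 + 2];
            }
        }
    }
    finally
    {
        image.UnlockBits(bmd);
    }
    return tensor;
}
```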

Does using ML.NET let you take advantage of ORT's support for concurrent calls and batching? What is the relative importance of the per-Run latency vs. throughput, as concurrent requests and/or batching would help with throughput.

For the device copy overhead of using CUDA, batching may amortize some of that. How much that copy costs is probably device dependent - testing on a laptop may have a completely different overhead to hardware that is purely focused on processing images on the GPU. That said, your measurement of inferencing time includes this device copy overhead, and benchmark 3 was still significantly faster than benchmark 2 (the call to Run handles the copy to/from CUDA).

pranavsharma commented 2 years ago

One other way to find out if the CUDA execution provider is being exercised is to turn on verbose logging and look for "Node placements" in the logs. This would tell you which nodes (if not all) were placed on CUDA.
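A minimal sketch of turning that on from the C# API (these SessionOptions members also appear in a snippet later in this thread):

```csharp
// Sketch: enable verbose logging so node placement decisions appear in the log output.
var options = new SessionOptions();
options.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE;
options.AppendExecutionProvider_CUDA(0);

using var session = new InferenceSession("model.onnx", options);
// Look for "Node placements" lines in the output to see which nodes were
// assigned to the CUDA EP and which fell back to CPU.
```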

rvdinter commented 2 years ago

I am experiencing the same issue when first loading the ONNX model, saving it to .zip, and later loading this model. The GPU does not seem to use more VRAM, nor processing power.

michaelgsharp commented 2 years ago

Yeah, I'm one of the devs on the ML.NET side of things. This is actually an issue on our side: after you save/load a model, it no longer uses the GPU, and the user doesn't have a way of forcing it to use the GPU. We didn't realize this was an issue until this GitHub issue was filed, but we are currently working on a fix for it.

mmayer-lgtm commented 2 years ago

@michaelgsharp Sorry for asking, but is there any news on this one?

noumanqaiser commented 2 years ago

@michaelgsharp Can you confirm that the GPU was not being utilized even in benchmark 3, where the session was initiated using:

`using (var session = new InferenceSession(modelPath, SessionOptions.MakeSessionOptionWithCudaProvider()))`

and the initialization alone occupied a good deal of RAM? If yes, any idea how soon this issue can be solved, so we can get the maximum benefit from CUDA acceleration?

In a separate thread, I have raised another issue where Onnxruntime.DirectML does not result in any performance gain either. Is the issue you have found generic, preventing all types of hardware acceleration in ML.NET (both CUDA and DirectML)?

mmayer-lgtm commented 2 years ago

@noumanqaiser

Just one comment about this; not sure if this is helpful.

With an object detection model created and trained through Model Builder in Visual Studio (and Azure), inference is a factor of ~5 faster with CUDA than with just the CPU on my notebook (Nvidia Quadro T2000): ~130ms vs. ~700ms for the session.Run() step.

Of course (as you already said) the execution provider for CUDA needs to be added:

```csharp
SessionOptions options = new SessionOptions();
options.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_WARNING; //_VERBOSE
options.AppendExecutionProvider_CUDA(0);

var session = new InferenceSession(@"D:\MLNET\Project2cpp\MLModel1.onnx", options);
using IDisposableReadOnlyCollection<DisposableNamedOnnxValue> results = session.Run(inputs);
```

The installed CUDA and cuDNN versions are the ones mentioned here for version 1.10 of the onnxruntime: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html

(screenshot)

michaelgsharp commented 2 years ago

You should be able to see the benefit of using the GPU in ML.NET until the model has been saved. After saving is where the current issue comes into play: you cannot specify running on the GPU.

That appears to be what is happening, as @mmayer-lgtm is seeing the benefit when using Model Builder (which is before the ML.NET model has been saved/loaded). Loading the model after it's been trained by Model Builder should then see only the CPU being used.

We are working on a fix for this on the ML.NET side.

noumanqaiser commented 2 years ago

@mmayer-lgtm As you said, calling OnnxRuntime.GPU directly and initiating a session with the CUDA execution provider helps achieve an improvement in inference time, but I noticed that this benefit is eaten up by inefficient image preprocessing (resizing the bitmap and transposing it pixel by pixel into the format expected by the model). I tried out various methods, but it seems ML.NET almost always offers superior performance for this preprocessing.

Given this, it would be optimal to use ML.NET to optimize the overall execution time (preprocessing + inferencing).

@michaelgsharp Got it. In my case I would always start with a saved model, hence the observation. Any idea when this issue will be fixed? Will you be releasing a new NuGet package for the fix?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

noumanqaiser commented 2 years ago

> You should be able to see the benefit of using the GPU in ML.NET until the model has been saved. After saving is where the current issue comes into play: you cannot specify running on the GPU.
>
> That appears to be what is happening, as @mmayer-lgtm is seeing the benefit when using Model Builder (which is before the ML.NET model has been saved/loaded). Loading the model after it's been trained by Model Builder should then see only the CPU being used.
>
> We are working on a fix for this on the ML.NET side.

Hi @michaelgsharp, can you confirm whether this fix has been rolled out in ML.NET?

lawrence-laz commented 1 year ago

@michaelgsharp is the fix to this problem not yet available?

I'm using Microsoft.ML 1.7.1 and Microsoft.ML.OnnxRuntime.Gpu 1.9, and despite setting up CUDA according to the MS docs, the loaded model is still using the CPU rather than the GPU.

Maybe the fix is available in Microsoft.ML 2-preview? And if so, is there a compatibility matrix for Microsoft.ML / Microsoft.ML.OnnxRuntime.Gpu / CUDA / cudnn versions?

QuinnDamerell commented 1 year ago

I'm wondering a similar thing. I'm doing image classification with ML.NET and ONNX models; it would be nice to be able to configure ML.NET to use the GPU.

iXab3r commented 1 year ago

Any updates on this?