microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ORT 1.14 Release Candidate available for testing #14431

Closed (faxu closed this issue 1 year ago)

faxu commented 1 year ago

ORT 1.14 will be released in early February. Release candidate builds are available now for testing. If you encounter issues, please report them by responding in this issue.

Release branch: rel-1.14.0
Release manager: @rui-ren

Pypi
  CPU: 1.14.0.dev20230119003
  GPU: 1.14.0.dev20230119003

Nuget
  CPU: 1.14.0-dev-20230123-0954-b51415b0ea
  GPU (CUDA/TRT): 1.14.0-dev-20230123-0954-b51415b0ea
  DirectML: 1.14.0-dev-20230123-0747-b51415b0ea
  WindowsAI: 1.14.0-dev-20230120-1231-b51415b0

npm
  onnxruntime-node: 1.14.0-dev.20230119-b51415b0ea
  onnxruntime-react-native: 1.14.0-dev.20230119-b51415b0ea
  onnxruntime-web: 1.14.0-dev.20230119-b51415b0ea

Maven (Java)
  CPU: 1.14.0-rc1
  GPU: 1.14.0-rc1
canaxx commented 1 year ago

Thank you for the good news.

The "Microsoft.ML.OnnxRuntime.DirectML 1.14.0-dev-20230123-0747-b51415b0ea" RC NuGet package has several problems:

  1. It only works for the "x64" target platform.
  2. Code stops if you try inferencing in a separate Task for parallelism (see the sketch after this comment).
  3. You cannot use this DML package with the CUDA NuGet package in the same project.

Please fix these issues before the RTM packages are released. The former stable version has the same issues, and this problem blocks .NET users from enjoying this lovely backend.
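
For reference, a minimal sketch of the usage pattern item 2 describes, i.e. running inference from a worker Task with the DirectML provider. The model path, input name, and input shape here are placeholder assumptions, not taken from the report:

```csharp
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class ParallelInferenceSketch
{
    static async Task Main()
    {
        using var options = new SessionOptions();
        options.AppendExecutionProvider_DML();

        // "model.onnx", the input name "input", and the shape are placeholders.
        using var session = new InferenceSession("model.onnx", options);

        // Calling Run() from a separate Task is the pattern reported to stop/hang.
        await Task.Run(() =>
        {
            var tensor = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
            var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };
            using var results = session.Run(inputs);
        });
    }
}
```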

sumitsays commented 1 year ago

Hi @canaxx,

  1. The NuGet package does include the DLLs for other architectures (x86, arm64, etc.), and it also has a dependency on DirectML 1.10.0, which likewise ships DLLs for all corresponding architectures (see the attached screenshots of the onnxruntime package downloaded from the link above and of DirectML 1.10.0).

     Can you please share how you are trying to use it for the other architectures?

  2. Can you please share more details on what "Code stops" refers to? Is it crashing on the ORT side, or is it a DML problem? A call stack and error message would really help here.

  3. Thank you for the suggestion. ORT publishes binaries only for selected combinations, but if you want binaries for a specific combination such as CUDA + DML, you can always build them from source. Reference: https://onnxruntime.ai/docs/build/eps.html

canaxx commented 1 year ago

Hi @sumitsays

Can you please check and compare the contents of each platform folder? You will clearly see that only the "x64" platform folder is different from the other platform folders (for the DML backend).

To follow the ongoing discussions about the list I gave in my previous message, please check the issues listed below.

Preview result: the "Microsoft.ML.OnnxRuntime.DirectML 1.14.0-dev-20230123-0747-b51415b0ea" RC NuGet package has several problems.

  1. It only works for the "x64" target platform.
  2. Code stops if you try inferencing in a separate Task for parallelism.
  3. You cannot use this DML package with the CUDA NuGet package in the same project.
  4. You cannot build the DML NuGet package on your own.

Issues relating to the points above:

#14376
#13429
#14378
#14388

elephantpanda commented 1 year ago

Hi, I've tried the latest build, Microsoft.ML.OnnxRuntime.DirectML.1.15.0-dev-20230128-0210-7aecb2150f.

Unfortunately there is still a massive memory leak with DirectML. For example, for a 1 GB model, using the C# runtime:

using Microsoft.ML.OnnxRuntime;

var so = new SessionOptions
{
    ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
    GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED
};
so.AppendExecutionProvider_DML();

var session = new InferenceSession("mymodel.onnx", so); // adds 1 GB VRAM and 1 GB RAM
session.Dispose();                                      // clears the 1 GB VRAM but does not clear the 1 GB RAM

In an IDE such as Unity you will essentially be running this many times over, each time increasing RAM usage by 1 GB, until the program crashes.
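
For reference, a minimal sketch of that repeated load/dispose pattern. The model path is a placeholder, and the working-set printout is only meant to make the growth visible, not to be an exact measurement:

```csharp
using System;
using System.Diagnostics;
using Microsoft.ML.OnnxRuntime;

class LeakRepro
{
    static void Main()
    {
        for (int i = 0; i < 5; i++)
        {
            using var so = new SessionOptions
            {
                ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
                GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED
            };
            so.AppendExecutionProvider_DML();

            // "mymodel.onnx" is a placeholder for the ~1 GB model described above.
            using (var session = new InferenceSession("mymodel.onnx", so))
            {
                // Session created: VRAM and RAM both grow by roughly the model size.
            }

            // Session disposed: VRAM is released, but if the reported leak is present
            // the process working set keeps growing by roughly 1 GB per iteration.
            Console.WriteLine($"Iteration {i}: working set = " +
                $"{Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024)} MB");
        }
    }
}
```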

Fix: I think I have found the fix for this. Make sure your .onnx model is a single file and does not have a separate weights.pb file. It seems that these auxiliary weight files are not being unloaded from RAM. Hopefully this is a simple bug fix.

P.S.

Regarding the thread blocking mentioned above: this may be specific to running Unity in DirectX 12 mode; in DirectX 11 mode I didn't get this problem. My guess is that DirectML is somehow blocking all DirectX 12 processing, but I don't know. (This might be a problem if you want to run inference in the background while displaying other 3D graphics.) Perhaps it can be reproduced by drawing a rotating 3D cube in DirectX 12 while doing inference on another thread.

elephantpanda commented 1 year ago

A second issue still present in the latest build for DirectML is the following:

When you run an inference session on an input, the session becomes optimised for that input's batch size.

This means that if you then run the session on inputs with a different batch size, it is slower than it would be if all your previous inputs had used that batch size.

So to get optimal speed when running inputs of a different batch size, you have to reload the session.

Example: run 10 inputs with batch size 2, then reload the session (for optimal speed), then run 10 inputs with batch size 1.

Without reloading the session in between, the second set of inputs will run slower.

This is only an issue with DirectML and not CUDA.

Why does this happen? I don't know. My guess is that the first time you run the session, it sets up memory specific to that input size, so that running the session again with a different input size requires some conversion at run time.

Why is this a problem? One example is Stable Diffusion. Sometimes you want to run it "guided", which involves sending a batch of two at a time, and sometimes "unguided", which involves sending a batch of one at a time. We shouldn't have to reload the UNet model for optimal efficiency every time we switch between these modes.

To be precise, a batch of 1 might be a tensor of shape (1, 4, 64, 64) and a batch of 2 a tensor of shape (2, 4, 64, 64).
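
As an illustration of the reload-between-batch-sizes pattern described above, here is a minimal sketch against the DirectML provider. The model path "unet.onnx" and the single input named "sample" are assumptions for the sake of the example (a real Stable Diffusion UNet takes additional inputs such as the timestep and text embeddings):

```csharp
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

class BatchSwitchSketch
{
    // Assumed input name; this only illustrates the session-reload pattern.
    const string InputName = "sample";

    static void RunInputs(InferenceSession session, int batch, int count)
    {
        for (int i = 0; i < count; i++)
        {
            // Latents of shape (batch, 4, 64, 64), as described above.
            var latents = new DenseTensor<float>(new[] { batch, 4, 64, 64 });
            var inputs = new[] { NamedOnnxValue.CreateFromTensor(InputName, latents) };
            using var results = session.Run(inputs);
        }
    }

    static void Main()
    {
        using var options = new SessionOptions();
        options.AppendExecutionProvider_DML();

        // 10 "guided" inputs with batch size 2.
        using (var session = new InferenceSession("unet.onnx", options))
        {
            RunInputs(session, batch: 2, count: 10);
        }

        // Recreating the session before switching to batch size 1 restores
        // optimal speed with the DirectML EP, per the observation above.
        using (var session = new InferenceSession("unet.onnx", options))
        {
            RunInputs(session, batch: 1, count: 10);
        }
    }
}
```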

A fix for lower-end GPUs: never change batch sizes. For Stable Diffusion, instead of sending a batch of 2 at a time, just send the inputs one after the other sequentially. This also saves some VRAM and for most GPUs will be just as fast. For higher-end GPUs you may want to send a batch of 2 at a time, so this is not a fix for everyone. The DirectML people have said they are working on memory improvements for larger batch sizes.

seddonm1 commented 1 year ago

Hi @faxu.

Do you publish the docs for 1.14.0 RC1 anywhere, like a versioned https://onnxruntime.ai/docs/api/c/struct_ort_api.html?

natke commented 1 year ago

Yes, the docs are published here: https://onnxruntime.ai/docs/api/c/

seddonm1 commented 1 year ago

Hi. These docs are not versioned, so they do not incorporate all the changes in 1.14, like the additional TensorrtProviderOptions: https://github.com/microsoft/onnxruntime/blob/rel-1.14.0/include/onnxruntime/core/providers/tensorrt/tensorrt_provider_options.h

natke commented 1 year ago

They are not versioned but they should incorporate all changes, and note where an API has been added in a specific release. Let me look into that specific issue.