microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Getting "Unable to load DLL 'onnxruntime'" when using ML.NET on Azure Cloud Service #4896

Closed AntonVasserman closed 3 years ago

AntonVasserman commented 4 years ago

Describe the bug
Our service is deployed on Azure Cloud Service. The service is written in C# and uses .NET Framework 4.6.2. We use ML.NET by consuming the following NuGet packages (and only them): Microsoft.ML.OnnxRuntime 1.4.0 and Microsoft.ML.OnnxRuntime.Managed 1.4.0. We get the following exception:

The type initializer for 'Microsoft.ML.OnnxRuntime.NativeMethods' threw an exception.
at Microsoft.ML.OnnxRuntime.SessionOptions.Dispose(Boolean disposing)
at Microsoft.ML.OnnxRuntime.SessionOptions.Finalize()
Inner Exception: Unable to load DLL 'onnxruntime': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
at Microsoft.ML.OnnxRuntime.NativeMethods.OrtGetApiBase()
at Microsoft.ML.OnnxRuntime.NativeMethods..cctor()

At first this exception occurred during unit tests, and we fixed that by manually adding "x64" to our csproj file. The project now produces onnxruntime.dll as it is supposed to, and it runs fine locally, on the Azure DevOps build, and on the Azure DevOps deployment. But on Azure Cloud Service, when it tries to initialize the instances that run the service, it fails with the exception above.

Urgency
This blocks us from introducing new features which improve the service.

System information

To Reproduce
We currently aren't able to reproduce this locally; it only occurs when deploying to Azure Cloud Service.

Expected behavior
Expected the service to run, since we see onnxruntime.dll both locally and in the Azure DevOps deployment artifacts.

Screenshots
Here is a screenshot showing the deployment succeeding with no issues. (screenshot)

Here is a screenshot of onnxruntime.dll in the artifacts of the Azure DevOps build being deployed. (screenshot)

Here is a screenshot of onnxruntime.dll in the package deployed to the Azure Cloud Service. (screenshot)

Here is a screenshot of the instances continuing to fail and retry. (screenshot)

Here is a screenshot of the exception we see in the Azure Cloud Service portal. (screenshot)

ytaous commented 4 years ago

Hi AntonVasserman, thanks for your feedback. Can you please share more details on the call stack and the ML.NET version you are using?
Cheers.

ytaous commented 4 years ago

Hi, a couple more requests:

  1. Can you please try our nightly packages and see if the issue persists? (https://github.com/microsoft/onnxruntime#binaries)
  2. Can you please share your C# sample code showing which C# APIs you are using? Thanks.

AntonVasserman commented 4 years ago

Thanks for the fast reply! After installing the latest dev version and deploying to our Cloud Service, I can see that the instances do start to work (this is progress!!!), but the issue now continues in the service itself.

This is the exception we see each time the service tries to use the method which uses the ML model:

outerType: System.TypeInitializationException
outerMessage: The type initializer for 'Microsoft.Azure.Monitoring.LSA.Shared.OptimizationFlowUtils' threw an exception.
stackTrace:
at Microsoft.Azure.Monitoring.LSA.Shared.OptimizationFlowUtils.ShouldUpdateOptimizationFlowProperties(OptimizationFlowProperties optimizationFlowProperties, String tableListInQuery)
at Microsoft.OI.Alerting.Common.Adapters.ScheduledSearchAdapter.<ProcessInternalAsync>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.OI.Alerting.Common.Adapters.ScheduledSearchAdapter.<ProcessInternalAsync>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.Common.Processor`2.<ProcessAsync>d__29.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.Common.Processors.ExecuteSearchProcessor.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.Common.Processor`2.<ProcessAsync>d__29.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.ProcessingRole.Runners.QueueWorker.d__35.MoveNext()
innermostType: System.DllNotFoundException
innermostMessage: Unable to load DLL 'onnxruntime': The specified module could not be found. (Exception from HRESULT: 0x8007007E)

The source of the exception is our "Shared" project, which has the NuGets installed; the Worker Role project, "ProcessingRole", also has them installed. This is how the NuGets are used in our code: we create an instance of KustoQueryAnalyzer with an instance of the model wrapper, which is called IsQueryIncreasingMLModel. (screenshot) There we create an InferenceSession and use it to get the score of the model. (screenshot)

Regarding the ML.NET version: we use only the two NuGets mentioned; we do not use the Microsoft.ML NuGet. Should we install it as well? We tried to, but it didn't help. Now that the dev version makes the instances work, would the Microsoft.ML NuGet help somehow?

skottmckay commented 4 years ago

It's hard to see from the call stack how ORT is being used and what could be incorrect. Given that a newer build, where multiple calls to Dispose are handled more correctly, got past the first issue, I'm wondering whether there's an issue with not being precise about how and when the various ORT-related types are disposed in your C# code. IDisposable is complex to get correct, so there may be edge cases we're not handling well that can be avoided if your C# code disposes things explicitly.

Are you manually calling Dispose for all IDisposable types, or wrapping IDisposable types in using(...) {} blocks where possible?

Is your IsQueryIncreasingMLModel class IDisposable? It's not clear from the code snippet whether IMLModel inherits from IDisposable. Given that the class contains a member that is IDisposable (_inferenceSession), I would expect it should be.

I'd also suggest running Visual Studio Code Analysis to see whether it flags anything else.

AntonVasserman commented 4 years ago

Providing the entire code:

public class IsQueryIncreasingMLModel : IMLModel
{
    private const string ModelsFolderName = "MLModels";
    private const string ModelFileName = "onnx_model.onnx";

    private readonly InferenceSession _inferenceSession;

    public IsQueryIncreasingMLModel()
    {
        _inferenceSession = new InferenceSession(Path.Combine(Directory.GetCurrentDirectory(), ModelsFolderName, ModelFileName));
    }

    /// <inheritdoc />
    public double GetScore(List<long> features, ITracer tracer = null)
    {
        try
        {
            var onnxValues = new List<NamedOnnxValue>();

            foreach (string name in _inferenceSession.InputMetadata.Keys)
            {
                // The model's first dimension is -1 to allow batching several queries,
                // but we run a single evaluation, so we set that dimension to 1;
                // the features fill the second dimension.
                onnxValues.Add(NamedOnnxValue.CreateFromTensor(
                    name,
                    new DenseTensor<long>(
                        features.ToArray(),
                        new int[2]
                        {
                            1,
                            _inferenceSession.InputMetadata[name].Dimensions[1]
                        })));
            }

            using (IDisposableReadOnlyCollection<DisposableNamedOnnxValue> outputs = _inferenceSession.Run(onnxValues))
            {
                foreach (DisposableNamedOnnxValue output in outputs)
                {
                    if (output.Name.Equals("output_probability"))
                    {
                        return output.AsEnumerable<NamedOnnxValue>()
                            .First()
                            .AsEnumerable<KeyValuePair<long, float>>()
                            .ElementAt(1)
                            .Value;
                    }
                }
            }
        }
        catch (Exception e)
        {
            tracer?.LogTrace("Unable to use ml model for prediction");
            tracer?.LogException(e);
        }

        return 0;
    }
}

So IMLModel doesn't implement IDisposable, and we are actually holding on to the same instance of the InferenceSession. Should we change that? Should we create a new session each time we use GetScore? I believe I read somewhere that the session is thread safe, so I assumed reusing the same instance is alright.

Code Analysis didn't flag anything related to IDisposable.

yuslepukhin commented 4 years ago

IsQueryIncreasingMLModel holds an IDisposable resource, the session. The guideline says that disposable instances must be disposed of when no longer needed. In this case, IsQueryIncreasingMLModel has a disposable resource as a member; it must therefore implement the IDisposable interface itself and dispose of the session there. Furthermore, the code that makes use of IsQueryIncreasingMLModel must dispose of it when no longer needed.

The bottom line is that unmanaged resources must have clear ownership and must be promptly destroyed when no longer needed to prevent resource leaks. IsQueryIncreasingMLModel above clearly owns the instance of the session but does not destroy it.
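
For example, a minimal sketch of the standard dispose pattern (assuming IMLModel itself stays unchanged and the rest of the class is as posted above):

public class IsQueryIncreasingMLModel : IMLModel, IDisposable
{
    private readonly InferenceSession _inferenceSession;
    private bool _disposed;

    // ...constructor and GetScore exactly as before...

    public void Dispose()
    {
        if (_disposed)
        {
            return;
        }

        // The session wraps native resources, so release it deterministically.
        _inferenceSession.Dispose();
        _disposed = true;
        GC.SuppressFinalize(this);
    }
}

Callers then wrap the model in a using (...) { } block, or call Dispose explicitly when done with it.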

skottmckay commented 4 years ago

It's fine to re-use the session, and it is threadsafe. But you do need to dispose of it, given it is IDisposable. Apart from that, the code looks OK.

It's not clear how ORT is involved in the latest issue, which seems to be an exception in an async call. While the inner error is about onnxruntime.dll not being loadable, unlike the first error there's no evidence in the call stack that ORT is involved in causing it. I'm not sure disposing the session correctly is going to change anything here. An experiment would be to temporarily create a new session in each GetScore call and wrap it in a 'using' to see if that changes anything; see the sketch below.
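
A rough sketch of that experiment (_modelPath is a hypothetical field holding the model file path; input construction stays exactly as in the original GetScore):

public double GetScore(List<long> features, ITracer tracer = null)
{
    // Experiment only: a fresh session per call, disposed deterministically.
    using (var session = new InferenceSession(_modelPath))
    {
        var onnxValues = new List<NamedOnnxValue>();

        foreach (string name in session.InputMetadata.Keys)
        {
            onnxValues.Add(NamedOnnxValue.CreateFromTensor(
                name,
                new DenseTensor<long>(
                    features.ToArray(),
                    new[] { 1, session.InputMetadata[name].Dimensions[1] })));
        }

        using (var outputs = session.Run(onnxValues))
        {
            // ...read "output_probability" exactly as in the original code...
        }
    }

    return 0;
}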

AntonVasserman commented 4 years ago

Thanks everyone, I tried the experiment. I decided to create the session in each call instead of making the model IDisposable, as for now that is quicker to test. Since the session is now created on the first call rather than at class initialization, we get a better stack trace to investigate:

System.TypeInitializationException: The type initializer for 'Microsoft.ML.OnnxRuntime.NativeMethods' threw an exception. ---> System.DllNotFoundException: Unable to load DLL 'onnxruntime': The specified module could not be found. (Exception from HRESULT: 0x8007007E)
at Microsoft.ML.OnnxRuntime.NativeMethods.OrtGetApiBase()
at Microsoft.ML.OnnxRuntime.NativeMethods..cctor()
--- End of inner exception stack trace ---
at Microsoft.ML.OnnxRuntime.SessionOptions..ctor()
at Microsoft.ML.OnnxRuntime.InferenceSession..ctor(String modelPath)
at Microsoft.Azure.Monitoring.LSA.Shared.MLModels.IsQueryIncreasingMLModel.GetScore(List`1 features, ITracer tracer)
at Microsoft.Azure.Monitoring.LSA.Shared.MLModels.KustoQueryAnalyzer.IsIncreasing(String query, Boolean isMetricMeasurement, ITracer tracer)
at Microsoft.Azure.Monitoring.LSA.Shared.OptimizationFlowUtils.IsAlertRuleInvariant(LogSearchRuleConfiguration ruleConfig, ScheduleEntity scheduleEntity, ITracer tracer)
at Microsoft.Azure.Monitoring.LSA.Shared.OptimizationFlowUtils.CreateOptimizationProperties(LogSearchRuleConfiguration ruleConfig, ScheduleEntity scheduleEntity, Query query, String kustoTableTypeInSearchQuery, ITracer tracer)
at Microsoft.OI.Alerting.Common.Adapters.ScheduledSearchAdapter.<ProcessInternalAsync>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at Microsoft.OI.Alerting.Common.Adapters.ScheduledSearchAdapter.<ProcessInternalAsync>d__10.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.Common.Processor`2.<ProcessAsync>d__29.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.Common.Processors.ExecuteSearchProcessor.d__2.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.Common.Processor`2.<ProcessAsync>d__29.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Microsoft.OI.Alerting.ProcessingRole.Runners.QueueWorker.d__35.MoveNext()

The same exception as before, but now it is possible to see the OrtGetApiBase method failing.

yuslepukhin commented 4 years ago

Based on the stack above, it looks like the actual ORT library, onnxruntime.dll, cannot be found. Please make sure that it can be found; the simplest thing to do is to place it alongside the .exe that is using it. The dll will have NVIDIA CUDA dependencies if you are using a GPU package; those need to be present on the box: the CUDA and cuDNN libraries and the CUDA drivers. That is the thing to check. Here is a list of dependencies from my dev build:

KERNEL32.dll
MSVCP140D.dll
VCRUNTIME140D.dll
VCRUNTIME140_1D.dll
ucrtbased.dll
**cublas64_10.dll**
**cudnn64_7.dll**
**curand64_10.dll**
**cufft64_10.dll**
ADVAPI32.dll
SHLWAPI.dll
dbghelp.dll
**cudart64_102.dll**
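
(To confirm what a given dll actually links against, running dumpbin /dependents onnxruntime.dll from a Visual Studio developer command prompt prints this kind of list.)
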
skottmckay commented 4 years ago

ORT is importing the onnxruntime dll using the DllImport attribute.

https://github.com/microsoft/onnxruntime/blob/438babd966278ae1331bffbf369429199b0bd028/csharp/src/Microsoft.ML.OnnxRuntime/NativeMethods.cs#L310-L311

'nativeLib' is set to 'onnxruntime', as I believe it needs to be extension-free so it works on multiple platforms (e.g. .so on Linux, .dll on Windows).
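
In other words, the import is roughly of this shape (a sketch only; the linked NativeMethods.cs is the authoritative code):

using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Extension-free library name: the loader resolves onnxruntime.dll on
    // Windows and libonnxruntime.so on Linux.
    private const string nativeLib = "onnxruntime";

    [DllImport(nativeLib, CharSet = CharSet.Ansi)]
    public static extern IntPtr OrtGetApiBase();
}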

It looks like the latest code is just taking a slightly different path and the same underlying issue remains. If you look closely at all the stacks where ORT appears, the exception comes from the static constructor of the class that provides the native methods (Microsoft.ML.OnnxRuntime.NativeMethods..cctor), so most likely the library was never able to load in any of these cases.

If this was fixed by being more explicit about the platform in your local unit testing setup, possibly there is something different in the config for Azure Cloud Service that is causing a mismatch. Or possibly there's some other aspect of Azure Cloud Service causing it not to load. Given we're using standard C# functionality to load the dll, and it works everywhere but Azure Cloud Service, it's probably worth asking them for advice at this point.

AntonVasserman commented 4 years ago

Since we do see the dll in the deployed package, it must be an issue somewhere in Azure Cloud Service. As @yuslepukhin suggested, we will next try to verify that the cloud service contains all the needed dependencies.

AntonVasserman commented 4 years ago

We have finally managed to make this work. The solution, though, was to use version 1.2 of those NuGets rather than 1.4. I am still not sure whether the issue is on the dll side or something related to Azure Cloud Services, although it does seem to work fine with 1.2.

I'm going to work with Azure Cloud Service to see if they can find the issue, and I'd like to follow up with you offline as well to narrow it down, so that in the future it will be possible to use 1.4 on Azure Cloud Services.

AntonVasserman commented 4 years ago

After investigating this a bit more with a support engineer from the Visual Studio team, we narrowed the issue down and found that our service's build produces a different onnxruntime.dll file than other projects.

What we did was create a separate console app that uses the same NuGets (version 1.4.0). The console app runs without any issues. We copied the console app to the Azure Cloud Service machine and it worked there as well. When we replaced the onnxruntime.dll in the console app's bin folder with the one produced by our service (also version 1.4.0), the console app failed with the same exception as our service. Comparing the two onnxruntime.dll files, the console app's is 4,830 KB and the service's is 5,545 KB, so they differ even though the NuGets are the same.

We couldn't open them with ILDASM or with ILSpy; we get the following exception:

ICSharpCode.Decompiler.Metadata.PEFileNotSupportedException: PE file does not contain any managed metadata.
at ICSharpCode.Decompiler.Metadata.PEFile..ctor(String fileName, PEReader reader, MetadataReaderOptions metadataOptions)
at ICSharpCode.ILSpy.LoadedAssembly.LoadAssembly(Object state)
at System.Threading.Tasks.Task`1.InnerInvoke()
at System.Threading.Tasks.Task.Execute()

@skottmckay Could you provide a little info on how this dll is created and what its dependencies are? Maybe you have an idea of how we can open them and see the differences?

skottmckay commented 4 years ago

The NuGet package contains runtimes for different platforms. The 4,830 KB one is the x86 build; the 5,545 KB one is the x64 build. onnxruntime.dll is a native dll, and ILDASM is for managed dlls, so it won't be able to open the native onnxruntime.dll. It would be able to open the managed Microsoft.ML.OnnxRuntime.dll.
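
If you want to confirm which architecture a given dll targets, the PE header records the machine type. A small sketch using the System.Reflection.Metadata package (the dll path here is hypothetical):

using System;
using System.IO;
using System.Reflection.PortableExecutable;

class PeArchCheck
{
    static void Main()
    {
        using (var stream = File.OpenRead(@"bin\onnxruntime.dll"))
        using (var reader = new PEReader(stream))
        {
            // Machine.Amd64 => x64 build, Machine.I386 => x86 build.
            Console.WriteLine(reader.PEHeaders.CoffHeader.Machine);
        }
    }
}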

Is there possibly a mismatch and you're building your service as x86 but shipping it with an x64 onnxruntime dll that it can't load?

If it was the other way around (an x64 service with the x86 dll) I would expect it would still work, though it would obviously be better if they matched.
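
As a quick runtime check of what your service process actually is, the standard .NET APIs report bitness, e.g.:

// True in a 64-bit process, false in a 32-bit (x86) process.
Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}, 64-bit OS: {Environment.Is64BitOperatingSystem}");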

stale[bot] commented 4 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

yuslepukhin commented 4 years ago

@AntonVasserman Do you have any new information?

AntonVasserman commented 4 years ago

@yuslepukhin Sorry for the delay; since the previous version solved the issue, we had to finish a few things before getting back to this.

Regarding new information, not much. As said before, it seems like there is a mismatch between the dll we are building and the dll expected. I will now try to reproduce the issue again and solve it if it reproduces. I hope to provide an explanation in the next few days.

PS. A small question related to performance: it was mentioned in the thread that the InferenceSession is IDisposable and should be disposed of once you're done using it, but also that it is thread safe. We saw in our service that initializing the InferenceSession is fairly expensive. Is it wise (or should it be 100% avoided) to use a single static InferenceSession object for all uses of the ML model? If it is thread safe, of course. Something like the sketch below.
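
A minimal sketch of what we have in mind (ModelPath is a hypothetical constant holding the model file path):

// A single lazily created session, shared by all GetScore calls.
private static readonly Lazy<InferenceSession> SharedSession =
    new Lazy<InferenceSession>(() => new InferenceSession(ModelPath));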

patrickmurray18 commented 3 years ago

@yuslepukhin I'm having a similar issue when I deploy to a cloud service.

So I get the exception:

Message: Unable to load DLL 'onnxruntime': The specified module could not be found. (Exception from HRESULT: 0x8007007E)

This happens when I deploy with both "Any CPU" and "x64" project settings. Surely this must be an issue with your NuGet package? I deploy plenty of other packages and they don't have this issue. This is on 1.6.0. I also tried 1.2.0 and a similar thing happened (though with a huge inner exception trace).

Is there anything I should try?

patrickmurray18 commented 3 years ago

@AntonVasserman Do you have any input on the above? This is proving to be a major hurdle in my use of Microsoft services for ML.

AntonVasserman commented 3 years ago

@patrickmurray18 Sadly we don't have any input, as we are currently moving to Azure Service Fabric instead of Azure Cloud Service, and it didn't give us the same issues. Also, using the previous version (1.2 instead of 1.4) worked for us on Cloud Service. This thread will be closed now, as we don't need further assistance. Thanks everyone for the insights.

adimex commented 2 years ago

Did you install the C++ redistributable?

tungfpmss commented 2 years ago

Hello everyone! I currently have the same problem when deploying ONNX on Windows 7 x64 Professional SP1 (on Windows 10 it works properly). I installed and used

Do you have any suggestions for me in this case? Am I missing something?

benjamin32561 commented 1 year ago

I am encountering the same error when running a program through Docker; when I run the program on my local device it works fine.

.NET: 6.0
Microsoft.ML.OnnxRuntime: 1.13.1

When running InferenceSession(model_path); I get this exception:

DllNotFoundException: Unable to load DLL 'onnxruntime' or one of its dependencies: The specified module could not be found.

KokinSok commented 9 months ago

ONNX is so buggy! I have been using it for some time and it just has one problem after another! I wish MS would fix it and make it work once and for all!

Some fixes:

Requirements: all builds require the English language package with the en_US.UTF-8 locale. On Linux, install the language-pack-en package by running locale-gen en_US.UTF-8 and update-locale LANG=en_US.UTF-8.

Windows builds require the Visual C++ 2019 runtime; the latest version is recommended.

https://onnxruntime.ai/docs/install/

There's a 1-in-100 chance that this will work. MS, please fix ONNX; we need it working!