Open alexbarl opened 1 year ago
I was able to capture some partial information from one of the hung threads with gdb. Not sure if it is helpful:
(gdb) thread 49
[Switching to thread 49 (Thread 0x7f4506ffd700 (LWP 496))]
38 ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
Can you isolate AKS and ML.NET and test with ORT's C# APIs only?
Which version of onnxruntime are you using? Would it be possible to share the model as well?
@yuslepukhin, we use onnxruntime 1.10.0 (for compatibility with the previous system). I think there was an issue running a newer version, but I can test again with 1.12.1 or 1.13.0. Regarding the model, I'll check internally whether it can be shared.
I ran a debug build that includes @yuslepukhin's change from #13313 (thank you, @pranavsharma).
InferenceSession creation no longer hung or crashed, but every loaded model produced the error message below:
2022-10-17 00:54:22.369587475 [E:onnxruntime:CSharpOnnxRuntime, env.cc:231 ThreadMain] pthread_setaffinity_np failed for thread: 139629160285952, mask: 1, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
I'm not sure what the impact of this error is or how to fix it.
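For context, error code 22 is `EINVAL`. On Linux, `pthread_setaffinity_np` (and the underlying `sched_setaffinity` call) returns it when the requested mask contains no CPU that the thread is permitted to run on, which is a plausible situation in a container with a restricted CPU set. Below is a minimal sketch that reproduces the same errno from Python using only the standard library. The helper names `first_disallowed_cpu` and `try_pin` are made up for illustration, and this is not how onnxruntime itself sets affinity.

```python
import errno
import os


def first_disallowed_cpu(start=0, limit=1024):
    """Return the lowest CPU index >= start that is not in this process's
    current affinity mask, or None if every index below limit is allowed."""
    allowed = os.sched_getaffinity(0)  # Linux-only API
    for cpu in range(start, limit):
        if cpu not in allowed:
            return cpu
    return None


def try_pin(cpu):
    """Try to pin the calling thread to a single CPU.

    Returns 0 on success (and restores the previous mask) or the errno
    reported by the kernel on failure.
    """
    original = os.sched_getaffinity(0)
    try:
        os.sched_setaffinity(0, {cpu})
    except OSError as exc:
        return exc.errno
    os.sched_setaffinity(0, original)  # undo the successful pin
    return 0


if __name__ == "__main__":
    # Assumption: an index at or beyond os.cpu_count() names no usable CPU.
    cpu = first_disallowed_cpu(start=os.cpu_count() or 1)
    if cpu is not None:
        code = try_pin(cpu)
        print(f"pin to CPU {cpu}: errno={code} ({errno.errorcode.get(code, 'OK')})")
```

Running this inside the affected pod and comparing `os.sched_getaffinity(0)` with the mask in the log line might show whether the CPU that onnxruntime tries to pin to is outside the set the container is allowed to use.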
This means that setting the thread affinity failed for at least one thread. Usually this happens when the number of threads is not specified. If the desired number of threads is set explicitly, we do not set affinity. Previously, this failure would throw at thread creation, and we would hang. Not setting thread affinity does not affect functionality, but it may affect performance. We may need to refine the code further: the thread ID is not queried correctly.
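As a rough sketch of the "specify the number of threads explicitly" workaround, the helper below derives a thread count from the CPUs the process may actually run on. The function names and the choice of one inter-op thread are assumptions for illustration. The dictionary keys mirror the `intra_op_num_threads` / `inter_op_num_threads` session options in the onnxruntime Python API (`IntraOpNumThreads` / `InterOpNumThreads` on `SessionOptions` in C#), but nothing here calls onnxruntime.

```python
import os


def usable_cpu_count():
    """Number of CPUs the current process may be scheduled on."""
    try:
        # Honors cpuset/affinity restrictions (Linux only).
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the CPU count.
        return os.cpu_count() or 1


def explicit_thread_settings(max_threads=None):
    """Build explicit thread counts as a plain dict.

    max_threads is an optional cap, e.g. the CPU limit of the pod.
    """
    n = usable_cpu_count()
    if max_threads is not None:
        n = max(1, min(n, max_threads))
    # Assumption: a single inter-op thread is enough for sequential execution.
    return {"intra_op_num_threads": n, "inter_op_num_threads": 1}


if __name__ == "__main__":
    print(explicit_thread_settings(max_threads=4))
```

In C# the resulting value would be assigned with something like `new SessionOptions { IntraOpNumThreads = n }` before creating the `InferenceSession`. Note that Kubernetes CPU limits are usually enforced as CFS quotas rather than cpusets, so the affinity mask can still list every CPU on the node. Passing the pod's limit as `max_threads` is one way to account for that.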
Our team trains and applies various ML models using ML.NET. Some scenarios use ONNX through the Microsoft.ML.OnnxRuntime NuGet package. Recently we started moving our services to Azure Kubernetes Service (AKS), where initialization calls to ApplyOnnxModel consistently hang for all models.
Configuration
The initialization hangs somewhere inside libonnxruntime.so code.
Managed code call stack
![image](https://user-images.githubusercontent.com/73975235/195241403-2a6cdc05-aea9-4455-b759-02c60e1f7cdf.png)