microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

InferenceSession init hangs for multiple models on Azure Kubernetes Service #13291

Open alexbarl opened 1 year ago

alexbarl commented 1 year ago

Our team trains and applies various ML models by using ML.NET. Some scenarios use onnx through the Microsoft.ML.OnnxRuntime nuget package. Recently we've started moving our services to Azure Kubernetes Service (AKS), where we see that initialization calls to ApplyOnnxModel consistently hang for all models.

Configuration

  1. Microsoft.ML version 1.7.1 packages
  2. Microsoft.ML.OnnxRuntime.Managed and Microsoft.ML.OnnxRuntime version 1.10 packages (for compatibility with services on the previous platform)
  3. The container is based on the mcr.microsoft.com/dotnet/aspnet:6.0 image - Debian GNU/Linux 11 (bullseye)
  4. From what I can see, the AKS host OS is Ubuntu 18.04.6 LTS (Bionic Beaver)
  5. A UTF-8 locale is configured through the Dockerfile. Without the locale configuration, an error message from onnxruntime was traced.
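For reference, the locale configuration from item 5 typically looks like the sketch below in a Debian-based Dockerfile. This is an assumption about the setup, not the reporter's actual Dockerfile; C.UTF-8 is one UTF-8 locale that ships with Debian bullseye without installing the locales package.

```dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:6.0

# Configure a UTF-8 locale (assumed: C.UTF-8, which is built into
# Debian bullseye and needs no extra packages).
ENV LANG=C.UTF-8 LC_ALL=C.UTF-8
```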

The initialization hangs somewhere inside libonnxruntime.so code.

[Image: managed code call stack]

alexbarl commented 1 year ago

I was able to capture some partial information from one of the hung threads with gdb. Not sure if it is helpful:

```
(gdb) thread 49
[Switching to thread 49 (Thread 0x7f4506ffd700 (LWP 496))]
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
38  ../sysdeps/unix/sysv/linux/x86_64/syscall.S: No such file or directory.
(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007f47435b4316 in ?? () from /app/runtimes/linux-x64/native/libonnxruntime.so
#2  0x00007f47435b2f43 in ?? () from /app/runtimes/linux-x64/native/libonnxruntime.so
#3  0x00007f47435b3433 in ?? () from /app/runtimes/linux-x64/native/libonnxruntime.so
#4  0x00007f474342874c in ?? () from /app/runtimes/linux-x64/native/libonnxruntime.so
#5  0x00007f4743428fa6 in ?? () from /app/runtimes/linux-x64/native/libonnxruntime.so
#6  0x00007f474342b518 in ?? () from /app/runtimes/linux-x64/native/libonnxruntime.so
#7  0x00007f8f56ee6ea7 in start_thread (arg=) at pthread_create.c:477
#8  0x00007f8f56ad5aef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
```

pranavsharma commented 1 year ago

Can you isolate AKS and ML.NET and test with ORT's C# APIs only?

yuslepukhin commented 1 year ago

What is the version of ONNX Runtime? Would it be possible to share the model as well?

alexbarl commented 1 year ago

@yuslepukhin, we use onnxruntime 1.10.0 (for compatibility with the previous system). I think there was an issue running a newer version, but I can test it again with the 1.12.1 or 1.13.0 versions. Regarding the model, I'll check if it can be shared internally.

alexbarl commented 1 year ago

I've run a debug build that includes the #13313 change by @yuslepukhin (thank you, @pranavsharma).

InferenceSession creation didn't crash, but it generated the error message below for every loaded model:

```
2022-10-17 00:54:22.369587475 [E:onnxruntime:CSharpOnnxRuntime, env.cc:231 ThreadMain] pthread_setaffinity_np failed for thread: 139629160285952, mask: 1, error code: 22 error msg: Invalid argument. Specify the number of threads explicitly so the affinity is not set.
```

Not sure what the impact of the above error is, or how to fix it.

yuslepukhin commented 1 year ago

This means that thread affinity failed to be set for at least one thread. Usually this happens when the number of threads is not specified. If the desired number of threads is explicitly set, we do not set affinity. Before this change, the code would throw at thread creation and we would hang. Not setting thread affinity does not affect functionality, but it may affect performance. We may need to refine the code further; the thread ID is not queried correctly.