Open martincostello opened 1 month ago
@MarcoRossignoli I think it's another bug of pipe name under non-windows, could you check it please?
Thanks @martincostello for reporting it, I'll check the name length issue. That said, I don't know whether hang/crash dumps work for the native AOT version.
@brianrob @tommcdon we're using the environment variable for the crash dump https://learn.microsoft.com/en-us/dotnet/core/diagnostics/collect-dumps-crash and the Microsoft.Diagnostics.NETCore.Client to take a dump in case of hang, is it working in native aot mode?
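For readers unfamiliar with the environment-variable approach mentioned above, here is a minimal sketch based on the variables described in the linked crash dump documentation. The function name `enable_crash_dumps` and the default directory are hypothetical, and whether a native AOT process honors every one of these settings is an assumption that this thread has not confirmed.

```shell
# Sketch: opt child processes into crash dumps via the documented DOTNET_* variables.
# enable_crash_dumps and the default /tmp/dumps location are illustrative names only.
enable_crash_dumps() {
  local dump_dir="${1:-/tmp/dumps}"
  # The dump directory must exist before the runtime tries to write into it.
  mkdir -p "$dump_dir" || return 1
  # Ask the runtime to write a dump when the process crashes.
  export DOTNET_DbgEnableMiniDump=1
  # 4 = full dump (1 = mini, 2 = heap, 3 = triage).
  export DOTNET_DbgMiniDumpType=4
  # %e expands to the executable name and %p to the process ID.
  export DOTNET_DbgMiniDumpName="$dump_dir/crash_%e_%p.dmp"
}
```

Calling `enable_crash_dumps ./artifacts/dumps` before launching the tests would export the variables to the test process; this covers crashes only, not hangs.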
Sorry I totally forgot about this ticket, will work on it this week!
@martincostello Would you be able to test the fix from our preview feed? I tried cloning your repo to test it out but running build.ps1 on Ubuntu, I get
LondonTravel.Skill failed with 2 error(s) (24.3s) → artifacts/bin/LondonTravel.Skill/release_linux-arm64/bootstrap.dll
clang : error : linker command failed with exit code 1 (use -v to see invocation)
/home/amaury/.nuget/packages/microsoft.dotnet.ilcompiler/8.0.6/build/Microsoft.NETCore.Native.targets(366,5): error MSB3073: The command ""clang" "/home/amaury/alexa-london-travel/artifacts/obj/LondonTravel.Skill/release_linux-arm64/native/bootstrap.o" -o "/home/amaury/alexa-london-travel/artifacts/bin/LondonTravel.Skill/release_linux-arm64/native/bootstrap" -Wl,--version-script=/home/amaury/alexa-london-travel/artifacts/obj/LondonTravel.Skill/release_linux-arm64/native/bootstrap.exports -Wl,--export-dynamic -gz=zlib -fuse-ld=bfd /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/sdk/libbootstrapper.o /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/sdk/libRuntime.WorkstationGC.a /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/sdk/libeventpipe-disabled.a /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/sdk/libstdc++compat.a /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/framework/libSystem.Native.a /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/framework/libSystem.IO.Compression.Native.a /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/framework/libSystem.Net.Security.Native.a /home/amaury/.nuget/packages/runtime.linux-arm64.microsoft.dotnet.ilcompiler/8.0.6/framework/libSystem.Security.Cryptography.Native.OpenSsl.a --target=aarch64-linux-gnu -g -Wl,-rpath,'$ORIGIN' -Wl,--build-id=sha1 -Wl,--as-needed -pthread -ldl -lz -lrt -lm -pie -Wl,-pie -Wl,-z,relro -Wl,-z,now -Wl,--eh-frame-hdr -Wl,--discard-all -Wl,--gc-sections" exited with code 1.
I can try it out tomorrow, but these are the dependency steps the CI runs to install the tooling that native AoT needs on linux: https://github.com/martincostello/alexa-london-travel/blob/9a799b548274c270f8dc61417265d04e5cc19a9d/.github/workflows/build.yml#L65-L74
It's ok, I can reproduce the error with these lines :) Thanks @martincostello <3
We fixed the length issue of the pipe, but it looks like there is still an issue. We haven't yet validated whether hang dump works well in Native AOT mode, so for now the only recommendation would be to not use it for NAOT
We will work on adding tests for it in our next iteration and will post back the results.
the only recommendation would be to not use it for NAOT
The reason I found this bug in the first place was I was specifically trying to diagnose a hang that is only happening to me in native AoT 😅
The reason I found this bug in the first place was I was specifically trying to diagnose a hang that is only happening to me in native AoT 😅
We're waiting on some internal info on how to support hang/crash in native aot mode. We'll let you know as soon as possible.
The reason I found this bug in the first place was I was specifically trying to diagnose a hang that is only happening to me in native AoT 😅
@martincostello do you mean that if you run test in "normal mode" everything is good but if you try with native aot it's hanging?
@brianrob @tommcdon we're using the environment variable for the crash dump https://learn.microsoft.com/en-us/dotnet/core/diagnostics/collect-dumps-crash and the Microsoft.Diagnostics.NETCore.Client to take a dump in case of hang, is it working in native aot mode?
Sorry for the delayed response! Environment variable enabled crash dumps and dotnet-dump adhoc dump collection is possible with NativeAOT if we include a copy of .NET's createdump tool along with the application. The steps needed to collect dumps with NativeAOT are included in https://github.com/dotnet/diagnostics/issues/4150, copy/pasted here:
In .NET 8, using createdump for native AOT applications requires some manual steps:
createdump. The .NET Core version of createdump was modified to work with native AOT applications.
cc @LakshanF @agocke
do you mean that if you run test in "normal mode" everything is good but if you try with native aot it's hanging?
Exactly that.
@martincostello - if you publish the tests for AOT do you get any trim/AOT warnings?
Those are not warnings - just info messages. It seems there are no warnings, unless they're suppressed by something.
@vitek-karas I haven't seen any warning when building locally. I will double check the configuration to ensure the analyzers are enabled correctly.
@martincostello Hi martin, for debugging Native AOT apps in particular, you might find native tooling better. For instance, you could use GDB or LLDB to attach to the running process and dump the stack directly. As long as symbols are on the machine, you shouldn't need any special technology to do basic investigations. The native compiler should be able to understand the native code.
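As a rough illustration of the attach-and-dump-stacks suggestion, the sketch below prints a native backtrace for every thread of a running process. The function name `dump_all_stacks` is hypothetical; it assumes `gdb` or `lldb` is installed, that symbols are available, and that the current user is permitted to attach to the target process.

```shell
# Sketch: print every thread's native backtrace for a (possibly hung) process.
# dump_all_stacks is an illustrative name, not part of any tool.
dump_all_stacks() {
  local pid="$1"
  # Refuse to attach if the process does not exist or cannot be signalled by us.
  if ! kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is not running or not accessible" >&2
    return 1
  fi
  if command -v gdb >/dev/null 2>&1; then
    # -batch runs the given commands and exits, detaching from the process.
    gdb -p "$pid" -batch -ex "thread apply all bt"
  elif command -v lldb >/dev/null 2>&1; then
    lldb -p "$pid" --batch -o "thread backtrace all" -o "detach"
  else
    echo "neither gdb nor lldb found on PATH" >&2
    return 2
  fi
}
```

In a CI job this could be run against the test host's process ID once a timeout elapses, redirecting the output to a file that is uploaded as an artifact.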
Specifically I only seem to get the hangs in the native AoT tests themselves.
The AoT application itself is fine, and I don't get the hangs in normal xunit tests, or on Windows or macOS under native AoT. It seems to be exclusively my native AoT tests on Linux, and then only sometimes (and those tests are 99% copy-paste from existing non-AoT tests for xunit).
It's mostly just a minor annoyance (a few retries and it'll go away), but I'd like to get to the bottom of it to know if it's an issue in my code, or in the native AoT test SDK itself.
Using the built-in tooling seemed like the easiest way to grab a file to inspect to find what was hanging from my CI (which is where I experience the issue), but that ultimately led me to opening this issue as I fell at the first hurdle.
I assume your Native aot tests are also Native aot? Or are they regular coreclr?
They use the native AoT support in MSTest.TestFramework.
Got it. Yup, in that case I would recommend trying to attach with LLDB or GDB. While the dotnet-dump tools have some limited support for Native AOT, the native tools may actually be better in this case.
I might try that, but I'll have to learn LLDB/GDB as I've never used them before. That depends on me being able to reproduce it locally - if it only happens in CI then that's basically a non-starter.
Agreed. I’ll leave this issue open. Unfortunately the person who worked on createdump support with Native aot is out on vacation right now, so it might be a bit before we have a more detailed fix.
OK, I've investigated and have some more information. There may be multiple ways that the runtime creates dump files (I haven't surveyed all the possible options) but one is a program called createdump that's bundled with the .NET install. That does seem to work to create full dump files.
What I'm unclear on is: how does the Microsoft.Testing.Extensions.HangDump system work? Does it invoke createdump directly?
What I'm unclear on is: how does the Microsoft.Testing.Extensions.HangDump system work? Does it invoke createdump directly?
As reported here https://github.com/microsoft/testfx/issues/3097#issuecomment-2168650810 we use the env vars for the crash dump and the package Microsoft.Diagnostics.NETCore.Client to take a dump of the process in case of hang.
OK, I managed to investigate the current state and get a dump out of dotnet-dump, which I think means the same thing can be done using the Microsoft.Diagnostics.NETCore.Client library.
Here are the requirements:
<EventSourceSupport>true</>. This enables the event pipe that allows another process to talk to the AOT project.
createdump binary needs to be next to the AOT binary. That lives either in the SDK installation, or in the Microsoft.NETCore.App.runtime.<rid> runtime pack.
createdump needs to be executable.
DOTNET_DbgEnableMiniDump=1. I think the NETCore.Client library should handle this.
What I think we need to improve on the Native AOT side is error messages. There's no indication what the cause of some of these failures is.
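The file-system part of the requirements above (createdump next to the AOT binary, and executable) could be scripted roughly as follows. This is only a sketch: `stage_createdump` and `check_dump_prereqs` are hypothetical helper names, and the location of createdump inside the SDK or runtime pack is left to the caller because it varies by install. The `EventSourceSupport` property still has to be set in the project file separately.

```shell
# Sketch: copy createdump next to a published native AOT binary and verify it.
# Both function names are illustrative; the source path is supplied by the caller.
stage_createdump() {
  local src="$1"      # path to createdump from the SDK or runtime pack
  local app_dir="$2"  # directory containing the native AOT binary
  [ -f "$src" ] || { echo "createdump not found at: $src" >&2; return 1; }
  [ -d "$app_dir" ] || { echo "app directory not found: $app_dir" >&2; return 1; }
  cp "$src" "$app_dir/createdump" || return 1
  # createdump needs to be executable.
  chmod +x "$app_dir/createdump"
}

check_dump_prereqs() {
  local app_dir="$1"
  if [ -x "$app_dir/createdump" ]; then
    echo "ok: createdump is present and executable"
  else
    echo "missing: executable createdump in $app_dir"
    return 1
  fi
}
```

Running `check_dump_prereqs` as a CI step before the tests would at least surface a missing or non-executable createdump explicitly, which is the kind of error message the comment above notes is currently lacking.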
Describe the bug
Following this tip https://github.com/microsoft/testfx/issues/3095#issuecomment-2166002542 I added the Microsoft.Testing.Extensions.HangDump package to a native AoT test project of mine to try and diagnose a hanging test (which in my case I can only repro on Linux).
Upon making the changes https://github.com/martincostello/alexa-london-travel/pull/1298 and running the CI, the tests fail with a unique exception message each on macOS and Linux as shown below. I can also repro this outside of GitHub Actions CI with WSL.
macOS
Linux
Steps To Reproduce
./build.ps1 in the root of the repository.
Expected behavior
Either:
Actual behavior
The process exits with an error of either ArgumentOutOfRangeException on macOS or IOException on Linux.
Additional context
In case it was an issue with the file name I was providing, I tried not specifying a file name at all in the hope of it having a default like dotnet test does. In that case, a different failure is observed: