microsoft / vstest

Visual Studio Test Platform is the runner and engine that powers test explorer and vstest.console.
MIT License
888 stars 320 forks source link

"Test host process crashed" error hard to diagnose #2952

Open KirillOsenkov opened 3 years ago

KirillOsenkov commented 3 years ago

Our CI has an intermittent failure:

The active test run was aborted. Reason: Test host process crashed
Results File: C:\a\_temp\AzDevOps_2019-6vse00024V_2021-06-24_02_47_47.trx
Test Run Aborted.

As far as I can tell, the message is printed by this code: https://github.com/microsoft/vstest/blob/eff66c00b217b355b6ee11034ff8396a618d04e3/src/Microsoft.TestPlatform.CommunicationUtilities/TestRequestSender.cs#L667

It would be really helpful here if it printed the full path to the process executable that crashed. I took a quick look but I couldn't find where to get the executable full path from. Also it seems that the error output stream contents was empty (clientExitErrorMessage) so when the test process crashes it should print the full exception stack to the Console.Error so that the stack appears in the CI log.

KirillOsenkov commented 3 years ago

FYI @aarnott who has been investigating this recently

KirillOsenkov commented 3 years ago

I have found out what my problem was. It's that infamous 100ms bug striking again, I can't even fathom how much that bug has caused our ecosystem in terms of productivity loss. I personally wasted four full days tracking this down.

I had to add --diag to my dotnet test arguments to publish the diagnostic logs, and then a separate step to upload them as artifacts.

Finally I saw this:

TpTrace Error: 0 : 6860, 7, 2021/06/26, 19:34:34.416, 8580065582, vstest.console.dll, LengthPrefixCommunicationChannel.Send: Error sending data: System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host..
 ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at System.IO.BufferedStream.Flush()
   at System.IO.BinaryWriter.Flush()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.Send(String data).
TpTrace Warning: 0 : 6860, 7, 2021/06/26, 19:34:34.417, 8580070372, vstest.console.dll, ProxyOperationManager: Failed to end session: Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.Interfaces.CommunicationException: Unable to send data over channel.
 ---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host..
 ---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
   at System.IO.BufferedStream.Flush()
   at System.IO.BinaryWriter.Flush()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.Send(String data)
   --- End of inner exception stack trace ---
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.Send(String data)
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TestRequestSender.EndSession()
   at Microsoft.VisualStudio.TestPlatform.CrossPlatEngine.Client.ProxyOperationManager.Close()
TpTrace Warning: 0 : 6860, 7, 2021/06/26, 19:34:34.417, 8580071127, vstest.console.dll, ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.

I think it's this issue: https://github.com/microsoft/vstest/issues/2379

https://github.com/microsoft/vstest/blob/d10bcbb28cc3999bcc12758a41a04b998eb9595b/src/Microsoft.TestPlatform.CrossPlatEngine/Client/ProxyOperationManager.cs#L211-L215

The process is busy doing something, then we impatiently kill it after 100ms, AND WE DON'T TELL THE USER WE KILLED IT. All the user sees is:

The active test run was aborted. Reason: Test host process crashed

So then the user is sent on a wild goose chase for 4 days trying to figure out various ways to deploy procdump.exe to the CI agent, find out that it doesn't work anyway because we only explicitly pass StackOverflowException and AccessViolationException as arguments to procdump and ignore other types of exceptions.

We killed the process, we know it, but we don't log it, don't publish the dump, don't have any MSBuild errors, leaving the user helpless, frustrated, blocked, not knowing how to proceed.

nohwnd commented 3 years ago

I agree it could be easier if you use dotnet test. But we also have vstest yaml task (and classic pipeline task) which has all those cool levers, which could have saved you those 4 days.

image

1) It enables crash dump for critical failures (it includes procdump and configures the procdump_path). 2) When running the pipeline, you can click Enable system diagnostics. The task will check that and add diag parameters to the call to vstest. And upload the logs afterwards. You will get them by clicking on ... on the run, and Download logs, our logs have .diag extension. Alternatively you can set syste.debug = true in your pipeline to have this enabled all the time.

image

The error above looks more like testhost crashed, rather than vstest.console just failing to await it. Please upload the logs if you can, or point me to the build. Or is it one of the builds you shared in chat?

KirillOsenkov commented 3 years ago

Yes, it's the build I mentioned in chat.

Things should work regardless of whether I'm using dotnet test or the VSTest yaml task. Where are the docs for the VSTest task? Where are these screenshots from? What is the yaml syntax if I wanted to switch my pipeline from dotnet test to VSTest?

dotnet test experience should be just as well supported as the VSTest task.

KirillOsenkov commented 3 years ago

I've switched to the LocalDumps registry key and will report how that goes.

- powershell: |
    $key = "HKLM:\\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
    $LogDir = "$(Build.ArtifactStagingDirectory)\build_logs"
    New-Item -Path $key -ErrorAction SilentlyContinue
    New-ItemProperty -Path $key -Name 'DumpType' -PropertyType 'DWord' -Value 2 -Force
    New-ItemProperty -Path $key -Name 'DumpCount' -PropertyType 'DWord' -Value 10 -Force
    New-ItemProperty -Path $key -Name 'DumpFolder' -PropertyType 'String' -Value $LogDir -Force
    displayName: Enable LocalDumps in registry
    continueOnError: true
- powershell: |
    $key = "HKLM:\\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps"
    New-Item -Path $key -ErrorAction SilentlyContinue
    New-ItemProperty -Path $key -Name 'DumpType' -PropertyType 'DWord' -Value 0 -Force
  displayName: Disable LocalDumps in registry
  continueOnError: true
  condition: always()
michael-hawker commented 3 years ago

Not sure if this is the same as this issue as well.

We've also just started encountering this error as well, seemingly randomly as we're trying to add one test and it's failing in a completely different error. Log notes here:

TpTrace Verbose: 0 : 6936, 7, 2021/08/02, 22:54:58.235, 23624538797, vstest.console.exe, TestRunRequest:SendTestRunStatsChange: Starting.
TpTrace Verbose: 0 : 6936, 7, 2021/08/02, 22:54:58.236, 23624542076, vstest.console.exe, InProgress is Test_AdvancedCollectionView_Sorting_CustomComparable_With_Shaping
TpTrace Information: 0 : 6936, 7, 2021/08/02, 22:54:58.236, 23624546076, vstest.console.exe, TestRunRequest:SendTestRunStatsChange: Completed.
TpTrace Verbose: 0 : 6936, 7, 2021/08/02, 22:54:58.236, 23624547234, vstest.console.exe, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:50149 localEndPoint: 127.0.0.1:50148
TpTrace Verbose: 0 : 6936, 7, 2021/08/02, 22:54:59.237, 23634551254, vstest.console.exe, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:50149 localEndPoint: 127.0.0.1:50148
TpTrace Verbose: 0 : 6936, 7, 2021/08/02, 22:55:00.237, 23644560575, vstest.console.exe, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:50149 localEndPoint: 127.0.0.1:50148
TpTrace Verbose: 0 : 6936, 7, 2021/08/02, 22:55:01.240, 23654583662, vstest.console.exe, TcpClientExtensions.MessageLoopAsync: NotifyDataAvailable remoteEndPoint: 127.0.0.1:50149 localEndPoint: 127.0.0.1:50148
TpTrace Error: 0 : 6936, 7, 2021/08/02, 22:55:01.416, 23656349603, vstest.console.exe, Socket: Message loop: failed to receive message due to socket error System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size)
   at System.IO.Stream.ReadByte()
   at System.IO.BinaryReader.ReadByte()
   at System.IO.BinaryReader.Read7BitEncodedInt()
   at System.IO.BinaryReader.ReadString()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken), remoteEndPoint: 127.0.0.1:50149 localEndPoint: 127.0.0.1:50148

@nohwnd @KirillOsenkov any updates on getting more diagnostics? We've wasted a couple of weeks pulling our hair out trying to understand what's going wrong here.

FYI @azchohfi @huynhsontung

KirillOsenkov commented 3 years ago

Have you tried the LocalDumps registry key to upload dumps from your CI?

Daniellled commented 2 years ago

I was wondering if you had any update on this issue? I am seeing a stack trace that looks just like the one @michael-hawker mentioned.

The active test run was aborted. Reason: Test host process crashed

Test Run Aborted with error System.Exception: One or more errors occurred. ---> System.Exception: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. ---> System.Exception: An existing connection was forcibly closed by the remote host. at System.Net.Sockets.NetworkStream.Read(Span1 buffer) --- End of inner exception stack trace --- at System.Net.Sockets.NetworkStream.Read(Span1 buffer) at System.Net.Sockets.NetworkStream.ReadByte() at System.IO.BinaryReader.Read7BitEncodedInt() at System.IO.BinaryReader.ReadString() at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable() at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken) --- End of inner exception stack trace ---.

This worked fine in .NET 5 VS 2019. In .NET 6 VS 2022 I am seeing this issue.

I was wondering if I can provide something to help diagnose this issue. Until this is fixed I will have to turn off my unit test.

nohwnd commented 2 years ago

@Daniellled do you have diag logs that you could share? Please. Add --diag:logs/log.txt to collect them, multiple files will be written into the logs folder. Please share all if you can.

delmyers commented 2 years ago

I think that I'm seeing the same thing when trying to use the blame collector in dotnet test. It is causing my CI to fail tests and not upload any of the results or the blame crashes. There is a simple repro:

1) Create a test project with a single test that just hangs for 10 seconds (use Thread.Sleep()) or something 2) Build the test 3) run dotnet test mytestproj.csproj --blame-hang --blame-hang-dump-type full --blame-hang-timeout 5s

See this output:

A total of 1 test files matched the specified pattern.
The active test run was aborted. Reason: Test host process crashed
Data collector 'Blame' message: The specified inactivity time of 5 seconds has elapsed. Collecting hang dumps from testhost and its child processes.
Data collector 'Blame' message: Dumping 21452 - testhost.

The active Test Run was aborted because the host process exited unexpectedly. Please inspect the call stack above, if available, to get more information about where the exception originated from.
The test running when the crash occurred:
IntegrationTest1.IntegrationTest1.Test1

This test may, or may not be the source of the crash.

Attachments:
  E:\dd\dmshiftleft\WebAppRazorPages\Tests\IntegrationTest1\TestResults\02aefd21-43ee-49db-9235-84cd19e3a678\Sequence_13b9bc7b5d554ea9bdfab637b375ce44.xml
  E:\dd\dmshiftleft\WebAppRazorPages\Tests\IntegrationTest1\TestResults\02aefd21-43ee-49db-9235-84cd19e3a678\testhost_21452_20211130T132430_hangdump.dmp
Test Run Aborted with error System.Exception: One or more errors occurred.
 ---> System.Exception: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
 ---> System.Exception: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   at System.Net.Sockets.NetworkStream.ReadByte()
   at System.IO.BinaryReader.Read7BitEncodedInt()
   at System.IO.BinaryReader.ReadString()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---.

I think that if any collector is attempting to gather data at the end of the test run, then this timeout will be hit and none of the attachments will be uploaded.

nohwnd commented 2 years ago

Was this file not created at all? The error below just means that the testhost was killed (the dumper tool did that because it was hanging), but it is the data collector that is reponsible for waiting for the dump to end and passing the file as attachment to the vstest.console. It shows up on the screen so the file should be created successfully.

E:\dd\dmshiftleft\WebAppRazorPages\Tests\IntegrationTest1\TestResults\02aefd21-43ee-49db-9235-84cd19e3a678\testhost_21452_20211130T132430_hangdump.dmp
delmyers commented 2 years ago

Was this file not created at all? The error below just means that the testhost was killed (the dumper tool did that because it was hanging), but it is the data collector that is reponsible for waiting for the dump to end and passing the file as attachment to the vstest.console. It shows up on the screen so the file should be created successfully.

E:\dd\dmshiftleft\WebAppRazorPages\Tests\IntegrationTest1\TestResults\02aefd21-43ee-49db-9235-84cd19e3a678\testhost_21452_20211130T132430_hangdump.dmp

The file gets created, but it doesn't get copied into the correct location to be uploaded to artifacts in my pipelines. Normally trx files and their corresponding dumps would get uploaded to artifacts. I've had to resort to scanning the temp directory for test results and dumps when the blame collector captures hangs.

sanikolov commented 2 years ago

Getting the same with net6.0 but also net5.0, 3.x and 2.x by varying the global json. Thus the .NET Core version does not seem to matter. I believe that a .NET Core SDK update has resulted in this bug, quite possibly by modifying a global value/location because dotnet test ...<etc>.... used to work for us until recently. Just guessing here, so take the above comment with a grain of salt.

danmoseley commented 2 years ago

In https://github.com/dotnet/runtime/issues/66515 this was because of an AV in the test itself. That's not a vs test bug. But IMO vstest should not dump the NotifyDataAvailable stack at this point. NotifyDataAvailable should catch the exception, which should be an IOException (it's been unwrapped below I believe) since that stack is purely a detail of vstest implementation when the test itself spontaneously quits. It makes it look like vstest has a bug, which in this case it does not.

Testlauf für "E:\external\dotnet\runtime\artifacts\bin\System.IO.FileSystem.Tests\Release\net7.0-windows\System.IO.FileSystem.Tests.dll" (.NETCoreApp,Version=v7.0)
Die Testausführung wird gestartet, bitte warten...
Insgesamt 1 Testdateien stimmten mit dem angegebenen Muster überein.
Der aktive Testlauf wurde abgebrochen. Grund: Der Testhostprozess ist abgestürzt. : Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Threading.Thread.get_OptimalMaxSpinWaitsPerSpinIteration()
   at System.Threading.LowLevelSpinWaiter.Wait(Int32, Int32, Int32)
   at System.Threading.LowLevelLifoSemaphore.Wait(Int32, Boolean)
   at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

Ergebnisdatei: E:\external\dotnet\runtime\artifacts\bin\System.IO.FileSystem.Tests\Release\net7.0-windows\TestResults\robin_ROBIN-PC_2022-03-11_19_37_56.trx
HTML-Testergebnisdatei: E:\external\dotnet\runtime\artifacts\bin\System.IO.FileSystem.Tests\Release\net7.0-windows\TestResults\TestResult_robin_ROBIN-PC_20220311_193756.html

Der Testlauf wurde mit dem Fehler System.Exception: One or more errors occurred.
 ---> System.Exception: Unable to read data from the transport connection: Eine vorhandene Verbindung wurde vom Remotehost geschlossen..
 ---> System.Exception: Eine vorhandene Verbindung wurde vom Remotehost geschlossen.
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   at System.Net.Sockets.NetworkStream.ReadByte()
   at System.IO.BinaryReader.Read7BitEncodedInt()
   at System.IO.BinaryReader.ReadString()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken)
   --- End of inner exception stack trace --- abgebrochen.
nohwnd commented 2 years ago

I agree, I've been planning to fix that message for a long time because that error is just confusing.

KundM commented 2 years ago

Same issue here with xUnit on Unix using dotnet test (net 6) in CI pipeline. No artifacts from previous builds. On windows (local machine) it is working. Also tried Microsoft.NET.Test.Sdk version 17.2.0-preview-20220401-08 and xunit version 2.4.2-pre.12 but getting the same error.

nohwnd commented 2 years ago

@KundM did you try enabling diag logs, and looking at it? Or collecting crash dumps? What target framework is your project targetting?

SharonTancy commented 2 years ago

Any updates on this issue? I encountered the same error message An existing connection was forcibly closed by the remote host when I'm trying to run a .net 472 test dll right after a .net core 3.1 test dll in the same process(in our own test tool based on VSTest and targeting .net core 3.1). But if I run the net472 test first, everything works fine. I've been struggling for several days figuring this out. I guess it might be a dependency issue or threading issue but not knowing what's happening for detail.

SharonTancy commented 2 years ago

Any updates on this issue? I encountered the same error message An existing connection was forcibly closed by the remote host when I'm trying to run a .net 472 test dll right after a .net core 3.1 test dll in the same process(in our own test tool based on VSTest and targeting .net core 3.1). But if I run the net472 test first, everything works fine. I've been struggling for several days figuring this out. I guess it might be a dependency issue or threading issue but not knowing what's happening for detail.

Just FYI that I've resolved this issue and found out it's due to low version of Microsoft.NET.Test.Sdk. Upgrading it to 16.10.0 for test projects perfectly fixed my issue.

Hittherhod commented 1 year ago

I'm encountering this issue randomly with the latest .NET 6 SDK. It will happen in random test projects (using XUnit) with no predictability (whether the test project itself was modified doesn't always seem to matter when running the whole of dotnet test). Has there been any progress on this? Adding --diag:logs/logs.txt doesn't give me any more information other than that the Test Host failed.

EugeneMac commented 1 year ago

The same issue with Playwright.NET, xUnit and Microsoft.NET.Test.Sdk 17.4.0 in .netcoreapp3.1 project. Tests run on TeamCity agent.

sttobia commented 1 year ago

Hi everyone!

I am running in this issue while running dotnet test on a GitHub self-hosted runner to unit test a C# .Net Framework 4.8 library (SDK style project). The weird thing is, that when running the same command on the same machine locally and manually from a console, everything seems to work fine. However, as soon as it is run by the GitHub Action Service, it directly aborts with the following error message:

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
The active test run was aborted. Reason: Test host process crashed

Test Run Aborted with error System.Exception: One or more errors occurred.
 ---> System.Exception: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..
 ---> System.Exception: An existing connection was forcibly closed by the remote host.
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.Read(Span`1 buffer)
   at System.Net.Sockets.NetworkStream.ReadByte()
   at System.IO.BinaryReader.Read7BitEncodedInt()
   at System.IO.BinaryReader.ReadString()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---.

I also looked at the diag log files (see attached) and could also find the line ProxyOperationManager.Close: waiting for test host to exit for 100 ms which is what seems to have started this thread here. After a whole day of trying to find what the issue is, I still have no clue how to solve this, even though it seems to have been around for a while.

If any of the people involved (@nohwnd or @KirillOsenkov), would be kind enough to have a look at the attached diag log files and give me some advice on how to proceed, I would be very thankful!

log.host.23-04-26_20-19-21_09647_5.txt log.txt

jchannon commented 1 year ago

Also just started getting this error in the last week. Has to disable our tests to get CI running which is not ideal.

This using .NET7 on Linux Gitlab runners

tanevanwifferen commented 1 year ago

We had the same error on our tests, it appeared we had a stack overflow in one of our tests. It was displayed above this error

mlop3s commented 1 year ago

We are experiencing the same error. Same stack . Since 2023.05.16, on two different pipelines. Only happens on C++ tests. If we split the C# and C++ the first will go through, the second will crash. Locally it runs without problems.

Der Testlauf wurde mit dem Fehler System.AggregateException: Mindestens ein Fehler ist aufgetreten. ---> System.IO.IOException: Von der Übertragungsverbindung können keine Daten gelesen werden: Eine vorhandene Verbindung wurde vom Remotehost geschlossen. ---> System.Net.Sockets.SocketException: Eine vorhandene Verbindung wurde vom Remotehost geschlossen bei System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) --- Ende der internen Ausnahmestapelüberwachung --- bei System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) bei System.IO.Stream.ReadByte() bei System.IO.BinaryReader.ReadByte() bei System.IO.BinaryReader.Read7BitEncodedInt() bei System.IO.BinaryReader.ReadString() bei Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable() bei Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action1 errorHandler, CancellationToken cancellationToken) --- Ende der internen Ausnahmestapelüberwachung --- abgebrochen. Gesamtzahl Tests: unbekannt Bestanden: 1977 Nicht bestanden: 373 Gesamtzeit: 6,8843 Minuten`

Starting: VsTest - testAssemblies

Task : Visual Studio Test (Side-by-side) Description : Run unit and functional tests (Selenium, Appium, Coded UI test, etc.) using the Visual Studio Test (VsTest) runner. Test frameworks that have a Visual Studio test adapter such as MsTest, xUnit, NUnit, Chutzpah (for JavaScript tests using QUnit, Mocha and Jasmine), etc. can be run. Tests can be distributed on multiple agents using this task (version 2). Version : 2.220.0 Author : Microsoft Corporation Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/test/vstest

Microsoft (R) Testausführungs-Befehlszeilentool Version 17.4.1 (x64) NUnit Adapter 4.3.1.0: Test execution started

mlop3s commented 1 year ago

We are experiencing the same error. Same stack . Since 2023.05.16, on two different pipelines. Only happens on C++ tests. If we split the C# and C++ the first will go through, the second will crash. Locally it runs without problems.

Der Testlauf wurde mit dem Fehler System.AggregateException: Mindestens ein Fehler ist aufgetreten. ---> System.IO.IOException: Von der Übertragungsverbindung können keine Daten gelesen werden: Eine vorhandene Verbindung wurde vom Remotehost geschlossen. ---> System.Net.Sockets.SocketException: Eine vorhandene Verbindung wurde vom Remotehost geschlossen bei System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) --- Ende der internen Ausnahmestapelüberwachung --- bei System.Net.Sockets.NetworkStream.Read(Byte[] buffer, Int32 offset, Int32 size) bei System.IO.Stream.ReadByte() bei System.IO.BinaryReader.ReadByte() bei System.IO.BinaryReader.Read7BitEncodedInt() bei System.IO.BinaryReader.ReadString() bei Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable() bei Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action1 errorHandler, CancellationToken cancellationToken) --- Ende der internen Ausnahmestapelüberwachung --- abgebrochen. Gesamtzahl Tests: unbekannt Bestanden: 1977 Nicht bestanden: 373 Gesamtzeit: 6,8843 Minuten`

Starting: VsTest - testAssemblies

Task : Visual Studio Test (Side-by-side) Description : Run unit and functional tests (Selenium, Appium, Coded UI test, etc.) using the Visual Studio Test (VsTest) runner. Test frameworks that have a Visual Studio test adapter such as MsTest, xUnit, NUnit, Chutzpah (for JavaScript tests using QUnit, Mocha and Jasmine), etc. can be run. Tests can be distributed on multiple agents using this task (version 2). Version : 2.220.0 Author : Microsoft Corporation Help : https://docs.microsoft.com/azure/devops/pipelines/tasks/test/vstest

Microsoft (R) Testausführungs-Befehlszeilentool Version 17.4.1 (x64) NUnit Adapter 4.3.1.0: Test execution started

Update:

We have solved the problem by downgrading the Test Platform Version from 17.6.0 to 17.5.0

Cheers.

nohwnd commented 1 year ago

@sttobia there are no clues in the log unfortunately, the testhost just dies, and vstest.console just says that the connection was closed. Were you able to collect dumps or inspect event viewer after the crash for some callstack?

nohwnd commented 1 year ago

@mlop3s were you able to collect a crash dump? e.g. by adding --blame-crash (you most likely need procdump in your PATH). It would be interesting to see what happened in 17.6, I am currently only aware of this critical issue https://github.com/microsoft/vstest/issues/4467 that affects 17.6, but it does not sound like something you are triggering.

mlop3s commented 1 year ago

@nohwnd We are not using dotnet but vstest and it crahses on C++ tests. I'm not aware of a blame-crash option for vstest.console.exe.

The managed tests are not crashing. Only the native C++. Not sure if it's related.

nohwnd commented 1 year ago

It is vstest.console.exe /Blame:CollectDump, ideally use it together with --diag:logs\log.txt reading the datacollector log makes debugging issues much easier.

qfag-ansc commented 1 year ago

It looks like I have a similar Problem. When I start tests with the vstest.console.exe sometimes there is the "The active test run was aborted. Reason: Test host process crashed" despite all tests did run through.

So I tried to activate the timeout from "here https://github.com/microsoft/vstest/pull/3466/commits"

With setting the Variable in the .runsettings like

<RunSettings>
    <RunConfiguration>
        <EnvironmentVariables>
            <VSTEST_TESTHOST_SHUTDOWN_TIMEOUT>60000</VSTEST_TESTHOST_SHUTDOWN_TIMEOUT>
        </EnvironmentVariables>
    </RunConfiguration>
</RunSettings>

Then I tried setting it in the calling console like

SET VSTEST_TESTHOST_SHUTDOWN_TIMEOUT=60000

Then I tried to pass it directly to vstest like

vstest.console.exe dummy.dll /Settings:dummy.runsetting  /e:VSTEST_TESTHOST_SHUTDOWN_TIMEOUT=60000 

Nothing had any effect despite I could verify that in the Test Environment the Environment Variable was always set. Following Unit Test was successful

[TestMethod]
public void CheckEnvironmentVariables()
{
    var timeout = int.TryParse(Environment.GetEnvironmentVariable("VSTEST_TESTHOST_SHUTDOWN_TIMEOUT"), out var time) ? time : 100;
    Assert.AreNotEqual(timeout, 100);
}

So I tried to enable logging to find out if the timeout was set correctly, but I couldn't find anywhere an EqtTrace with the timeout. Not in any logger, not in the diag log. I even tried to update the vstest.console.exe.config like this

<system.diagnostics>
    <switches>
      <add name="TpTraceLevel" value="0" />
      <add name="EqtTraceLevel" value="4" />
    </switches>
  </system.diagnostics>

So where can I find the EqtTrace Log or how do I set the Environment Variable correctly?

nohwnd commented 1 year ago

You would see the log when you use the /diag:logs/log.txt option. This env variable was shipped in 17.2.0, are you using vstest.console that is newer than that?

Why do you think the error is linked to the exit timeout? This might be just a real crash.

qfag-ansc commented 1 year ago

When executing the tool we get "Microsoft (R) Testausfhrungs-Befehlszeilentool Version 17.6.2 (x64)". So the version should be fine.

I suspect a timeout because there is never a problem when we run the tests directly in visual studio.

I tried to log with /diag as suggest. I still couldn't find the EqtTrace Logs, but I saw something interesting in the TpTraces

Successful run:

grafik

Failed run:

grafik

The cleanup takes a bit longer in the version that failes, but it should wait longer before it reports the run as failed.

sttobia commented 1 year ago

@sttobia there are no clues in the log unfortunately, the testhost just dies, and vstest.console just says that the connection was closed. Were you able to collect dumps or inspect event viewer after the crash for some callstack?

Hi @nohwnd sorry for the late reply, this somehow went by the board. Small update on this: I tested the same code using another personal machine as another GitHub self-hosted runner, with the same VS version installed and there, everything seems to work fine. Also, there I get the same log messages:

TpTrace Verbose: 0 : 9604, 12, 2023/04/28, 18:59:21.253, 349831673531, vstest.console.dll, TestRequestSender.EndSession: Sending end session.
TpTrace Verbose: 0 : 9604, 12, 2023/04/28, 18:59:21.253, 349831677654, vstest.console.dll, ProxyOperationManager.Close: waiting for test host to exit for 100 ms
TpTrace Warning: 0 : 9604, 10, 2023/04/28, 18:59:21.281, 349831956475, vstest.console.dll, TestHostManagerCallbacks.ErrorReceivedCallback Test host standard error line: 
TpTrace Verbose: 0 : 9604, 10, 2023/04/28, 18:59:21.303, 349832179437, vstest.console.dll, TestHostProvider.ExitCallBack: Host exited starting callback.
TpTrace Information: 0 : 9604, 10, 2023/04/28, 18:59:21.304, 349832186416, vstest.console.dll, TestHostManagerCallbacks.ExitCallBack: Testhost processId: 14360 exited with 

So as you mentioned, there really seems to be nothing to learn from the logs. How would I go about "collecting dumps or inspecting event viewer after the crash for some callstack?". I think this is not straightforward as the tests are not being run in VS but using the console of a remote machine in our CI/CD pipeline (GitHub self-hosted runner). If run everything in VS locally, the issues don't arise, as mentioned.

Thanks for your help!

nohwnd commented 1 year ago

@sttobia how are you invoking vstest? Is it via yaml using VSTest task? Or in the "classical" graphical mode? Or running vstest.console directly? Or calling dotnet test directly?

Above you mention Github Action, so I assume you are doing dotnet test? in that case add --blame-crash to your run, that should generate the dump, and then you need to upload it.

sttobia commented 1 year ago

@sttobia how are you invoking vstest? Is it via yaml using VSTest task? Or in the "classical" graphical mode? Or running vstest.console directly? Or calling dotnet test directly?

Above you mention Github Action, so I assume you are doing dotnet test? in that case add --blame-crash to your run, that should generate the dump, and then you need to upload it.

Exactly, I am running dotnet test "my\project" --no-build --output dotnet-my\project\bin\x64\Debug\net48 --verbosity detailed

When adding --blame-crash, it can not find procdump and I have been wasting this whole afternoon, trying to add it to the PATH and/or setting PROCDUMP_PATH, so that GitHub can find it. Unfortunately, without success. Any experience or advice on this?

nohwnd commented 1 year ago

@sttobia The hosted agents have chocolatey on them, choco install --yes procdump should work just fine.

sttobia commented 1 year ago

Thanks @nohwnd! Since the crash only happens on our self-hosted runners, this was not as straightforward. But finally I got it dumping, and you can take a look at both dumps on my drive. You should have gotten an e-mail with an invitation, since I can not upload them here.

nohwnd commented 1 year ago

I can see it fails in xUnit.GetAvailableRunnerReporters because it cannot load a module. I can see the name of the module. Sent you an email.

michael-hawker commented 1 year ago

For UWP/WindowsAppSDK app tests, we see this message and it's not useful. We know it's happening at certain points due to an exception in the app platform which bubbles up to the UnhandledException handler. However, all we get in the CI log is the unable to communicate with test process message, which is less than helpful. We don't see the issue locally, so it's an arduous process to understand why it's only failing in our CI.

It'd be great if the test hosting app crashes, if the currently running test (that caused the failure) is output in the log, as well as the unhandled exception that caused the crash's stack info. Just those two changes alone would make debugging this scenario a world of difference better.

Related to #2593 #3936 #4376

Been trying to add better logging here into the logs themselves, but don't see logs working anymore for some reason, trying to upgrade, but was stopped by https://github.com/microsoft/testfx/issues/1726

nohwnd commented 1 year ago

It'd be great if the test hosting app crashes, if the currently running test (that caused the failure) is output in the log,

This is what --blame does, you will get sequence file which has a list of all finished and unfinished tests. And if you enable --diag you will get .datacollector. log that also has entries for tests that started and ended.

--blame-crash is then there to observe your tests (on windows you need procdump).

as well as the unhandled exception that caused the crash's stack info. Just those two changes alone would make debugging this scenario a world of difference better.

If the runtime writes it to the console of the app that is running we are able to capture that from vstest.console and write it to screen and log. That is what we do for .NET and .NET Framework applications. The experience with .NET is way better because they handle stack overflows and other exceptions "in-process" and write them to the screen with callstack before the process dies. In .NET Framework it simply writes the error without call stack and the process is killed. The info is written to event log, but we don't grab that info, because we cannot know exactly what entry to take and don't want to expose info of other processes.

We can look at how it behaves for UWP if you have a specific crash in mind.

sttobia commented 1 year ago

I can see it fails in xUnit.GetAvailableRunnerReporters because it cannot load a module. I can see the name of the module. Sent you an email.

Quick update from my side: Looking at the procdump files, it looked like an issue with a specific DLL file in my build folder. After letting it rest for a while and then updating all packages at some point (including xunit IIRC), the problem was completely gone. In hindsight the process just crashing, and the dump not really helping out either, was really frustrating. Nevertheless, thank you for the support @nohwnd!

sanikolov commented 12 months ago

Just wanted to point out something that has not been made clear in this thread so far. If your tests interop with a native library either directly or thru a dependent managed DLL, and a call to the native library results in a crash (e.g. exception code 0xC0000005 a.k.a EXCEPTION_ACCESS_VIOLATION) then you're going to get the same exact "Test host process crashed" message during test runs. A prerequisite for adding messages here should be: "have you debugged your tests under VS or windbg?"

jeremy-visionaid commented 12 months ago

@sanikolov Some issues might not manifest in the development environment (e.g. only in CI, or remote machines with different configurations). I think it's reasonable to want to more easily diagnose failures after the fact, espcecially in the case of a simple internal timeout from msbuild.

Personally, I think it would also be helpful here if there were an easy way to set up process level isolation with method level parallelization akin to #3424. Then we could more easily identify the failing test and get a simplified procdump to diagnose. I note the change in isolation level might also prevent the crash from occuring, but that would probably be OK for my affected case (as it is due to third-party proprietary DLLs which the upstream provider won't/is unable to fix).

dqwork commented 11 months ago

Hi all - I was just hoping if someone could either help or summarize the state of this issue.

I'll start with whats brought me here.

We've recently started to migrate to running on linux starting running our unit tests dlls using dotnet test on a linux host.

When running on windows the test past with no issue. On Linux we'd hit this mysterious issue saying the 'Test host process crashed'. Looking into the details logs from our test dll - there were no obvious issues... it just stops.

So we followed some of the steps here added --blame added diagnostic logging etc. We isolated it to one test (that again passed fine on windows). I combed the diagnostic log files that are produced

and all I found was

TpTrace Verbose: 0 : 10226, 10, 2023/10/24, 07:55:32.658, 635444445327132, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: NotifyDataAvailable remoteEndPoint: 127.0.0.1:33932 localEndPoint: 127.0.0.1:34540
TpTrace Error: 0 : 10226, 10, 2023/10/24, 07:55:32.659, 635444446846388, vstest.console.dll, Socket: Message loop: failed to receive message due to socket error System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
   at System.IO.BinaryReader.ReadByte()
   at System.IO.BinaryReader.Read7BitEncodedInt()
   at System.IO.BinaryReader.ReadString()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.LengthPrefixCommunicationChannel.NotifyDataAvailable()
   at Microsoft.VisualStudio.TestPlatform.CommunicationUtilities.TcpClientExtensions.MessageLoopAsync(TcpClient client, ICommunicationChannel channel, Action`1 errorHandler, CancellationToken cancellationToken), remoteEndPoint: 127.0.0.1:33932 localEndPoint: 127.0.0.1:34540

that exception is repeated a few times. From some searching this appears the the error you get when the test host unexpectedly terminates... but what I can't find is why...

I see near the start of this issue someone mentions that the above stack trace is useless and that should be improved https://github.com/microsoft/vstest/issues/2952#issuecomment-1065479076

However I'm not seeing any real mention of a 'problem' after that. It seems it gets into weeds about adding diag logs etc... but I've done all that and it sheds no light.

Jump ahead a few wasted days... we found a null ref exception in the test. Its handled and causes no issues on windows. However if we prevent that exception being thrown... the tests run fine on linux... no crashed process.

My theory atm is that for some reason that exception is causing some kind of 'freeze' in the process and then its getting terminated because....???

Anyway I've chatted quite enough... can anyone help us with how could go about diagnosing such issues in the future.

(I can share full logs privately, but can't share them via github)

nohwnd commented 11 months ago

added --blame

Have you also added --blame-crash?

On which version of .NET sdk / vstest.console are you? And could you post simple repro of your problem, where was the null ref failing? My experience with crashes on .NET (not .NET Framework) is that you almost always get the full stack trace on screen in the crashing process. In the past we had problems picking this text up in the top level process, because the output streams api behavior differs on windows and linux, but that should have been fixed for a lo0ong time (since netcoreapp3.1 I think).

I see near the start of this issue someone mentions that the above stack trace is useless and that should be improved #2952 (comment)

Printing that error on crash / abort, was addressed, but it still shows up in the diag logs, because sometimes it is useful.

dqwork commented 11 months ago

Hello! I really appreciate you responding.

I did add blame crash and did gott a crash dump - which I'll say two things about.

I probably didn't make this that clear in my original comment - it very much appears that its not a case of my tests 'terminating' - they are being 'terminated' - the tests themselves have no issue - they pass on a windows host 100% of the time.

I mentioned the null reference exception because it did change the outcome on linux - but that null ref is always handled - what we changed was just to not throw it and go down the same path we'd go down when the exception was caught.

If I update my tests to actually fail/crash/have a problem - as you say I do see that on screen.

So not to repeat myself but I do think the issue I'm having is my tests are being terminated by the 'wrapping' test process - not that the tests themselves are terminating. Someone initially posts about an 'infamous 100ms bug' https://github.com/microsoft/vstest/issues/2952#issuecomment-871042574

What bug is this and is there more information on what the fix was/will be and any work arounds.

Regarding versions, heres what I'm running with

vstest.console.dll, Version: 17.0.3+cc7fb0593127e24f55ce016fb3ac85b5b2857fec (this is from the diag.log file)

dotnet sdk version: 6.0.122

I'm running the test dll - by just copying over a .net6.0 output folder to a linux host and running dotnet test ./mytests.dll

tig commented 9 months ago

We had this issue pop up in last few days running our unit tests via a Github action.

   - name: Test
     run: |
       sed -i 's/"stopOnFail": false/"stopOnFail": true/g' UnitTests/xunit.runner.json
       dotnet test --no-restore --verbosity normal --collect:"XPlat Code Coverage"  --settings UnitTests/coverlet.runsettings --blame
       mv -v UnitTests/TestResults/*/*.* UnitTests/TestResults/

I just switched our runs-on from ubunti-latest to windows-latest and the tests now pass.

If I run the tests local in WSL they pass.

rstm-sf commented 6 months ago

Hello, is there any progress? We also encountered this on CI/CD

tig commented 2 months ago

We had this issue pop up in last few days running our unit tests via a Github action.

   - name: Test
     run: |
       sed -i 's/"stopOnFail": false/"stopOnFail": true/g' UnitTests/xunit.runner.json
       dotnet test --no-restore --verbosity normal --collect:"XPlat Code Coverage"  --settings UnitTests/coverlet.runsettings --blame
       mv -v UnitTests/TestResults/*/*.* UnitTests/TestResults/

I just switched our runs-on from ubunti-latest to windows-latest and the tests now pass.

If I run the tests local in WSL they pass.

Checking back on this, hoping something has been fixed in the meantime. It has not.

image