Agent's process stays in memory when it was failed to unload appdomain

CharliePoole commented 8 years ago

@NikolayPianikov commented on Mon Jun 27 2016

From my point of view it could be critical issue

@CharliePoole commented on Mon Jun 27 2016

Is this new with 3.4?

@NikolayPianikov commented on Mon Jun 27 2016

Yes, looks like a regression from 3.2.1

@CharliePoole commented on Mon Jun 27 2016

This is a bit strange. The code at https://github.com/nunit/nunit/blob/master/src/NUnitEngine/nunit.engine/Services/DomainManager.cs#L153 throws an exception if the domain unload throws or if it times out. I have run under console and verified that the exception comes back to the console and is displayed. Maybe the Join is timing out and the Thread.Abort is hanging.

@NikolayPianikov commented on Tue Jun 28 2016

Anyway we should terminate all related agents when the console's process was finished

@CharliePoole commented on Tue Jun 28 2016

Certainly... but we have to figure out how to do that. :-) The code I pointed to is part of the termination and it did change in 3.4.

Ideas:

Look at what changed in 3.4. Of course the change was intended to solve a problem, so we will have to make sure that still gets solved.
Use code from https://github.com/nunit/nunit/blob/master/src/NUnitFramework/framework/Internal/ThreadUtility.cs#L55 which may do a better job of killing the thread.
Agent termination has always been cooperative - that is, we tell it to terminate and it does. We could add some code to TestAgency to actually kill the process if it doesn't terminate within a certain time.

What do you think?

@NikolayPianikov commented on Tue Jun 28 2016

Ok. I will try to find.

@rprouse commented on Tue Jun 28 2016

I need to look at the code, but a timeout in TestAgency and then killing the process seems like the right place.

Better would be identifying why it is happening and fix that ;)

@CharliePoole commented on Tue Jun 28 2016

I reviewed the 3.4 changes and don't see anything that should cause this. Moving on to the other bullet points.

@CharliePoole commented on Thu Jun 30 2016

@NikolayPianikov We really need to get the release out, so we are going with the changes we have so far. We can follow up further on this problem if it continues.

@NikolayPianikov commented on Thu Jun 30 2016

Issue was reproduced on "master", see http://win10nik.cloudapp.net/viewLog.html?buildId=94&buildTypeId=NUnit_NUnit3IntegrationTests&tab=buildResultsDiv

@NikolayPianikov commented on Thu Jun 30 2016

User creates new AppDomain and does not finish thread there. We can reproduce it using mocks.zip

internal class UnloadingDomainUtil
{
    public static void Create()
    {
        var newDomain = System.AppDomain.CreateDomain(System.Guid.NewGuid().ToString(), System.AppDomain.CurrentDomain.Evidence);
        newDomain.CreateInstanceFrom(typeof(UnloadingDomainUtil).Assembly.Location, typeof(UnloadingDomainUtil).FullName);
    }

    public UnloadingDomainUtil()
    {
        new System.Threading.Thread(() => { while(true); }).Start();
    }   
}

[Test]
        public void MyTest()
        {
            UnloadingDomainUtil.Create();
        }

@CharliePoole commented on Thu Jun 30 2016

Can you run it using the release/3.4.1 branch?

@NikolayPianikov commented on Thu Jun 30 2016

One moment

@CharliePoole commented on Thu Jun 30 2016

And you say we were successfully terminating such a case using 3.2.1?

Since this is a user-specific situation, caused by a bad test it could be postponed. The most critical issue is that 3.4 won't run the teamcity extension at all! I can only give this a very short timebox before I do the 3.4.1 release without it.

@NikolayPianikov commented on Thu Jun 30 2016

It is reproduced on 3.2.1 too. May be it is a bit another case (we have an exception "Error while unloading appdomain"), but it could have a similar solution.

@CharliePoole commented on Thu Jun 30 2016

When I add your test to a console run, the only effect is a brief pause before it terminates. This works if I run in process or in a separate process.

I'm going to move ahead with 3.4.1 as it is. After it's out, we can try to address this if it's still a problem.

@tfabris commented on Tue Aug 23 2016

Regarding Nunit issue #1628, "Agent's process stays in memory when it was failed to unload appdomain":

This issue is a blocking issue for us at the moment, due to the fact that we recently converted our company's tests over from Nunit 2.6.4 to Nunit 3.4.1. There are some details about the issue which were not revealed on the original bug report and discussion. We have some new information about this bug. We would like to request that the bug be re-opened because it is a blocking issue.

New details:

Bug definitely did not occur in Nunit 2.6.4.
Bug occurs in 3.4.1 but we do not know in which build it was introduced.
Bug occurs only on certain test DLLs which induce the situation which cause the message to appear: "Unable to unload AppDomain, Unload thread timed out."
Important: Bug only occurs when the output of the console is redirected. For example, in test automation situations where console output is captured to memory, such as when Team City is running the Nunit Console runner to execute tests. This bug causes Team City to hang and not complete the build. For ease of repro, you can also reproduce in a powershell script which redirects console output.
Important: Bug does not occur when you run the Nunit Console at a plain console (output is not redirected).
Example PowerShell command which does NOT cause a hang (bug still occurs, but the output appears normally and drops back to powershell prompt):

& nunit3-console.exe MyBadTest.dll --x86 --framework=net-4.0
Exmple PowerShell command which DOES INDUCE the hang because it is redirecting the output (this is the same thing that happens to Team City):

$result = & nunit3-console.exe MyBadTest.dll --x86 --framework=net-4.0 write-host "$result"
If you watch the Windows Task Manager in the two cases above, you see that there is a launch of Nunit3-Console.exe and one or more Nunit-agent-x86.exe instances. Only the Nunit3-Console.exe closes at the end of the test, the agent instance(s) remain open and do not close. When there is no console redirection, control returns to the console, but when there is redirection, control does not return to the console until the agent process is forcefully terminated.
This is a blocking issue for any instances where Nunit Console is being launched by any type of automation, because it causes the automation to hang, and fail to process any more commands. The test runner should close down when it is done, regardless of whether the tests had an issue unloading the appdomain or not.

@tfabris commented on Tue Aug 23 2016

Clarifying my last message:

The same bug occurs whether or not you redirect the console output: The nunit agent remains in memory and does not leave.
However, you only notice a problem (it's only a blocking issue) in situations where the console output is redirected, such as part of an automation system. That is the situation in which the automation hangs because of the Nunit agent which does not close.

@CharliePoole commented on Tue Aug 23 2016

Some work is ongoing in this area. I'll reopen this and we can test when the changes are in.

@tfabris commented on Tue Aug 23 2016

Sweet. We have an easy repro case, so let me know if you want me to verify a fixed build.

@CharliePoole commented on Sat Aug 27 2016

@tfabris Can you test this against the latest MyGet build? It's at https://www.myget.org/feed/Packages/nunit and the latest version at the moment is 3.5.0-dev-03158.

@tfabris commented on Mon Aug 29 2016

I'm very interested in trying out this build of Nunit against our repro case.

However I am unable to obtain it at the link you provided. Following the link brought me to a signup page for the MyGet service. I had never used MyGet before, and, even after I set up an account there and created a feed, the link you provided ( https://www.myget.org/feed/Packages/nunit ) still does not result in me obtaining the file; it redirects me to a MyGet help/config page. Even after I made sure my profile was filled out and I created a "Feed" in MyGet and tried to add "Nunit" to my feed, same thing: I can't obtain this new build of Nunit that you're trying to link me to. The version of Nunit that is gives me in MyGet is 3.4.1 and I don't see any option to have it give me work-in-progress builds.

Do you have any tips on how to obtain this 3.5.0 dev build?

Thanks!

@CharliePoole commented on Mon Aug 29 2016

@tfabris Sorry - I guess that's only a link for our account to use. There is a Gallery feature in MyGet, but we have not set that up for nunit yet.

Use the following url in Visual Studio (under Nuget sources) to get our myget feed: https://www.myget.org/F/nunit/api/v2

Alternatively, you can grab the build from AppVeyor: https://ci.appveyor.com/project/CharliePoole/nunit-console/build/3.5.0-dev-03158/artifacts

@tfabris commented on Mon Aug 29 2016

Thanks so much, Charlie!!!!

I obtained the package from AppVeyor: package\NUnit.ConsoleRunner.3.5.0-dev-03158.nupkg

I extracted it and ran the console runner contained within that package at the command prompt, pointing it to our repro DLL.

Result: Same problem. Steps:

When the tests in our repro test DLL are complete and all tests have passed, there is the final report of all test scores at the console. All tests pass. Then, after the test scores are displayed, the error message "Unable to unload AppDomain, Unload thread timed out" appears on the console (this is the bad condition that the test DLL must induce in order for the problem below to repro).
At that point in time, "Nunit3-Console.exe" closes and is no longer in memory, as expected. However "nunit-agent-x86" remains in memory and does not leave memory. (This is the bug.)
If you are running directly at the DOS console, then control returns to the console, despite the agent remaining in memory.
If you are redirecting the console output of the Nunit runner (for example, if you are running a script which pipes the output somewhere else for capture/logging, or, you are running in Team City which pipes the output into memory for capture/logging), then the progress hangs and control is not returned to the calling program. It waits forever and does not return. If your test is running in Team City, the the build hangs forever and never finishes. (This is the most critical failure mode of the bug as it blocks production.)

Details, from an analysis by our escalation developer, using the Nunit 3.4.1 build:

The process nunit-agent-x86.exe is kept alive because its main thread is waiting forever for a manual Event to be set. This is the call stack of the main thread:

OS Thread Id: 0x6ff0 (0) Child SP IP Call Site 00afede8 7745718c [HelperMethodFrame_1OBJ: 00afede8] System.Threading.WaitHandle.WaitOneNative(System.Runtime.InteropServices.SafeHandle, UInt32, Boolean, Boolean) 00afeecc 723a49d1 System.Threading.WaitHandle.InternalWaitOne(System.Runtime.InteropServices.SafeHandle, Int64, Boolean, Boolean) 00afeee4 723a4998 System.Threading.WaitHandle.WaitOne(Int32, Boolean) <<< Waits forever here… 00afeef8 723a496e System.Threading.WaitHandle.WaitOne() 00afef00 0539707e NUnit.Engine.Agents.RemoteTestAgent.WaitForStop() 00afef08 00d2173e NUnit.Agent.NUnitTestAgent.Main(System.String[]) [C:\Users\xxxxxxxxxxx\AppData\Local\JetBrains\Shared\v04\DecompilerCache\decompiler\9E876AE6-521E-4A30-9583-EE5D668CC2CC\76\160e3076\NUnitTestAgent.cs @ 92] 00aff0e8 73301376 [GCFrame: 00aff0e8]

This is the NUnit source code relative to the methods above:

public void WaitForStop() { this.stopSignal.WaitOne(); }

@CharliePoole commented on Mon Aug 29 2016

This issue should have been moved to the new nunit-console repository when we split repos. Doing so now.

CharliePoole commented 8 years ago

I confirmed this in our own build of the engine master branch, which had not been running since we split repositories.

tfabris commented 8 years ago

Thanks so much for looking into this in detail, Charlie!

Question: Is there a chance that this issue is a regression caused by the fix to this earlier issue?

https://github.com/nunit/nunit/issues/329

CharliePoole commented 8 years ago

@tfabris I'll check that. The earlier fix was that we were not reporting the error. AFAIK the process would probably still have hung, but I can run some comparisons to see.

tfabris commented 8 years ago

Thanks! I don't know which build it was introduced in, because we've been on 2.6.4 for a long time and only just switched to Nunit3 in the last week or so. So the issue could have been there for a long time, any time after 2.6.4.

rwencel commented 8 years ago

We're having the same (or similar?) problem - The nunit-agent.exe process hangs after the error "Unable to unload AppDomain, Unload thread timed out" when testing certain assemblies. I suspect it's some of our test or test object finalizers hanging and not releasing the finalizer thread but I'm not sure. Since this hangs our TFS build, we're going to workaround the problem for now with a custom NUnit build that calls "_realRunner.Dispose()" in a try/finally block in MasterTestRunner's Dispose method. That seems to be the cause of the hang, that the agent is never signalled to stop. Let me know if I can help anymore.

davidjward30 commented 8 years ago

In case it helps, we also had the same problem when migrating from 2.6.4 to 3.4.1 with teamcity. We finally managed to track down the problem to one of our tests creating a child process which was left running. This this child process was designed to automatically close when it's parent closed so would not affect subsequent test runs on the same machine. We now close the process and all is well.

tfabris commented 8 years ago

In our case, we discovered that the child process which wasn't shutting down was the third party library "Serilog". It would eventually shut down on its own but it took to long for Nunit/TeamCity's tastes and thus we would get the hang and the test failure. We have had to temporarily work around the problem by disabling Serilog during the run of our integration tests, but it means that when the tests reveal a bug, we don't have any logging to look at. So we're anxiously awaiting a fix for this in Nunit 3.x so that we have the option to re-enable logging in these cases.

NikolayPianikov commented 8 years ago

@CharliePoole I looked into this question and I think I could try to make a quite robust fix for this issue, but it will use Job Objects Win32 API. Is this acceptable?

rprouse commented 8 years ago

@NikolayPianikov we cannot use any Win32 API's because we want to maintain compatibility on all platforms.

If you have a way to wrap the Win32 API's into an interface and detect if those API's are available, then return a different no-op interface on other platforms, then we might consider a fix, but please sketch out your plan before you do too much work on it.

rprouse commented 8 years ago

@CharliePoole I haven't seen anything for this. Are you still planning on working on it before release? I don't think we have reliable repro steps do we?

tfabris commented 8 years ago

Our repro steps that we did were 100 percent reliable. Though we don't have example code of a test that will induce them since that's internal company stuff.

Basically the steps were:

Write a test that calls an object that opens a thread and never closes it.
Run this test under Nunit 2.6.4: Get "Unloading Appdomain" message but there is no hang.
Run this test under Nunit 3.x: Get "Unload Appdomain" message and the Nunit-Agent process hangs and never exits.

CharliePoole commented 8 years ago

Although it's hard to replicate, we do understand the source of the problem. Years ago, we decided it didn't matter if an AppDomain didn't unload or a subordinate process never exited. Now we have various cases where it does matter.

I was planning to put something in to kill any processes that we started on exit, provided they are still running. I haven't looked at how easy it will be. TestAgency has the process info and the console runner knows when it's exiting. If it's relatively easy, I'll do it, then we can see whether it solves the problem.

tfabris commented 8 years ago

Sounds like a reasonable plan!

rwencel commented 8 years ago

Another repro is to put an infinite Thread.Sleep in a test fixture's finalizer. I described a fix for the 3.4.1 release above, but it seems the relevant code has changed since 3.4.1. Basically, the ProcessRunner wasn't getting disposed b/c the MasterTestRunner threw an exception b/c the AppDomain didn't unload. Without the ProcessRunner disposed, the agent is never signalled to stop and hangs forever.

ghost commented 8 years ago

Note: In my case I get the error when I am using a external Library (NetMQ). I decided to disable theese unit tests because I can't dig so deep in to that foreign library. My project is open source - take a look at https://github.com/leo90skk/qdms/issues/57 .

NikolayPianikov commented 8 years ago

@rprouse If it was possible to have an extension point to control creating new agents' processes we could create an extension. This extension just should create a Job for parent process with specific parameters all other works will be done by OS. I could make a prototype. Of course we should not forget about Linux Mono and .net core. But on the first step we could do it only for full .net users on OS Windows. Also we could use platform specific assemblies of extensions in the scope of each extension in the future. Also this extension could control many other things. For example run tests under specified user account/windows integrity level and something other useful.

rwencel commented 8 years ago

@rprouse @CharliePoole The problem is still repro in last night's latest source code. Using a try/finally in ProcessRunner's Dispose method fixes the problem. This fix signals the agents to stop even if the AppDomain unload throws an error. It previously never signaled the agent to stop so the nunit-agent.exe process hung on agent.WaitForStop(). I'm really hoping this makes it in 3.5 since we're using a custom build now to avoid a TFS build hang. Thanks guys!

Repro Command Line "nunit3-console.exe" "bin\Debug\Repro.dll" --framework=net-4.5 --agents=4

Repro

    [TestFixture]
    public class TestClass
    {
        [Test]
        public void Test()
        {
            Assert.Pass();
        }
        ~TestClass()
        {
            System.Threading.Thread.Sleep(TimeSpan.FromDays(1));
        }
    }

Fix in ProcessRunner.cs

        protected override void Dispose(bool disposing)
        {
            try {
                base.Dispose(disposing);
            } finally {

                try {
                    if (disposing && _agent != null) {
                        log.Debug("Stopping remote agent");
                        _agent.Stop();
                        _agent = null;
                    }
                } catch (Exception e) {
                    log.Error("Failed to stop the remote agent. {0}", e.Message);
                    _agent = null;
                }
            }
        }

CharliePoole commented 8 years ago

No fix to this issue has been pushed so of course the problem persists.

I agree with your diagnosis but am a little nervous with a try / catch inside a finally block. Have you seen this work when an exception is followed by a second exception inside the finally?

CharliePoole commented 8 years ago

Research indicates it's legal to do a try/catch in the finally but almost universally considered bad form. 😄

I'll try to restructure to avoid the nesting, but if I can't come up with something quick, @rwencel 's code will go into the release.

CharliePoole commented 8 years ago

Dispose is doing two things: unloading the runner and stopping the agent. Hence, there are four cases of success/failure...

Both succeed - no problem, no issue here.

Unload fails but stop succeeds. I'm displaying a two line warning, for example...

Unable to unload AppDomain. Unload thread timed out.
Agent process was terminated successfully after error.

Unload succeeds but stop fails. I'm displaying...

Failed to stop the remote  agent. Simulated failure in agent stop.

Both unload and stop fail. I'm displaying

Unable to unload AppDomain. Unload thread timed out.
Failed to stop the remote  agent. Simulated failure in agent stop.

Roemer commented 8 years ago

I have the same problem and as you say, it should be a issue in the test code, I have one where I spent hours to solve and now I'm out of ideas and still end in the AppDomain unload exception. The problem occurs in one of my projects where I do UI testing. So most tests open an application, do some ui stuff and then exit the application. Now I could create a standalone version which fails 100%. It only contains a WinForms sample app with an empty listview and an empty treeview. The test itself only has some initialization code where it selects the tree. There are various things I found out already:

The test must run at least twice (two TestFixture with parameters) so that the 2nd will fail
If the test is run more than two times, only the last one fails, the others work correct
Deleting the ListView on the sample app solves the issue
Moving the "this.Controls.Add" (in the Designer.cs) for the listview below the one from the treeview solves the issue (this is very strange to me...)
Commenting out var tree = mainWindow.FindFirst(... solves the issue

I am absolutely out of ideas what could go wrong and especially why some totally unrelated things fix this issue. I have a very simple project which shows the problem here: http://flauschig.ch/transfer/TreeHangTest.zip Just build it and then run the runtetst.bat and you see the problem. Anyone mind helping me finding why the app domain cannot unload?

CharliePoole commented 8 years ago

@Roemer Are you saying that the agent still hangs with the fix in the new release?

Roemer commented 8 years ago

@CharliePoole No it does not hang with 3.5 but I get the warning "Unable to unload AppDomain" which I want to fix but I ran out of ideas (see post above). Is there any way to get more information about why it fails?

tfabris commented 7 years ago

I'm glad to hear that it does not hang with version 3.5 any more.

Today I tried to locate the version 3.5 console runner but I cannot find where to download it. I go to this location : https://github.com/nunit/nunit/releases/tag/3.5 but all of the .zip files located there do not contain the console runner. The release notes say that the console runner is now a separate build, but I cannot find where to get a download of that build.

rprouse commented 7 years ago

@tfabris The Nunit Console is now released separately and from this repository, https://github.com/nunit/nunit-console/releases

CharliePoole commented 7 years ago

@Roemer I created issue #111 to look at how to get more info to you when the AppDomain won't unload.

tfabris commented 7 years ago

Rob, thank you so much for pointing me to the correct download location.

davidjward30 commented 7 years ago

I am still getting occasional issues with the agent process staying in memory with 3.5.0. This problem appeared when we migrated from 3.2 to 3.4. I'm fairly sure that the AppDomain is being unloaded successfully in our case. Looking at the 'dead' agent processes, they only have nunit assemblies loaded into memory.

I'd like to help but it seems very hard to reproduce. I could try and get a process dump from the agents in the 'dead' state.

davidjward30 commented 7 years ago

Following on from my previous comment, I'm starting to spot a pattern on our teamcity runs. I have a feeling that "Cancelling" while nunit is running could be causing the agent processes to be left running. I suspect that Teamcity simply terminates the nunit-console process. This has always worked reliably for us in nunit 2.6 and 3.2.

CharliePoole commented 7 years ago

@davidjward30 Your last comment makes a certain amount of sense. The agent process runs until it is told to terminate by the main process. We may need to rethink that if we can't properly handle termination of the process. Could be a fairly big thing. 😦

Since this really has nothing to do with the closed issue of being unable to close the AppDomain, could you please file this as a new issue?

CharliePoole commented 7 years ago

@NikolayPianikov can you tell us if TC cancels the NUnit process in some circumstances and, if so, exactly what it does to accomplish it?

davidjward30 commented 7 years ago

These cancellations occur when a user requests a CI build to be cancelled. @CharliePoole - I have created a new issue #128

davidjward30 commented 7 years ago

@CharliePoole - the way we typically handle this situation in our software is to have the child process open a handle to it's caller and then exit early if it closes. This seemed to be simpler than using "job objects" that was mentioned earlier in this thread.

NikolayPianikov commented 7 years ago

TeamCity has the button to finish a build. TeamCity stops all tree of processes from the root, but sometimes, for example when console.exe was stopped, but agents did not do the same - there is no way to find all children. In this case "Jobs" could help but this approach will work only for Windows. We could use some kind of the "heartbeat" signal to check that the console is still working and each agent should do "auto exit" when the console was closed

tfabris commented 7 years ago

Another reason that team city migh stop the test is if the test has exceeded its runtime limit and is considered to be "hung". Team city will terminate the Nunit console in that situation, but not the agents.

tfabris commented 7 years ago

I'm sorry that I don't know the exact method that team city uses to perform the termination.

pavzaj commented 7 years ago

Hello, I have similar problem. When I run my test suite like so: D:\BuildAgent2\work\c11fa50512543249>D:\NUnit-3.5.0\nunit3-console.exe <testDllsNames> --where="cat!=FunctionalTest&&cat!=IntegrationTest&&cat!=StressTest" --agents=0 --dispose-runners --framework=net-4.5 --work=<workdir> --verbose --teamcity --result=.\CodeAnalysis-Nunit-Result.xml;format=nunit2

I get error: Unable to unload AppDomain, Unload thread timed out. Agent Process was terminated successfully after error. The error occur both on teamcity and on plain windows cmd. No nunit-agent process seems to hang after that. The result file produced after this error is empty. I need it's content very bad :). If I run the same command with nunit console 3.2.1 some test throws exception, but at the end no nunit error is thrown and result file is correct.

rprouse commented 7 years ago

@pavzaj what happens if you run from the command line and add the --inprocess command line?

rprouse commented 7 years ago

@CharliePoole do you think we should open a new issue to track the new reports. It is related, but the issue is different.

pavzaj commented 7 years ago

With --inprocess (instead of --agents=0) it ends up with message "Exception encountered unloading AppDomain". As I said test suite throws some exceptions during execution with console runner 3.2.1. These exeptions does not show during execution with console runner 3.5.0, but the whole run ends with "Exception encountered unloading AppDomain".

The test suite is built with nunit framework 3.2.1.

nunit / nunit-console

Agent's process stays in memory when it was failed to unload appdomain #43