nunit / nunit-console

NUnit Console runner and test engine
MIT License
217 stars 152 forks source link

Planning: replacing remoting with custom protocol #266

Closed jnm2 closed 3 years ago

jnm2 commented 7 years ago

The SocketException thing continues to give us grief and it's gotten to the point where I feel I'm sinking too much time into something we don't ultimately even want to keep. I'd like to put some energy towards the custom protocol idea.

Goals:

I don't expect there will actually be much coding to do, mostly design. I'm fine doing all the coding though of course the more the merrier, but it would be great to have everyone's thoughts on the design.

jnm2 commented 7 years ago

@nunit/engine-team Marking as pri:high because I haven't been able to kick those SocketExceptions entirely and there is no workaround for folks.

jnm2 commented 7 years ago

Discovery

When running on the same machine, discovery happens when the console passes its connection info the the agent it starts. What would it look like if an agent was on a remote machine? Would we assume in that scenario the agent instances would be started manually with --listen --port=12345 and that the console runner would be manually given a list of agent endpoints (name/ip + port)? If started manually, the agent instances should be stopped manually. Should we have default port if not specified?

Or would we want some kind of autodiscovery? We could easily implement a UDP multicast autodiscovery protocol like SQL Server instance discovery uses.

(I have implemented the client side of the SQL instance discovery protocol from scratch because SqlDataSourceEnumerator blocks for an arbitrary amount of time and I wanted a non-blocking ObservableCollection that notifies immediately as each server replies and keeps listening until you shut it down.)

Xamarin agent

This is a very real, very serious thought I've had as I used NUnit with a Xamarin project. We should eventually provide a standalone Xamarin agent application which receives assemblies over the wire from the NUnit engine running on your development machine and tests them, sending results back in realtime.

Let's make sure we don't block this scenario.

Now to be clear, I'm marking off the implementation work for the remote machine scenarios to be done only after we have parity with our current system. This is just the brainstorming phase so we can plan ahead.

jnm2 commented 7 years ago

Protocol API

This is the tricky part. Remoting is a sort of viral magic that mostly just works. It's hard to see the de facto boundaries of the protocol just by looking at the code.

I assume we start with ITestAgent and ITestAgency and work our way back from there.

Which of these methods are actually used in remoting?

ITestEngineRunner CreateRunner(TestPackage);
TestEngineResult ITestEngineRunner.Load();
TestEngineResult ITestEngineRunner.Reload();
void ITestEngineRunner.Unload();
TestEngineResult ITestEngineRunner.Explore();
int ITestEngineRunner.CountTestCases(TestFilter);
TestEngineResult ITestEngineRunner.Run(ITestEventListener, TestFilter);
void ITestEngineRunner.StopRun(bool force);

// Not necessary; everything will have an async overload.
AsyncTestEngineResult ITestEngineRunner.RunAsync(ITestEventListener, TestFilter);

// Let's discuss the wider discovery picture first
ITestAgency Register(ITestAgent);
void ITestAgent.Id;
void ITestAgent.Stop();
void ITestAgent.Start();

Did I miss any?

TestPackage, TestEngineResult and TestFilter can be serialized as binary blobs.

We'll also want to leave open the possibility of including assemblies inside TestPackage so that we can seamlessly run in other machines.

jnm2 commented 7 years ago

Protocol Format

I'm assuming we'll continue to use TCP for all connections, barring initial discovery. I may be wrong but I can't think of any reason we'd want to give folks the ability to use TLS. If you're in a totally rare scenario, that's what VPNs are good at.

When I'm just trying to keep it simple and there is no compat concern, my leaning is towards BinaryWriter/BinaryReader. But let's picture what we really want. Is there a good reason to adhere to an existing protocol format? HTTP? JSON?

jnm2 commented 7 years ago

I think I've gotten as far as I can until we talk about discovery.

rprouse commented 7 years ago

For discovery, I think auto-discovery is probably overkill at this point. When we go to remote machines, opening firewall ports on either the console runner machine or the agent machine will be an issue.

Does it make more sense for the console runner to specify a port and listen on it for the agent to connect back to? The console runner can start with a given port number and increment as it starts agents. Before spawning the agent, it can also check if the port is free. It can't do that for a remote machine.

Also, if we are starting with running agents on the local machine, why not create a communications interface that we can extend to TCP latet, but start with IPC/Named Pipes now? NamedPipeClientStream and NamedPipeServerStream are available in .NET Standard 1.3.

For protocol format, I am fine with binary. I think HTTP is overkill. A JSON wire format would be nice for monitoring in WireShark, but I am not sure we want to manage our own JSON serialization.

jnm2 commented 7 years ago

Good points. Autodiscovery doesn't seem worthwhile, and because we don't have a compatibility concern, we won't be constrained if we ever see a reason to add it later.

Does it make more sense for the console runner to specify a port and listen on it for the agent to connect back to?

If we ever work with remote machines, it seems we'll have to make the listening happen on the agent side. For the remote machine scenario, it would be nice if the agent stayed open by default. If I double click the agent exe, it should stay there until I press Ctrl+C or Ctrl+Break. (It should warn before closing if it's currently in use by a runner.) It should pick a default or random port unless I specify --port=n. I would pass --remote-agent machinename:12345 parameters to the runner letting it know how to find remote agents.

So when we are in the scenario where the runner (rather than the end user) decides to spin up an agent in the same machine, we could:

  1. Keep the same protocol model in-machine as the one we have to use for remote machines, where the listening is done by the agent. The console would pass --port={rnd} as well as --autoclose, wait until a connection is successful like it would in the remote machine scenario, and when the console notifies the agent that it is no longer using it (and the agent's connected runner count is zero), the agent closes because --autoclose was specified.

  2. Equip the runner to do listening as well. My gut doesn't like this as much because it's a separate mode of operation to have to think about for both the console and the agent. It would take a port per runner rather than a port per agent. If we do this, the console would start listening on a random port and start the agent with a --connect=localhost:{rnd} parameter. Is it worth developing both models? Perhaps it will be easy to abstract the two connection models in the same logic which also chooses named pipes over TCP when possible.

Why not start with named pipes? Sounds great to me!

I went through the same thought process with respect to HTTP and JSON. Let's do binary.

jnm2 commented 7 years ago

I'm going to start off implementing this since further feedback isn't forthcoming.

ChrisMaddock commented 7 years ago

@jnm2 - Sorry, haven't replied before as I have no experience in this sort of area, so very little to offer. Everything you've written looks sensible to me.

jnm2 commented 7 years ago

I'm hoping for two scenarios:

  1. Runner starts the agent, uses anonymous pipe
  2. Runner connects to user-started agent (same machine or remote machine), uses TCP

In both cases, --autoclose isn't necessary.

jnm2 commented 7 years ago

Here's my first sketch of the agent interfaces: https://github.com/jnm2/nunit-console/blob/new_agent_protocol/src/NUnitEngine/nunit.engine/Agents/IAgentPool.cs

And this would be close to the final implementation of ProcessRunner, given those interfaces: https://github.com/jnm2/nunit-console/blob/new_agent_protocol/src/NUnitEngine/nunit.engine/Runners/ProcessRunner.cs

@nunit/engine-team Please do look over my shoulder and tell me your impression. It's mainly you who will be maintaining this code if something happens to me. =P

jnm2 commented 7 years ago

Oh fun. .NET 2.0 doesn't have anonymous (or named) pipes; however, netstandard1.3 does via https://www.nuget.org/packages/System.IO.Pipes/.

rprouse commented 7 years ago

That is a lot of code to bring over, probably even more than our Linq code. I think I would prefer to keep it separate if possible. I wonder if the NUnit.Linq code should expand to become NUnit.Compatibility to add in stuff like this?

As far as a review goes, I am having trouble groking exactly what you are planning based on just scanning the code on GitHub. I will probably need to download and look. Do you have a rough architecture diagram?

jnm2 commented 7 years ago

@rprouse I realized we don't even have Task<> let alone ReadOnlySpan<>. I don't think I'd ever finish... I'm thinking either skip pipes for now or else do a custom implementation and not try to adhere to the BCL API.

Hmm, this is something I need to learn how to do without being in person with a whiteboard. I can explain things by commenting on code but I do not have a diagram; is there a good way for me to do that?

jnm2 commented 7 years ago

IAgentPool: One instance per lifetime of the runner. A pool of remote processes in our case. IAcquiredAgent: Represents a lease of a single remote process. Disposing the lease releases the remote process back to the pool to be shut down or kept around at the pool's discretion.

IAcquiredAgent.LoadPackage loads a package into the remote process, returning the TestEngineResult for the test runner and an IAgentPackageContext.

IAgentPackageContext: The reason Explore, CountTestCases and StartRun are on IAgentPackageContext (rather than on IAcquiredAgent with additional TestPackage parameters) is because we may in the future be sending large binaries to a remote device. Once the package of binaries is transmitted it is kept around on the remote device until the IAgentPackageContext is disposed. (See how ProcessRunner implements LoadPackage and UnloadPackage.)

IAgentRunContext: Represents a test run. More than one test run can be started at the same time from the same IAgentPackageContext, because it may be expensive to resend a TestPackage containing large binaries to a remote device. IsCompleted, OnCompleted and GetResult are because we don't have the Task<> type. See how ProcessRunner implements Run and RunAsync.)

CharliePoole commented 7 years ago

@jnm2 If I were doing this, I'd start with a design document.

First I would outline the scope of the project and the problems it's intended to solve. That's never been clear, at least to me, in the issue title / description. We use remoting in a lot of places in NUnit, both in the framework and in the engine. So I would describe all those different places and make clear which ones are to be replaced, and why.

Then I would look at alternative designs. The key issue, it seems to me is how to fit variations in communication mechanism into the architecture. It seems to me that it needs to be represented by some object type within the system, just as we now use drivers to serve as proxies for a framework. Remoting is not actually necessary in the existing architecture, but it is not split out into a layer or module that can be switched in and out, the way runners, drivers and services are. So I think I would want to make the communication channel into a first-class component within the engine.

Once that architecture is established, I'd then look at what communication channels to implement first. There's no more reason to pick one type of channel than there is to pick only one runner type or one service. It should all be easily replaceable. I don't think my first choice would be a named pipe, but that might be a reflection of my own biases. I'd lean toward a very low-level protocl over TCP as the initial implementation.

But whatever you do, I think you need to have a design before you start coding.

ChrisMaddock commented 7 years ago

@jnm2 - I've added this to the 3.8 milestone, as one of the bugs causing people most pain in the console at the moment. Given the work involved, I imagine we might want to move it to a future release, but I think it's one that we-as-a-team should keep an eye on. Hope that's ok!

jnm2 commented 7 years ago

Yes, I definitely want to get to this and would welcome involvement from everyone. I have begun to suspect as Rob suggested that at least some of the time the SocketException is just masking the fact that the agent process crashed because of test assembly or NUnit framework issues (say, CPU pegging). If true, this would continue to follow us into any custom TCP protocol as well.

ChrisMaddock commented 7 years ago

Ah ok. Maybe this isn't the silver bullet I was hoping for. ๐Ÿ˜ž

I guess the StackOverflow case can only be viewed with --debug. I'm not sure what other cases would look like, and if there's any way we could squeeze more information out of agent crashes. Incidentally, this is exactly the issue I've been looking at with https://github.com/nunit/nunit-console/issues/228 today!

ChrisMaddock commented 6 years ago

Taking this out of 3.8 - although I'm not sure we ever expected it in there! ๐Ÿ˜„

jnm2 commented 6 years ago

:+1: I am coming back to this.

ChrisMaddock commented 5 years ago

Realistically, I think this needs to be one of our next steps in .NET Core support. We could do an inprocess-only .NET Standard console (which I might, at some point!) - but I believe it would be restricted to running tests only in runtime framework of the console, as opposed to the framework that the test project targets - which is how we currently treat full-framework tests.

@jnm2 - would you mind revisiting this? Where did things get up to, sounds like there was a branch in progress? Would be happy to help out where possible, but it's not an area I'm very knowledgeable in!

Also, looks like we were considering that this might be an area where targeting .NET 2.0 is causing an extra headache. Given we've just dropped support in the framework, should we revisit the cost of supporting .NET 2.0 in the console/engine?

rprouse commented 5 years ago

Agreed. If we want a full fledge console runner, then we need to look at this. I had hoped that a later version of .NET Core would support remoting, but that isn't going to happen.

.NET Remoting was identified as a problematic architecture. It's used for cross-AppDomain communication, which is no longer supported. Also, Remoting requires runtime support, which is expensive to maintain. For these reasons, .NET Remoting isn't supported on .NET Core, and we don't plan on adding support for it in the future.

For communication across processes, consider inter-process communication (IPC) mechanisms as an alternative to Remoting, such as the System.IO.Pipes or the MemoryMappedFile class.

Across machines, use a network-based solution as an alternative. Preferably, use a low-overhead plain text protocol, such as HTTP. The Kestrel web server, the web server used by ASP.NET Core, is an option here. Also consider using System.Net.Sockets for network-based, cross-machine scenarios. For more options, see .NET Open Source Developer Projects: Messaging.

Personally, I think IPC would be easiest, then we don't need to worry about ports or networks, but that limits us to running on the current machine.

The other task that we should look at is adding .NET standard versions of the extensions. This will require that we come up with a backwards compatible way of finding platform specific extensions in a NuGet package. Maybe we leave the full framework version where it is and put the .NET Standard version in a subdirectory? I also think we may want to limit how many different versions of .NET Standard we support.

ChrisMaddock commented 5 years ago

The other task that we should look at is adding .NET standard versions of the extensions. This will require that we come up with a backwards compatible way of finding platform specific extensions in a NuGet package

We solved this problem earlier this year. ๐Ÿ˜Š The solution we used was chaining .addins files (already supported) - so a .addins file in the extension root dir points to all the subdirs, and then the engine selects the โ€˜bestโ€™ version to use.

This is already implemented, and the same model works for .NET standard extensions - we just need to teach the engine how to correctly identify them!

jnm2 commented 5 years ago

@ChrisMaddock

@jnm2 - would you mind revisiting this? Where did things get up to, sounds like there was a branch in progress? Would be happy to help out where possible, but it's not an area I'm very knowledgeable in!

This would be fun. I didn't get far before realizing I needed a design document as Charlie suggested. https://github.com/jnm2/nunit-console/commits/new_agent_protocol Ten to one I'd start differently now.

What should our short-term goals be? Keeping the current exact functionality while replacing remoting?

Also, looks like we were considering that this might be an area where targeting .NET 2.0 is causing an extra headache. Given we've just dropped support in the framework, should we revisit the cost of supporting .NET 2.0 in the console/engine?

My distant memory says yes, but let's expend effort when we're sure of the value.

ChrisMaddock commented 5 years ago

Keeping the current exact functionality while replacing remoting

I would agree, yes. ๐Ÿ˜Š Main priority in my mind is supporting out-of-process running in the .NET standard engine - that should just require a technology change rather than a functionality change. Removing the silent capture of socket exceptions would be another bonus!

jnm2 commented 5 years ago

Since our model is that the agent processes are started and stopped by the console (TestAgency), by far the simplest minimum starting point is to use each agent process's stdin and stdout as a duplex stream rather than TCP.

When the day comes that the console wants to connect to an agent process that was started before the console process (or on another machine), then it makes sense to make each agent process run a TCP server.

jnm2 commented 5 years ago

I have a proof-of-concept that seems to be working to the point that the rest of the implementation is a simple exercise. I'm really happy with the way things are ending up.

JVimes commented 5 years ago

Any progress? In the meantime, is there any way to see the agent-crashing error message? I had no luck with --inprocesss, verbose logs, or any other parameters.

ChrisMaddock commented 5 years ago

Please can you provide some more information on your situation. What do you mean by โ€˜you have no luckโ€™ with inprocess and logs? Does inprocess prevent the crash? Is there any information at all in the logs?

Can you also add your command line, and what platform are you running on?

JVimes commented 5 years ago

Sure, let me know what you think.

Command:

nunit3-console.exe ".\path\to\MyUnitTests.dll" [--inprocess | --trace=verbose]

Platform: Windows Server Core 1809 amd64

--inprocess causes console output to stop early, no error. Here's the output:

Click to expand
NUnit Console Runner 3.10.0 (.NET 2.0)
Copyright (c) 2019 Charlie Poole, Rob Prouse
Thursday, August 1, 2019 4:02:30 PM

Runtime Environment OS Version: Microsoft Windows NT 10.0.17763.0 CLR Version: 4.0.30319.42000
Test Files C:\path\to\myUnitTests.dll
PS C:\project\path>

--trace=verbose creates three log files. I don't recognize any useful info, but maybe you do: nunit-agent_1964.log, InternalTrace.1964.myUnitTest.dll.log, InternalTrace.608.log.

Here's the original console output:

Click to expand
NUnit Console Runner 3.10.0 (.NET 2.0)
Copyright (c) 2019 Charlie Poole, Rob Prouse
Thursday, August 1, 2019 3:36:22 PM
Runtime Environment OS Version: Microsoft Windows NT 10.0.17763.0 CLR Version: 4.0.30319.42000
Test Files C:\mySolutionFolder\myUnitTestProjectFolder\bin\Release\myUnitTestProject.dll
.\packages\NUnit.ConsoleRunner.3.10.0\tools\nunit3-console.exe : + CategoryInfo : NotSpecified: (:String) [], RemoteException + FullyQualifiedErrorId : NativeCommandError
Unhandled Exception:
NUnit.Engine.NUnitEngineException: Remote test agent exited with non-zero exit code -2146232797 at NUnit.Engine.Services.TestAgency.OnAgentExit(Object sender, EventArgs e) at System.Diagnostics.Process.OnExited() at System.Diagnostics.Process.RaiseOnExited() at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx) at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx) at System.Threading._ThreadPoolWaitOrTimerCallback.PerformWaitOrTimerCallback(Object state, Boolean timedOut)
Errors, Failures and Warnings
1) Error : System.Net.Sockets.SocketException : An existing connection was forcibly closed by the remote host --SocketException An existing connection was forcibly closed by the remote host
Server stack trace: at System.Net.Sockets.Socket.Receive(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags) at System.Runtime.Remoting.Channels.SocketStream.Read(Byte[] buffer, Int32 offset, Int32 size) at System.Runtime.Remoting.Channels.SocketHandler.ReadFromSocket(Byte[] buffer, Int32 offset, Int32 count) at System.Runtime.Remoting.Channels.SocketHandler.Read(Byte[] buffer, Int32 offset, Int32 count) at System.Runtime.Remoting.Channels.SocketHandler.ReadAndMatchFourBytes(Byte[] buffer) at System.Runtime.Remoting.Channels.Tcp.TcpSocketHandler.ReadAndMatchPreamble() at System.Runtime.Remoting.Channels.Tcp.TcpSocketHandler.ReadVersionAndOperation(UInt16& operation) at System.Runtime.Remoting.Channels.Tcp.TcpClientSocketHandler.ReadHeaders() at System.Runtime.Remoting.Channels.Tcp.TcpClientTransportSink.ProcessMessage(IMessage msg, ITransportHeaders requestHeaders, Stream requestStream, ITransportHeaders& responseHeaders, Stream& responseStream) at System.Runtime.Remoting.Channels.BinaryClientFormatterSink.SyncProcessMessage(IMessage msg)
Exception rethrown at [0]: at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg) at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type) at NUnit.Engine.ITestEngineRunner.Run(ITestEventListener listener, TestFilter filter) at NUnit.Engine.Runners.ProcessRunner.RunTests(ITestEventListener listener, TestFilter filter)
Test Run Summary Overall result: Failed Test Count: 0, Passed: 0, Failed: 0, Warnings: 0, Inconclusive: 0, Skipped: 0 Start time: 2019-08-01 21:36:22Z End time: 2019-08-01 21:36:25Z Duration: 3.031 seconds
Results (nunit3) saved as TestResult.xml
NUnit.Engine.NUnitEngineUnloadException : Agent Process was terminated successfully after error. ----> System.Net.Sockets.SocketException : No connection could be made because the target machine actively refused it 127.0.0.1:49192
mikkelbu commented 5 years ago

@JVimes -2146232797 indicates COR_E_FAILFAST.

Ps. Would you mind creating a new issue for this as it is bit off-topic of "Planning: replacing remoting with custom protocol", and this issue has already 30+ comments.

JVimes commented 5 years ago

Well, I really just came to ask my first comment:

Any progress? In the meantime, is there any way to see the agent-crashing error message?

Rather than you guys help me debug (I appreciate it!), I'd rather the console runner show the root error message. My understanding is that some agent errors are lost and a socket exception is shown in their place. Users struggle to debug without the real error: #225, #335, #340, #370, #391, #608, etc.

I thought this ticket was suggesting writing a .NET Standard console runner that helps with this problem ("The SocketException thing..."). If so, that's what I'm checking in on.

mikkelbu commented 5 years ago

Sorry if I sounded grumpy, but many of our issues have a tendency to become quite long (and time consuming to read).

With respect to progress then I think it is better to ask in the PR that is associated with this issue, #576, but the last comments in the PR are from the beginning of May, and I don't think much have been done since then. @jnm2 can possibly provide more information.

Regarding your problem then I don't think that a new protocol will solve the issue as it also fails when run in-process (usually we recommend to run in-process to diagnose problems), so my best guess is still that somewhere there is a call to Environment.FailFast(...) (at least I can exhibit the same behaviour locally - and I can find a stacktrace and message in the Windows Log => Application event log).

JVimes commented 5 years ago

No worries, I know how it goes. Thanks!

jnm2 commented 5 years ago

Replying to this PR thread: https://github.com/nunit/nunit-console/pull/576#discussion_r270631647 just for ease of discussion.

I made a mistake in trying to tie the concept of the agency to the agent process. My goal was for the user to be able to start an agent process that would be longer-lived than the console process. For that to work, the process that the user starts needs to be the TCP host and the console runner needs to be the TCP client. Neil also needs the option to have each agent process run a TCP host if I understand his comments.

But it also makes sense to have the console as the TCP host when it is the one spinning up the agent processes because there is then only one host instead of one per agent. It's also easier to talk about because that's what we've been doing. (I appreciate that background, Charlie.)

There might even be a good reason to have the ability (not the necessity) of running a long-lived server process (agency) with shorter-lived agent processes and shorter-lived console processes. Imagine a CI script testing on a device emulator. It might have steps like this:

I'd rather call the "agency" (the TCP host server) the "agent" and call the "agent" a "worker" or "test host", because it aligns better with what I think other tools mean by "agent" (= longest-lived listening service), but we can't do that. That would add to the confusion. ๐Ÿ˜€

With my head on straighter thanks to input from all of you, I'm going to make another attempt at the smallest PR that can get us off remoting.

jnm2 commented 5 years ago

I just remembered. In my code, I'm using the names AgentServer and AgentClient. Would it be potentially okay to go with those terms rather than Agency and Agent, respectively?

CharliePoole commented 5 years ago

In terms of naming, it's of course reasonable to use names that make sense to you and everyone else. But it might be good to first grok the usage that's in place.

TestAgency is, of course, an internal service, part of the engine and used by it. The name is an instance of Metaphor, in the XP sense. You go to an agency to find agents. Agents report in to the agency. And so on. Nothing to do with communication.

The notion of an Agent is that it's out there doing stuff on behalf of the engine. The main thing it does is to come up with a Runner when asked.

I came up with the agent concept when I first introduced running tests in a separate process. It was needed there, as a way to introduce a level of indirection. That is, the Agent could decide what kind of Runner was needed and could delay that decision until the last moment.

Because if that context, we associate the agent with the agent process and how it's launched, but that's purely coincidental. I always meant to expand the idea to cover other ways of running, including in process. I may still do that in my own work.

The Agent might well encapsulate a particular means of communication, or possibly some separate interface might be designed to represent the means of communication between the engine and agents. When we started, it wasn't obvious that the communications channel would end up being something that needed to vary, so use of remoting was baked in. IMO it would be a mistake to just replace remoting with a different baked-in channel, as I think you have also discovered in examining the question of who should be client, who server.

I hope this info proves helpful.

CharliePoole commented 3 years ago

@nunit/engine-team I think this can now be closed without action. Do you agree?

jnm2 commented 3 years ago

Yes.

CharliePoole commented 3 years ago

We now have an alternative protocol to remoting, so closing this.