Agent protocol design: allow concurrent runners for the same agent?

jnm2 commented 4 years ago

/cc @nunit/engine-team /cc @CharliePoole because I want to make sure we're not blocking a UI scenario /cc @oznetmaster because you also had a use case that I'd like to be aware of

The current design of ITestAgent has ITestEngineRunner CreateRunner(TestPackage package). This means that you can create as many runner instances as you like for the same agent instance and use them all at the same time.

We do not make use of this functionality. ProcessRunner calls ITestAgent.CreateRunner immediately upon getting an agent. It proceeds to make all its calls through that single runner instance before calling ITestAgent.Stop().

Implementing this functionality in our new wire protocol seems like a lot of work for something we currently don't use. It would require either 1) a new TCP connection per runner or 2) multiplexing all the commands and test results between multiple runners in the same connection. With multiplexing, we'd have to manage a header on every runner message in our protocol specifying which runner session it's talking about. We'd need a system to be sure one runner didn't hog the whole TCP connection.

Allowing the same agent to run more than one runner does sound like it has potential to me, but mainly from the perspective of inverting the whole thing so that the agent is the listener/server and the console is the client. One runner per client connection is much more natural; clients can open up multiple connections if they want multiple runners. However, we weren't interested in inverting the server/client relationship between the console and the agent. See https://github.com/nunit/nunit-console/pull/576 for that implementation and discussion.

If we decide not to support concurrent runners in the same timeframe as moving off of remoting, I will do a simplification PR while we're still using PR before moving forward. It will make reviews significantly easier to follow if we make the remoting interfaces match the design of the replacement protocol before actually replacing it.

It would look something very roughly like the following. Before:

public interface ITestAgent
{
    void Stop();
    ITestEngineRunner CreateRunner(TestPackage package);
}

After:

public interface ITestAgent
{
    void ShutDown(); // Modified, was: void Stop();

    // Copied from from ITestEngineRunner:

    TestEngineResult Load(TestPackage package); // Modified, was: TestEngineResult Load();
    void Unload();
    TestEngineResult Reload();
    int CountTestCases(TestFilter filter);
    TestEngineResult Run(ITestEventListener listener, TestFilter filter);
    TestEngineResult Explore(TestFilter filter);
    // *Eventually* would likely disappear or change form:
    AsyncTestEngineResult RunAsync(ITestEventListener listener, TestFilter filter);
    void StopRun(bool force);
}

jnm2 commented 4 years ago

@ChrisMaddock I haven't heard anything yet. Are you okay with me going ahead with this change, or should we spend more time looking at alternatives?

CharliePoole commented 4 years ago

@jnm2 I didn't notice this when you first put it out, so I'll comment now. I recently spent some time thinking about this myself.

I doubt that creating multiple runners at the same time would really work, simply because we have never used it and never tested it. If we did so suddenly, what are the odds it would work? In fact, you have made me think the agent should probably give an error return if somebody calls for a new runner while an old one is already running.

However, even though we don't now do it, I do believe we want to be able to create a new runner on the same agent sequentially at some point. Right now, when we reload an assembly (something else that isn't really exercised too much) we get a new agent, which implies creating a new process. We do that because the checking necessary to reuse an agent (and process) is a bit burdensome. There is (or at least was) a comment in the TestAgency code: "should we reuse agents?", which probably should have been a paragraph or maybe a document instead.

So I think we need to keep the agent and runner instance separate as it does no harm and isn't even visible to users. All the decisions happen in the process runner, which is also not visible to users through the API. And if we ever did need to reuse agents and/or allow multiple runners, we could go there - albeit by doing a bunch of work.

I think today, after exposure to the modern culture that says "If it's accessible it ought to work," I would probably have made more error checks. Not to late to add those now though, especially where the feature never actually worked. BTW another thing that it looks like you could do but can't is to have multiple engines... but that's another discussion. :smile:

Looking to the future, when we run on multiple machines, I think the reuse of agents will be even more important. Once we have established a onnection with the other machine, we would like to hang on to it. We will want to be able to tell that machine to create additional agents and we will want to reuse those agents to unload and re-crreate runners. The technical client / server role will be reversed but the interface from the point of view of the original engine / host shouldn't have to change.

jnm2 commented 4 years ago

@CharliePoole Thanks. What I'm hearing is that using runners sequentially should be allowed and using them concurrently should not. Right now ITestAgent and ITestAgency are the protocol definition between the two processes. I'd also like to move to the new protocol in small incremental changes.

To represent the new protocol and get close to what it will look like, I think my proposed change above ticks all the boxes we've discussed. At the internal protocol level, rather than calling ITestAgency.CreateRunner(TestPackage) and then calling ITestRunner.Load(), we'd represent that as a single call to ITestAgency.Load(TestPackage). Eventually, the internal protocol will force us to stop thinking in terms of OOP and interfaces and rather in terms of messages over the wire. My proposal above would represent these message types to the agent over the wire:

internal enum AgentWorkerRequestType : byte
{
    ShutDown,
    Load,
    Reload,
    Unload,
    CountTestCases,
    Explore,
    Run,
    StopRun
}

Stepping back from the internal protocol level, the agency API would still provide a ITestRunner CreateRunner(TestPackage) method to ProcessRunner. ProcessRunner would not notice the proposed change. Have you seen what I'd like to do at https://github.com/nunit/nunit-console/pull/762 yet?

jnm2 commented 4 years ago

Decoupling the runner interface from the agent protocol (ITestAgent) and incrementally morphing ITestAgent as an internal implementation detail into the same shape as the internal wire protocol will save me a lot of implementation work and internal protocol complexity. The incremental changes will be easier for everyone to follow.

ChrisMaddock commented 4 years ago

Hey Joseph - Am I understanding right that your proposal would still allow for the use of sequential runners, just not concurrent? In that case, definitely sounds good to me - concurrent runners sounds like it could be 'future enhancement' terriorty if we ever did want to go for it.

Even if you were suggesting to rule out sequential runners (which I don't think you are!) - I'd be tempted to go for that. Let's MVP this for now, and we can always build on it in future. 😄

jnm2 commented 4 years ago

@ChrisMaddock Right, the agency would allow sequential runners to be created. Under the covers, creating a new runner would be implemented by sending the protocol message of calling ITestAgency.Load(TestPackage) (notice the added TestPackage parameter). This would tell the agent to dispose the current runner and create a new one with the given package and load it.

ChrisMaddock commented 4 years ago

Sounds good! :-)

jnm2 commented 4 years ago

One thing I didn't address is if we wanted ITestAgent to be a more general concept rather than a communication implementation detail. Well... that's a lot of up-front work. Remoting makes the cost of chatty protocols invisible, but hand-implementing a protocol puts a new pressure on boiling down all remote calls (wire messages) to the essentials. I'd rather lazily add things to the protocol as needs arise.

ChrisMaddock commented 4 years ago

I'm not sure if we've discussed this before, but one thing I'd really like to have in future is user-extensible test agents, installed in the same way as our current extensions. That would potentially mean I (as a user) could hook up the Xamarin runner, or a Blazor test agent to the console, and test my library across all these platforms, without the need for the engine itself supporting them. (It would also mean we could split off and archive the .NET 2.0 agent, without worrying about breaking changes for people. 😛)

I'm not sure if that idea would influence your design at all...everything I've seen looks pretty geared towards that, but thought it might be worth mentioning. 🙂

jnm2 commented 4 years ago

By being extensible, I'm hoping the design doesn't require much foresight at all. 😊 I'm building the protocol with the assumption that we will be connecting servers and clients of differing versions at some point. This is important if the client is a user-extensible agent. If a command isn't understood, the other end will be able to decide if they want to abort or try a different command.

As far as exposing extensibility points for the engine to start, run, and shut down custom agents, I think we can even work on that independently of the work to replace the protocol.

CharliePoole commented 4 years ago

@jnm2 I hope you won't eliminate the API distinction between agent and runner. They are different objects with different responsibilities. An agent's job is to initialize or find a runner, which is then passed to the caller. It's also there to ensure that runners are properly cleaned up (although we don't do it as well as we should rignt now). Agents are also extensible so you can have different kinds: AppDomain agents, device agents, remote machine agents, etc. without the need to change the runners at all. it's that famous level of indirection that solves every problem. :smile:

Further, if we put the two interfaces together, it's a breaking change to later separate them. But we can keep them separate and readily combine the implementation. For example, in my TCP transport implementation, the same endpoint is used for the created runner as for the engine. The transport object just forwards the calls differently.

Finally, although not the most important consideration, this would create yet another compatibility issue to be added to the TestCentric / NUnit compatibility project, which we are just getting started.

From the original post, it seems like some aspect of the TCP implementation or the wire protocol have motivated you to want to do this. Wire protocols, of course, are very low level and not application-specific at all. Nevertheless, we know that the API requirements can influence what you need from the wire protocol. Can you say more exactly how having two interfaces changes the wire protocl?

CharliePoole commented 4 years ago

@jnm2 PS: No I haven't yet seen that PR. I just got here. :smiley:

CharliePoole commented 4 years ago

@ChrisMaddock @jnm2 IMO the need for extensible or add-on agents, which I've started looking into myself at Chris' suggestion, doesn't drive the communications side of things except in one way... It requires it to be much more flexible than otherwise.

Hypothetically, we could come across a new target for running tests that our existing communications protocol didn't support. That's why I think protocols, message structures and transports need to be selectable initially and pluggable eventually.

I'll state again that what I am talking about has an implementation already in the TestCentric engine. I was waiting till we moved forward with the compatibility project to come back to this, but maybe it would be a good time to look at the fundamental choice between replacing remoting and encapsulating the communications protocol so we can have options.

jnm2 commented 4 years ago

@CharliePoole

I hope you won't eliminate the API distinction between agent and runner.

Nope, if anything I'm trying to make the boundary even more distinct. ITestRunner and its uses would remain unchanged; it's only ITestAgent that is changing and heading towards being an internal detail not exposed outside TestAgency.

Can you say more exactly how having two interfaces changes the wire protocol?

I didn't mean to say this anywhere. My proposal is to leave ITestRunner unchanged and slowly move ITestAgent to be an implementation detail at a lower and rawer level until, at the last step, we swap out remoting with hardly any change because only binary data and no .NET references are being passed back and forth. Things using ITestRunner won't notice a difference.

jnm2 commented 4 years ago

What I would like to leave out of the wire protocol is the ability to use more than one runner at a time. Given that ITestAgent will representing the wire protocol as a way to make the implementation incremental, I need to simplify it to have exactly the same semantics and call patterns as the wire protocol.

CharliePoole commented 4 years ago

Have you published something describing your wire protocol? (If not, cool, as I haven't either.) :smile:

jnm2 commented 4 years ago

Not yet, though the first draft PR was 95% of it. I'm just starting to rewrite from scratch in https://github.com/jnm2/nunit-console/tree/replace_remoting_2/src/NUnitEngine/nunit.engine/Communication.

Do you think it's worth drawing something up based on what I know now, or is it fine to treat the PRs as a way to present a step at a time?

ChrisMaddock commented 4 years ago

Personally, I'm happy with the incremental PRs. I'd be keen for this to be reviewed by more-than-just-me however, so if others would find it helpful... 😄

CharliePoole commented 4 years ago

What's the "first draft PR"? I.e., different from this one? I'm trying to catch up!

jnm2 commented 4 years ago

It's the one from 15 (😱) months ago that you've seen: https://github.com/nunit/nunit-console/pull/576 I'm keeping it as a loose reference.

CharliePoole commented 4 years ago

OK, that didn't go into wire protocols as I recall. At least my involvement was external and high-level. Bear with me.

jnm2 commented 4 years ago

That PR implemented the protocol without documenting it.

In my new branch, I'm making a struct for each message to model it: https://github.com/jnm2/nunit-console/blob/replace_remoting_2/src/NUnitEngine/nunit.engine/Communication/Model/ConnectAsAgentWorkerRequest.cs

But I am not working far ahead this time.

CharliePoole commented 2 years ago

We are now working on 4.0 using a different implementation of the communications protocol from the one discussed here. We need to find out whether or not this issue is obsolete.

We need to test that creating a second runner on the same agent either works (by automatically closing the agent) or gives some sort of error. The choice could be up to each agent, once we implement separately compiled agents.

nunit / nunit-console

Agent protocol design: allow concurrent runners for the same agent? #763