RFC: Deprecating NNG - GitHub Issues

bbros-dev commented 3 years ago

We're carving this out as a separate topic...

Just an FYI: I am in the process of removing the dependence on nng and reworking the Manager-Worker logic on top of raw Tokio sockets, the same as the controllers are currently implemented. The underlying functionality will be the same, though this will allow me to address a number of known Gaggle (distributed loadtest) bugs and increase flexibility.

bbros-dev commented 3 years ago

Hmm, color us skeptical on removing NNG.

Message passing/distributed computing is hard. We cut our teeth on MPI and we've watched RabbitMQ, ZeroMQ, nanomsg, mangos, etc. and their evolution.
A ton of hard-learned lessons have accumulated in those libraries. Most importantly, the things not to do are absent from those libraries, and patterns that are more pain than they are worth are either not supported or difficult to pursue.

Can you point to which of the patterns in these categories Goose uses or is configured for?

We understand the NNG reference manual likely contains updated best practices, but we don't have that and the above are publicly available links to refer to.

If you do decide to replace NNG, can you consider the option of making it a pluggable component? That way people can keep using what they already use.

We don't underestimate the work involved in introducing an abstraction that allows pluggable components.

If a pluggable component architecture proves too onerous, could you consider arranging the code base so that reintroducing nng isn't too much of a chore?

jeremyandrews commented 3 years ago

The NNG dependency adds some complexity to building Goose, which is why it's currently wrapped in a feature - by re-implementing it with Tokio sockets I can remove the feature which simplifies development and testing a bit.

We're using the request-reply pattern. (See here: https://github.com/tag1consulting/goose/blob/main/src/worker.rs#L61 which is documented here: https://docs.rs/nng/1.0.0/nng/enum.Protocol.html#variant.Req0)
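
For readers unfamiliar with the pattern, here is a minimal sketch of request-reply over a plain TCP socket. It is an illustration only: it uses the blocking `std::net` API rather than NNG or Tokio, and the function names (`serve_one`, `request`, `round_trip`) and the `ack:` reply format are hypothetical, not taken from Goose.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Hypothetical Manager side: accept one connection and answer one request.
fn serve_one(listener: TcpListener) -> std::io::Result<()> {
    let (stream, _peer) = listener.accept()?;
    // Read one newline-terminated request from a clone of the socket.
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut request = String::new();
    reader.read_line(&mut request)?;
    // Reply on the same connection (the "ack:" prefix is an assumption).
    let mut writer = stream;
    writeln!(writer, "ack:{}", request.trim_end())?;
    Ok(())
}

/// Hypothetical Worker side: send one request, then block until the reply arrives.
fn request(addr: SocketAddr, body: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    writeln!(stream, "{}", body)?;
    let mut reply = String::new();
    BufReader::new(stream).read_line(&mut reply)?;
    Ok(reply.trim_end().to_string())
}

/// Run a single round trip on a loopback port chosen by the OS.
fn round_trip(body: &str) -> std::io::Result<String> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let server = thread::spawn(move || serve_one(listener));
    let reply = request(addr, body)?;
    server.join().expect("server thread panicked")?;
    Ok(reply)
}
```

Calling `round_trip("worker-1 ready")` sends one line and returns the single reply for it. A messaging library adds framing, retries, and reconnection on top of this basic exchange.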

I recently added a Controller:

The plan is to make it possible to control the Gaggle through the controller socket, where the Workers make a connection to the Manager and then we maintain a reasonably simple state machine.

Fundamentally, a Worker is the same as a Standalone Goose instance, except it also pushes metrics data to the Manager. The other main driver of replacing NNG is to allow the Controller to control a distributed Gaggle, not just a Standalone instance.

We currently support two protocols (telnet and WebSockets) via Traits. So in theory you could plug in a third protocol by implementing these traits. Offhand however I don't think that's flexible enough to re-implement with NNG.

bbros-dev commented 3 years ago

Fundamentally, a Worker is the same as a Standalone Goose instance, except it also pushes metrics data to the Manager

Yes, we were puzzled about why NNG would be causing runtime headaches.
It sounds like the headaches are purely build time. Correct?

We currently support two protocols (telnet and WebSockets) via Traits. So in theory you could plug in a third protocol by implementing these traits. Offhand however I don't think that's flexible enough to re-implement with NNG.

Not sure I understand. I was asking about making the messaging/distributed-communication component pluggable. The controllers would simply have their messages handed over and passed by whatever messaging plugin was in place.
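
To make the idea concrete, a pluggable messaging component could look roughly like the sketch below. Everything here is hypothetical: the `Transport` trait, `LoopbackTransport`, and `forward_command` are invented for illustration and are not Goose or NNG APIs.

```rust
use std::collections::VecDeque;

/// Hypothetical abstraction over the messaging layer (not a Goose or NNG API).
trait Transport {
    /// Hand a message to the messaging layer.
    fn send(&mut self, msg: Vec<u8>) -> Result<(), String>;
    /// Take the next pending message, if any.
    fn recv(&mut self) -> Option<Vec<u8>>;
}

/// In-memory stand-in; an NNG-backed or Tokio-backed type could
/// implement the same trait.
#[derive(Default)]
struct LoopbackTransport {
    queue: VecDeque<Vec<u8>>,
}

impl Transport for LoopbackTransport {
    fn send(&mut self, msg: Vec<u8>) -> Result<(), String> {
        self.queue.push_back(msg);
        Ok(())
    }

    fn recv(&mut self) -> Option<Vec<u8>> {
        self.queue.pop_front()
    }
}

/// The caller depends only on the trait, so the backend can be swapped.
fn forward_command(transport: &mut dyn Transport, command: &str) -> Result<(), String> {
    transport.send(command.as_bytes().to_vec())
}
```

With this shape, the controller code calls `forward_command` and never names the concrete messaging library.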

We had no real headaches building in gaggle, and what hiccups there were are purely documentation issues. We can envision Gaggle use cases becoming quite demanding - but we cannot imagine them exceeding the capabilities of nng, nanomsg, etc.

Is this, distributed computing/messaging systems, really your sweet spot? We don't know.

We hope you aren't offended by this, we do appreciate all the work and effort put in: But, if this was being led by Martin Sustrik or Garrett D'Amore we would say no worries - however we'd still be hesitant and stick with the current release until it generated errors the new version fixed.

Can you point to the issue reports that have NNG as the root cause, and at least a discussion of how what's being proposed with Tokio would fix that?

Right now, we're very nervous about ripping out a foundational piece of battle tested fabric (that gives considerable head-room), for something home brewed.

bbros-dev commented 3 years ago

The plan is to make it possible to control the Gaggle through the controller socket, where the Workers make a connection to the Manager and then we maintain a reasonably simple state machine.

Sounds great. The controller appears to be local only right now, and it seems one of the goals/rationales for NNG was to make custom protocols less painful to address.
From here:

The other, most critical, motivation behind NNG was to enable an easier creation of new transports

bbros-dev commented 3 years ago

then we maintain a reasonably simple state machine.

We'd advocate going the other direction... of trying to make the Gaggle as stateless as possible. As we indicated in the discussion around CO-mitigation (comment link to come), the challenge posed there was to ensure Requests were being made at fixed intervals. Specifically, there you don't want every member of the gaggle starting at exactly the same time, but rather staggered at some known interval. The choices as we saw them were (from memory): 1) some messaging system 2) some agreed convention.

The more we get into this the more we think the end state (pun intended) to aim for is a Gaggle that is as stateless as possible. And state related to the status of the underlying messaging fabric should be handled by the messaging library.

jeremyandrews commented 3 years ago

It sounds like the headaches are purely build time. Correct?

Correct, from this blog: "On most linux distributions you have to add cmake and openssl-dev."

(And for a time it was impossible to build the Gaggle feature on Apple's M1 Silicon arm64 architecture: fortunately this was fixed with NNG 1.0)

We'd advocate going the other direction... of trying to make the Gaggle as stateless as possible.

Yes, as noted above, Workers are primarily Standalone load tests that are sharing metrics with the Manager. But with the Controller they still have to maintain a general state of the load test, what is internally referred to as the AttackPhase.
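
A small state machine of this kind can be sketched as an enum plus a transition function. The `Phase` and `Event` variants below are hypothetical and chosen for illustration; Goose's actual AttackPhase variants may differ.

```rust
/// Hypothetical load test phases; Goose's real AttackPhase may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Idle,
    Starting,
    Running,
    Stopping,
}

/// Hypothetical events: commands from the Manager or local progress.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    Start,
    AllUsersLaunched,
    Stop,
    Stopped,
}

/// Return the next phase, or None if the event is invalid in this phase.
fn transition(phase: Phase, event: Event) -> Option<Phase> {
    match (phase, event) {
        (Phase::Idle, Event::Start) => Some(Phase::Starting),
        (Phase::Starting, Event::AllUsersLaunched) => Some(Phase::Running),
        (Phase::Starting | Phase::Running, Event::Stop) => Some(Phase::Stopping),
        (Phase::Stopping, Event::Stopped) => Some(Phase::Idle),
        _ => None,
    }
}
```

Returning `None` for invalid combinations keeps the Worker from acting on a command that does not apply to its current phase, for example a stop request while idle.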

to ensure Requests were being made at fixed intervals

Coordinating the timing of Requests across a distributed Gaggle is not on my roadmap FWIW: the intent is that while the Manager can Start/Stop/Configure/Shut-down the Worker, when actually running a load test the Workers are essentially standalone and therefore not impacted by the communication channel with the Manager.

Performing a staggered start is in scope. Maintaining a staggered request timing is not.
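
One simple way to stagger starts is by agreed convention: each Worker waits its index times a fixed gap before beginning. The sketch below assumes each Worker knows a 0-based index; the function names and that indexing scheme are hypothetical, not Goose's implementation.

```rust
use std::time::Duration;

/// Delay before Worker `index` (0-based) starts, given a fixed gap
/// between consecutive Workers.
fn start_delay(index: u32, gap: Duration) -> Duration {
    gap * index
}

/// Start delays for every Worker in a Gaggle of `workers` members.
fn stagger_schedule(workers: u32, gap: Duration) -> Vec<Duration> {
    (0..workers).map(|i| start_delay(i, gap)).collect()
}
```

For three Workers and a 250 ms gap this yields delays of 0 ms, 250 ms, and 500 ms. Once started, each Worker runs independently, which matches the scope described above.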

Right now, we're very nervous about ripping out a foundational piece of battle tested fabric (that gives considerable head-room), for something home brewed.

Fair. We'll see how it goes, it's possible as I find the time to work on this I may conclude we're better off working with NNG. I certainly don't deny the rich history and experience that it brings.

That said, the only significant traffic in a Gaggle is pushing Metrics from the Workers to the Manager (and this is fundamentally what is already happening within Goose as well, since each GooseUser passes Metrics up to the parent process for aggregation and reporting).
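
The push-and-aggregate flow can be sketched with threads and a channel standing in for Workers and the network. This is an assumption-laden illustration: `RequestMetric`, `aggregate`, and the fixed 10 ms response time are hypothetical, and Goose's real metrics types are richer.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical metrics message; Goose's real metrics types are richer.
struct RequestMetric {
    path: String,
    response_ms: u64,
}

/// Spawn `workers` threads that each push `per_worker` metrics, then
/// aggregate (request count, total response time in ms) per path.
fn aggregate(workers: usize, per_worker: usize) -> HashMap<String, (usize, u64)> {
    let (tx, rx) = mpsc::channel::<RequestMetric>();
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // Fixed 10 ms response time, purely for illustration.
                    let metric = RequestMetric { path: "/".to_string(), response_ms: 10 };
                    tx.send(metric).expect("receiver dropped");
                }
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends once all workers finish.
    drop(tx);

    let mut totals: HashMap<String, (usize, u64)> = HashMap::new();
    for metric in rx {
        let entry = totals.entry(metric.path).or_insert((0, 0));
        entry.0 += 1;
        entry.1 += metric.response_ms;
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    totals
}
```

The receiving side only sums what arrives, so it needs no knowledge of what each sender is doing at any given moment.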

bbros-dev commented 3 years ago

It sounds like the headaches are purely build time. Correct?

Correct, from this blog:

Good to know. In our experience, these sorts of issues tend to be transitory.

Performing a staggered start is in scope. Maintaining a staggered request timing is not.

No problem. For a CO fix the effect of an intra-user schedule will dominate by orders of magnitude the effect of any inter-user schedule.

Right now, we're very nervous about ripping out a foundational piece of battle tested fabric (that gives considerable head-room), for something home brewed.

Fair. We'll see how it goes,

Thanks.

That said, the only significant traffic in a Gaggle is pushing Metrics from the Workers to the Manager

Understood. We can envision long-running processes monitoring SLAs, and the connection recovery management isn't straightforward.
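
As a rough illustration of one piece of connection recovery, the sketch below retries a connection attempt with capped exponential backoff. The names `backoff_delay` and `connect_with_retry` are hypothetical, and real recovery also has to deal with in-flight messages and half-open connections, which this does not attempt.

```rust
use std::time::Duration;

/// Exponential backoff: base * 2^attempt, capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Retry `connect` until it succeeds or `max_attempts` attempts have failed.
/// At least one attempt is always made.
fn connect_with_retry<T, E>(
    mut connect: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base: Duration,
    max: Duration,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match connect() {
            Ok(conn) => return Ok(conn),
            // Give up and surface the last error once the budget is spent.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

The closure can wrap any connection call, for example `TcpStream::connect`, so the retry policy stays separate from the transport in use.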

tag1consulting / goose

RFC: Deprecating NNG #290