grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

[RFC] Grafana Agent flow mode plugins #350

Open rfratto opened 1 year ago

rfratto commented 1 year ago

The most recent copy of this proposal can be found on Google docs. The below is the original version of this proposal for posterity.

NOTE: This proposal is likely being written 8-12 months too soon. However, plugins are likely to be the last foundational piece of the future of Grafana Agent (following flow mode and modules). As the last foundational piece, it is important to get alignment on future plans early so we do not take any actions that go against the long-term goals.

If you are reading this and are excited for plugins, do not expect rapid movement or delivery any time soon. The best case scenario is that plugins are made generally available in a 2025 release.

Background

Currently, new capabilities to Grafana Agent Flow can only be added by contributing a new component to the official Git repository (https://github.com/grafana/agent).

Having a centralized repository of components makes it harder for an open source project to thrive:

Other projects, such as the OpenTelemetry Collector, solve this problem by having different distributions of the collector. While distributions solve the issues above, they also fragment the community, as different distributions may have different subsets of components, making migration between distributions difficult.

I propose that we support a plugin system for Grafana Agent flow mode, allowing sets of components to be provided by external plugins which can be loaded at runtime into the Grafana Agent process.

This proposal is a high-level outline of plugins, intended to achieve maintainer and community consensus on the long-term goals.

Goals

Non-goals

Proposal

Flow mode should introduce the concept of a "plugin," where a plugin is some loadable code that provides one or more components that can be defined in a Flow configuration.
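To make "loadable code that provides one or more components" concrete, here is a minimal Go sketch of the surface a plugin might present to the host. Every identifier below is invented for illustration; none of this is an existing agent API:

// Hypothetical host-facing surface of a plugin: a namespace plus a set
// of named component constructors. All names are invented for this sketch.
package pluginapi

import "context"

// Component is a simplified stand-in for a Flow component: it is built
// from decoded arguments and runs until its context is canceled.
type Component interface {
	Run(ctx context.Context) error
	Update(args any) error
}

// Registration describes one component a plugin provides.
type Registration struct {
	Name  string // e.g. "receiver.otlp"
	Build func(args any) (Component, error)
}

// Plugin is the contract a loadable plugin would fulfill.
type Plugin interface {
	// Namespace is the prefix components are exposed under, e.g. "otelcol".
	Namespace() string
	// Components enumerates everything the plugin provides.
	Components() []Registration
}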

The mechanisms through which plugins are created, retrieved, and defined are not in scope for this proposal:

Requirements

These are the high-level requirements that plugins must meet:

If it is not possible for us to create a plugin system which meets these two requirements, we should consider abandoning the plugin model.

Performance

The biggest concern with plugins is performance of component communication. Today, communication between two components (such as prometheus.scrape sending metrics to prometheus.remote_write) is largely achieved using shared memory, as it's internally represented by a native function call.
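As a rough illustration of that status quo, the toy model below shows a downstream component exporting a plain function value that an upstream component invokes directly, so data never leaves process memory. (The real agent wires richer values, such as Prometheus storage.Appendable, between components; this is deliberately simplified.)

package main

import "fmt"

// Sample is a toy stand-in for a metric sample.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// AppendFunc is the hook a downstream component exports by reference.
type AppendFunc func(Sample) error

func main() {
	// "remote_write" side: receives samples via a native function call.
	var received []Sample
	appendSample := AppendFunc(func(s Sample) error {
		received = append(received, s)
		return nil
	})

	// "scrape" side: hands a sample over by calling the function directly;
	// nothing is copied or serialized, both sides share the same memory.
	_ = appendSample(Sample{
		Labels: map[string]string{"job": "demo"},
		Value:  1.0,
	})

	fmt.Println("samples delivered in-process:", len(received))
}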

However, plugin communication is unlikely to be able to use shared memory; the only mechanism that offers shared memory is Go's plugin package, which doesn't support Windows, making it OS-dependent. All other potential communication mechanisms will involve some form of message marshaling and unmarshaling between running plugins and the plugin host (Grafana Agent).

Before full development on plugins begins, a proof of concept is needed that measures the overhead of message marshaling and unmarshaling, to prove or disprove the viability of plugins.
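A starting point for that proof of concept could be a micro-benchmark comparing a direct function call against the same hand-off with a marshal/unmarshal round trip. The sketch below uses encoding/gob from the standard library purely as a stand-in; a real plugin transport would have its own encoding, but the shape of the measurement is the same:

package poc

import (
	"bytes"
	"encoding/gob"
	"testing"
)

// Batch is a toy payload standing in for a batch of telemetry data.
type Batch struct {
	Labels []string
	Values []float64
}

var sink Batch // keeps the compiler from optimizing the hand-off away

// Baseline: handing a batch to another component by native function call.
func BenchmarkDirectCall(b *testing.B) {
	in := Batch{Labels: []string{"job", "demo"}, Values: make([]float64, 100)}
	recv := func(x Batch) { sink = x }
	for i := 0; i < b.N; i++ {
		recv(in)
	}
}

// Candidate: the same hand-off with a marshal/unmarshal round trip,
// approximating the cost of crossing a plugin boundary.
func BenchmarkGobRoundTrip(b *testing.B) {
	in := Batch{Labels: []string{"job", "demo"}, Values: make([]float64, 100)}
	for i := 0; i < b.N; i++ {
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(in); err != nil {
			b.Fatal(err)
		}
		if err := gob.NewDecoder(&buf).Decode(&sink); err != nil {
			b.Fatal(err)
		}
	}
}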

Sub-proposals

For plugins to be fully realized, we need at least these five proposals to build on top of this one:

These proposals will likely be written by different people over a long period of time. The first proposal, plugin component communication, is a prerequisite for all of the others, as it will prove or disprove whether plugins can be performant.

Delivery plan

Assuming plugins are viable, they will be delivered in four phases:

rfratto commented 1 year ago

Plugin component communication: explore the mechanism through which messages are sent between components across plugins for tasks such as sending telemetry data, and whether it's possible for plugins to be performant.

Plugins have been on my mind a lot, especially whether they're even viable.

I would like to personally write this next proposal, but I would like to see others take on the other four (though I'll still want to be involved to some extent).

captncraig commented 1 year ago

A few thoughts and concerns:

Development Experience

One of the key selling points of the flow philosophy is that a component is a single Go package that self-registers and is a relatively standalone, testable piece of code. I wouldn't want plugins to introduce a different experience. If a developer had to choose early in the process whether they were making a plugin or a compiled component, that would be undesirable. Ideally, you should be able to import the exact same Go package without modifications and run it as a plugin the same as if it were compiled in, for maximum portability and flexibility. That may be unrealistic, but I think it should be a goal. I would hate to fracture a very young open-source community into vastly different runtime modes (which is why Lua would be a very hard sell for me).

Living without plugins

The status quo is that all of the current components are compiled into the agent binary. The self-registration mechanism makes that really nice because you can import with _ and be done with it. This proposal lists a few "political" reasons components would not be included in the main repo, but none of those preclude somebody from compiling the agent themselves with a custom imports file.

Since plugins will almost certainly have a performance cost, I'd argue they need to have a significant ease-of-use benefit over the current paradigm to be worth it. I remain skeptical that any of the currently available solutions for Go will do that, but I'd love to be proven wrong.

We could alternatively dedicate time to normalizing and facilitating the creation of custom agent binaries with arbitrary combinations of component packages. I made a proof of concept for personal use; it has some rough edges, but it was not intolerable. With some docs and maybe some tooling, we could make it pretty easy for somebody to create an agent with (or without) whatever components they want.
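For reference, the custom-binary approach being described looks roughly like the sketch below: a small main package whose blank imports trigger each component's init-time self-registration, so the build contains exactly the components you import. The import paths are approximate and the runtime wiring is elided:

// Illustrative custom agent binary; import paths are approximate.
package main

import (
	// Blank imports pull in only the components this build should contain;
	// each package registers itself in its init function.
	_ "github.com/grafana/agent/component/prometheus/remotewrite"
	_ "github.com/grafana/agent/component/prometheus/scrape"
)

func main() {
	// ... start the Flow runtime here (omitted) ...
}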

rfratto commented 1 year ago

Ideally, you should be able to import the exact same Go package without modifications and run it as a plugin the same as if it were compiled in, for maximum portability and flexibility. That may be unrealistic, but I think it should be a goal.

This is a goal I share, but it's not discussed in this design doc since I don't go over the API at all. It should be possible, but it may require new restrictions on a component's API.

In particular, if plugins are built using WASM or system binaries, exporting interfaces introduces a new challenge. The plugin engine would need to be able to provide some value for an interface across plugins, but Go doesn't allow interfaces to be built at runtime. A workaround for this is to introduce some kind of code generation to build interface implementations, but I don't know if that's something we'd want to do, since it complicates the build process.

It is, however, possible to build functions at runtime. If we were to restrict our existing APIs such that you could only export structs of functions, then plugins would be able to work as native components do today, and both native components and plugins would be built exactly the same.
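To sketch the distinction with invented types: Go cannot synthesize an implementation of an interface at runtime, but it can build function values at runtime, so a struct of function fields can be populated with forwarding stubs on the host side:

package export

// Sample is an invented payload type for this sketch.
type Sample struct {
	Name  string
	Value float64
}

// Interface-style export: the host cannot construct an implementation of
// this at runtime for a component living in another plugin.
type Appender interface {
	Append(Sample) error
}

// Struct-of-functions export: each field is a plain func value, which can
// be built at runtime.
type AppenderFuncs struct {
	Append func(Sample) error
}

// newForwardingStub shows the shape of a runtime-built implementation:
// the host fills each function field with a closure that marshals the
// call and sends it to the plugin that really implements it.
func newForwardingStub(send func(method string, payload any) error) AppenderFuncs {
	return AppenderFuncs{
		Append: func(s Sample) error { return send("Append", s) },
	}
}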

Since plugins will almost certainly have a performance cost, I'd argue they need to have a significant ease-of-use benefit over the current paradigm to be worth it. I remain skeptical that any of the currently available solutions for Go will do that, but I'd love to be proven wrong.

Developing a component in a plugin will not be easier than the current paradigm, but it won't be harder either.

However, plugins solve important problems that we've been facing:

We could alternatively dedicate time to normalizing and facilitating the creation of custom agent binaries with arbitrary combinations of component packages

This is exactly what the RFC is arguing that we shouldn't do:

Other projects, such as the OpenTelemetry Collector, solve this problem by having different distributions of the collector. While distributions solve the issues above, they also fragment the community, as different distributions may have different subsets of components, making migration between distributions difficult.

I propose that we support a plugin system for Grafana Agent flow mode, allowing sets of components to be provided by external plugins which can be loaded at runtime into the Grafana Agent process.

We see this pattern with the OpenTelemetry Collector, and I'm overall not a fan of the distribution-type model for the reason above. The scenario of "this distribution doesn't have a component I want, so I have to fork it or beg the maintainers to add it in" can be seen as user-hostile.

Do you have counterarguments for why adopting a distribution model is better than a plugin model?

proffalken commented 1 year ago

I'm coming at this from a slightly different angle, having recently switched to Grafana Agent after using either vector.dev or one of the various distros of the OTEL collector for the past couple of years.

The first thing I think is important to note is that having a single binary with all the "plugins" installed into it (whether they are the plugins being proposed, existing components, or a combination of both) is actually really handy for most users: it means they don't have to worry about whether they've compiled the correct code and can just deploy a single binary/container. This is not, however, advocacy for keeping the status quo, and the point about OTEL distros is very valid!

I have always appreciated the DataDog approach to plugins, which boils down to "You want to install it easily? You contribute upstream. You want something specific to you? Drop the code into this directory and we'll pick it up, but we won't support it". This approach gives the flexibility of custom plugins whilst maintaining plugin quality in the "core" repo.

The idea that a user could develop a plugin locally, run it on their own platform, and then contribute to "core" if they wanted to is a nice pattern, and it even allows folks to release plugins under their own github/NPM/whatever repo and have Grafana Agent "pick it up" from a directory on the filesystem if they want to use another license.

It does, however, mean that the agent needs to be able to load from disk at launch, and potentially be able to "reload" everything from disk whilst running, depending on how advanced we want to make it.
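For the "drop the code into this directory" flavor, the standard library's plugin package hints at what load-at-launch could look like, with the Windows caveat noted earlier in the thread (it supports only Linux, FreeBSD, and macOS). This sketch also assumes, purely for illustration, that each shared object exports a Register func() error symbol:

package main

import (
	"log"
	"path/filepath"
	"plugin"
)

// loadPluginsFrom opens every shared object in dir and invokes its
// Register symbol, in the spirit of the "drop it in a directory,
// unsupported but loaded" approach described above.
func loadPluginsFrom(dir string) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.so"))
	if err != nil {
		log.Fatal(err)
	}
	for _, path := range paths {
		p, err := plugin.Open(path)
		if err != nil {
			log.Printf("skipping %s: %v", path, err)
			continue
		}
		sym, err := p.Lookup("Register")
		if err != nil {
			log.Printf("skipping %s: no Register symbol", path)
			continue
		}
		register, ok := sym.(func() error)
		if !ok {
			log.Printf("skipping %s: unexpected Register signature", path)
			continue
		}
		if err := register(); err != nil {
			log.Printf("plugin %s failed to register: %v", path, err)
		}
	}
}

func main() {
	loadPluginsFrom("/etc/agent/plugins.d") // hypothetical directory
}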

Not sure if that makes sense, so ask any questions and I'll do my best to clarify! :laughing:

oferziss-armis commented 1 year ago

I'll give my 2 cents here as this kinda hit a soft spot for me recently.

One could argue that the approach OpenTelemetry took with the official and contrib distros of their collector allowed vendors to provide support for their own specific platforms, making the collector as agnostic as can be. But in the reality of things, the freedom of users was in a way greatly reduced.

Pros

Allowing plugins and extensions to be easily added opened up the community, both individual OSS developers and vendors, to extend the functionality. It preserved the ability to keep "core" support on the official collector and not be "cornered" into giving out a guarantee of support for components not developed and approved by the core community.

Cons

Allowing vendors to develop plugins which provide custom support for their platforms allows the vendors to start implementing logic and requirements which diverge from the OTel spec. This causes some issues, as the vendor now requires the users to implement custom logic in their systems, which essentially creates a sort of vendor lock-in.

My example is simple: a vendor I am using requires OTel signals to be exported with 4 specific headers which indicate the classification of the signals sent and are used to index them. These headers cannot be given dynamic values based on the signal being sent, so I'm left having to use the vendor's distro of the OTel collector contrib, which knows how to deduce the headers from the signal being passed through the exporter. This makes it harder to adopt different agents, as not all vendors are happy to develop support for every shipping agent.

Keeping things as close to the standard as possible is always a good idea from the user's perspective, as it preserves the freedom to choose the solution which best fits their own specific needs.

Between this RFC and making the River language as complete as possible, I would choose the latter anytime.

With that said, I completely agree with @proffalken about always keeping stuff as a single binary. If I need to compile the agent on my CI, it makes life so much harder, as I need to keep track of changes in the agent's build process instead of simply downloading a released version and running it. In that sense, creating a contrib distro of the agent and providing an easy interface for devs to add components while keeping up with upstream is the best way to go for these types of things, IMHO.

tpaschalis commented 1 year ago

I'm excited to see how this plays out; plugins sound like an exciting approach to a more modular Agent in the future! 👀

One thing that sticks out is the "Ability to migrate existing components," and whether this should be a hard requirement. I feel the added value of plugins might be enough even if we cannot migrate all current components. For example, if the performance overhead was a bit too much for, say, prometheus.scrape to be usable as a plugin, I don't think that should stop us from using other components as plugins.

Are you worried that components in the class that comes with a performance-related warning would feel like second-class citizens?

rfratto commented 1 year ago

Addressing some of the comments here:

This proposal was very high level, so it probably didn't do a great job of helping people envision what plugins could potentially be.

I could imagine adding something like this to a Flow config:

// Imports components from the plugin in the "otelcol" namespace.
plugin "otelcol" {
  url     = "github.com/grafana/flow-plugin-opentelemetry-collector"
  version = "1.0.1" 
}

// Imports components from the plugin in the "older_otelcol" namespace.
plugin "older_otelcol" {
  url     = "github.com/grafana/flow-plugin-opentelemetry-collector"
  version = "0.8.5" 
}

otelcol.receiver.otlp "default" { 
  http {}
  grpc {}

  output {
    traces = [older_otelcol.exporter.otlp.default] 
  }
}

older_otelcol.exporter.otlp "default" { ... }

This hypothetical design has a few interesting attributes:

This is just a sketch, and I'm not sure what the final proposal would look like, but I do not want plugins to require people to recompile the agent.

cc @proffalken

Allowing vendors to develop plugins which provide custom support for their platforms allows the vendors to start implementing logic and requirements which diverge from the OTel spec.

IMO, this is a good thing. Locking components into only doing OpenTelemetry will cause progress to be bottlenecked by when OpenTelemetry adopts a change. By necessity of attempting to be a global standard for all telemetry data, OpenTelemetry will be slower to adopt new additions, as it needs to be careful.

For example, the pyroscope.* components in Flow do not use OpenTelemetry, since OpenTelemetry is still in the process of adopting a spec for profiles.

I don't want Flow to be limited to only OpenTelemetry components, and we don't even do that today; we have multiple sets of components from different ecosystems (prometheus.*, loki.*, pyroscope.*, otelcol.*, discovery.*).

I also don't want to limit plugins to only dealing with telemetry data. If someone wants to write a plugin with a component that provisions infrastructure, they should be free to do so.

in that sense creating a contrib distro of the agent and providing an easy interface for devs to add components while keeping up with the upstream is the best way to go for these types of things IMHO.

Unfortunately, the -contrib approach really doesn't fix the problems the maintainers are facing today, as I mentioned earlier, specifically the one around dependency hell. If someone wants ~all the components, they will struggle to keep their distribution up to date.

Plugins will be a challenge to implement, but I think they will give users much more flexibility around which components are used, and prevent community fragmentation, as there will be only one official binary of Flow, with many different plugins providing different components to use.

cc @oferziss-armis

Are you worried that components in the class that comes with a performance-related warning would feel like second-class citizens?

Yes, and I also don't want to play favorites :) It would feel weird to me personally if we said "prometheus, otelcol, loki, pyroscope all get to stay in core for performance but everything else must be a plugin," especially since two of those are Grafana Labs products. We can make Flow be an open platform, but it means playing on the same field as everyone else.

I would prefer us to measure the overall impact of plugins and try our best to make the overhead as small as possible so we can become that open platform.

cc @tpaschalis

ptodev commented 11 months ago

I think we should first decide what the user experience should be. For example:

The other proposals should be based on that user experience goal. However, I am not sure whether this should fall within this proposal or a sub-proposal.

rfratto commented 11 months ago

At this point, we're not sure what the technical limitations of plugins are. That will drive what we're going to be able to deliver, which may change what we end up exposing to end users.

While I'd normally agree to start from the user experience, I think this is a problem where some (but not all) technical information needs to be figured out first.

srclosson commented 4 months ago

Adding my 2 cents since I'm interested. I would love to see something where we could easily import or use Telegraf plugins, since there are so many... either linked as a Go plugin, or referenced in code? Not sure...