open-telemetry / opentelemetry-collector

Component instancing is complicated. Is it really necessary? #10534

Open djaglowski opened 1 month ago

djaglowski commented 1 month ago

Is your feature request related to a problem? Please describe.

Users define component.IDs in their collector configurations using the format type[/name], where type is, for example, otlp or filelog, and name is an additional, arbitrary string that helps identify a unique configuration.

When the collector runs with a given configuration, it will often create and run multiple instances of a given component. The rules that govern how instancing works are not well documented, but I have previously described them in detail here.

In short:

  - Receivers and exporters are instanced per data type.
  - Processors are instanced per pipeline.
  - Connectors are instanced per ordered pair of data types.
  - Extensions are instanced once per configuration.

I believe that instancing solves some problems but introduces others. Ultimately, I believe we may be better off removing the notion of instancing as described below. Although we are working aggressively towards 1.0 of the collector, I think we should seriously consider this change because it would substantially improve our ability to meet one requirement of 1.0 (self-observability).


Terminology

Even discussing this issue is difficult because we do not have clear terminology, so I will introduce some terms here that are useful for the rest of the discussion:

A "Component Configuration" is an element which a user defines within the receivers, processors, exporters, connectors, or extensions section. For example, in the following configuration, there are 3 Component Configurations (otlp/in, batch and otlp/out):

receivers:
  otlp/in:
processors:
  batch:
exporters:
  otlp/out:

A "Component Instance" is a corresponding struct which the collector instantiates internally. The following configuration will result in more than 3 Component Instances:

service:
  pipelines:
    logs/1:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    logs/2:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]
    traces/1:
      receivers: [ otlp/in ]
      processors: [ batch ]
      exporters: [ otlp/out ]

If you doubt that our current instancing behavior is confusing, consider whether you can easily answer the following:

  1. How many Component Instances are instantiated for the above configuration?
  2. For each instance, what is an appropriate set of attributes that allows a user to understand that a given piece of telemetry originated from that instance?

(my answer here)
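Applying the rules described later in this issue (processors instanced per pipeline; receivers and exporters instanced per data type), a rough tally for the example above looks like the following, sketched as a Go snippet purely for illustration rather than as an authoritative answer:

// Hypothetical tally for the configuration above, under the current instancing rules.
const (
	otlpInInstances  = 2 // one logs instance shared by logs/1 and logs/2, plus one traces instance
	batchInstances   = 3 // one per pipeline: logs/1, logs/2, traces/1
	otlpOutInstances = 2 // one logs instance, one traces instance
	totalInstances   = otlpInInstances + batchInstances + otlpOutInstances // 7 Component Instances from 3 Component Configurations
)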

Problem 1 - Observability

Observability is all about understanding the internal state of a system. As an observability tool, it is critical that the collector be a good exemplar of an observable system. However, our current notion of instancing is difficult to understand and therefore makes the collector less observable than it should be.

Users should be able to understand when telemetry describes a specific Component Instance but we do not currently have an externally defined schema for this. In fact, proper identity is dependent on both the class (receiver/processor/exporter/connector) and type (otlp/filelog) of component so it may not even be possible to define a static schema for identity attributes.

Additionally, users should be able to understand the relationship between a Component Configuration and its associated Component Instance(s). Even if the identity problem were reasonably well solved, any attempt to communicate with users about a specific Instance is muddled by the rules that define these relationships. Even a simple configuration such as the example above may require a deep understanding of collector internals in order to understand the number of Component Instances, how they each relate to a Component Configuration, and which pipeline or pipelines contain which Component Instances.

Problem 2 - Maintainability

Identifying and managing Component Instances alongside Component Configurations is difficult to do even within the collector codebase.

The effort to design and implement a robust component status model was largely complicated by the fact that distinct Component Instances that each correspond to the same Component Configuration may be in different states. For example, for an exporter which pushes logs to one endpoint and traces to another, it should be very clear if one is healthy while the other is not.

Additionally, having worked on some early prototypes of a hot-reload capability, I recently discovered it is very difficult to accomplish something as simple as getting a handle to the set of Component Instances that were instantiated from a given Component Configuration. (Once the collector is started, if we want to apply a new config which only changes a single Component Configuration, we need to get a handle to the corresponding Component Instances before we can take any meaningful action.)
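As a minimal sketch of the bookkeeping involved (instanceKey below is purely illustrative, not a real collector type), compare what identifying an instance requires today against what it would require if each Component Configuration were instantiated exactly once:

package example

import "go.opentelemetry.io/collector/component"

// Today, a Component Configuration's ID alone does not identify an instance; roughly
// speaking, the pipeline and/or data type the instance was built for is also needed.
type instanceKey struct {
	cfgID    component.ID // the Component Configuration, e.g. otlp/out
	pipeline string       // the pipeline the instance was built for (processors)
	signal   string       // the data type the instance serves (receivers/exporters)
}

var currentInstances map[instanceKey]component.Component

// Under one-instance-per-configuration, the configuration ID is the identity, so
// mapping a Component Configuration to its Component Instance becomes trivial.
var proposedInstances map[component.ID]component.Component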

Describe the solution you'd like

I think we should seriously consider whether or not component instancing is actually a net positive, and what it would look like if each Component Configuration were instantiated exactly once. In order to determine whether this model would be an improvement, it is necessary to consider the reasons why we currently create Component Instances.

Processors

Processors are currently instanced in a unique way relative to other components. Every pipeline is given its own instance of each processor in order to ensure that pipelines do not overlap. For example:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ batch ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ batch ]
    exporters: [ otlp/out/2 ]

If batch were a single instance, then the inputs from both pipelines would all flow into the same processor, and the output of that processor would flow to both exporters. Essentially it would be the same as:

pipelines:
  logs:
    receivers: [ otlp/in/1, otlp/in/2 ]
    processors: [ batch ]
    exporters: [ otlp/out/1, otlp/out/2 ]

However, users could very easily explicitly configure their pipelines to remain separate by declaring two separate batch processors:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ batch/1 ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ batch/2 ]
    exporters: [ otlp/out/2 ]

Importantly, we would lose one clear benefit: a single processor configuration could no longer be reused across pipelines. In other words, the user would have to define batch/1 and batch/2 separately in the processors section. This may be a small problem in many cases, but for complex processor configurations it could lead to a lot of duplicated configuration. Users can instead rely on YAML anchors if necessary. See Explicitly Instanced Configurations below for an explanation of how this could otherwise be resolved.

A notable implementation detail for this design is that the service graph would need to manage fanouts and capabilities for every consumer. This is certainly doable and might actually be less complicated, because we could have one generic way to calculate these factors, versus the current approach, which applies one set of logic immediately after receivers and another immediately before exporters.
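As a rough illustration, here is a minimal sketch of a generic logs fanout that aggregates capabilities across its consumers. It is not the collector's actual fanoutconsumer implementation, and it omits the cloning that mutating consumers require:

package example

import (
	"context"
	"errors"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
)

// logsFanout forwards data from one upstream component to every downstream consumer.
type logsFanout struct {
	consumers []consumer.Logs
}

// Capabilities reports MutatesData if any downstream consumer mutates data, so
// upstream components know whether they must hand over a copy.
func (f *logsFanout) Capabilities() consumer.Capabilities {
	caps := consumer.Capabilities{}
	for _, c := range f.consumers {
		caps.MutatesData = caps.MutatesData || c.Capabilities().MutatesData
	}
	return caps
}

// ConsumeLogs delivers the payload to every downstream consumer and joins any errors.
func (f *logsFanout) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	var errs error
	for _, c := range f.consumers {
		errs = errors.Join(errs, c.ConsumeLogs(ctx, ld))
	}
	return errs
}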

Receivers & Exporters

Receivers and exporters are currently instanced per data type. I believe the primary reason for this is to protect against order-of-instantiation bugs. For example, consider the following pseudocode:

// Current pattern
factory := GetFactory("foo")
logsReceiver := factory.CreateLogsReceiver(...)
tracesReceiver := factory.CreateTracesReceiver(...)

// Possible alternative pattern
factory := GetFactory("foo")
receiver := factory.CreateReceiver(WithLogs(...), WithTraces(...))

In the alternative pattern there is a chance that, during instantiation of the component, the setup for each additional data type may introduce incorrect interactions with the state of the struct. In my opinion, this is the biggest open question about the feasibility and impact of this approach. That said, components using the sharedcomponent pattern already opt into this behavior, and I don't believe it has proven to be very difficult to work with in those cases, so perhaps this is not much of a concern. If there is sufficient interest in this proposal I will investigate further.
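To make the concern concrete, here is a purely hypothetical sketch of what the alternative pattern could look like for a receiver; CreateReceiver, WithLogs, and WithTraces do not exist in the collector today. The risk described above is that the setup performed for each signal touches the same shared struct state:

package example

import (
	"context"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/consumer"
)

// fooReceiver is a single Component Instance that may serve several data types at once.
type fooReceiver struct {
	logsConsumer   consumer.Logs
	tracesConsumer consumer.Traces
	// Shared state (listeners, client pools, buffers) would be set up once and used by
	// every signal; instancing bugs would come from one signal's setup interfering with
	// another's.
}

// Option wires the next consumer for one signal into the shared instance.
type Option func(*fooReceiver)

func WithLogs(next consumer.Logs) Option     { return func(r *fooReceiver) { r.logsConsumer = next } }
func WithTraces(next consumer.Traces) Option { return func(r *fooReceiver) { r.tracesConsumer = next } }

// CreateReceiver builds exactly one instance regardless of how many signals are enabled.
func CreateReceiver(opts ...Option) *fooReceiver {
	r := &fooReceiver{}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// Start brings up shared resources once and serves whichever consumers were configured.
func (r *fooReceiver) Start(ctx context.Context, host component.Host) error { return nil }

// Shutdown tears down the shared resources for all signals at once.
func (r *fooReceiver) Shutdown(ctx context.Context) error { return nil }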

Connectors

Connectors are currently instanced per ordered pair of data types. I believe the considerations here are inherited from receivers and exporters, and I am not aware of any that are unique to connectors at this point.

Extensions

Extensions already follow the one-instance-per-configuration pattern which I am proposing to apply across the board.

Explicitly Instanced Configurations

As mentioned earlier, one of the immediate drawbacks of removing our current instancing logic is that users could not necessarily reuse configurations for processors. This is already the case for other components, but it is arguably useful functionality nonetheless, so I think we should consider how else to provide it. First, YAML anchors are an existing mechanism we could recommend. However, I believe it would not be difficult to provide users a native alternative.

In short, we can make all Component Configurations reusable by allowing users to explicitly instance components within the pipeline definitions. I believe this would require us to designate one additional special character for this syntax. As a placeholder I'll use # but obviously we would need to consider this more carefully. A complete example:

receivers:
  otlp/in/1:
  otlp/in/2:
processors:
  transform:
    # lots of statements I don't want to repeat
exporters:
  otlp/out/1:
  otlp/out/2:

pipelines:
  logs/1:
    receivers: [ otlp/in/1 ]
    processors: [ transform#1 ]
    exporters: [ otlp/out/1 ]
  logs/2:
    receivers: [ otlp/in/2 ]
    processors: [ transform#2 ]
    exporters: [ otlp/out/2 ]

In this example, we've defined a Component Configuration for a transform processor once, and then used it twice by applying an explicit name to each usage of it. Two separate instances are created because we explicitly indicated to do so. Additional scrutiny of this syntax is necessary but the point is that some syntax could presumably allow users to indicate whether an additional instance of the configuration is intended.

yurishkuro commented 1 month ago

Q: what are the benefits of using a custom syntax like transform#1 over standard YAML anchors? Personally, I prefer a solution where users need to learn fewer bespoke concepts.

mx-psi commented 1 month ago

I am also interested in why we need the #1 syntax instead of YAML anchors. One (very small) problem to overcome with the #1 syntax is that batch/something#1 (or any other syntax really, anything goes after /) is currently accepted as a valid name for a component (with type batch and name something#1), so we would have to restrict the names allowed in identifiers to support something like this (we may want to do that regardless of the outcome of this discussion; nobody complained when we restricted component.Type). The migration from the current state to explicitly instanced configurations is also something we would have to work on (migration could be automated, but we still don't have something like #7631).
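For illustration, a hedged sketch of the type[/name] parsing rule being referred to; parseID is illustrative, not the collector's actual implementation. Everything after the first / is accepted as the name today, which is why batch/something#1 parses cleanly:

package example

import "strings"

// parseID splits a component ID of the form type[/name]. Today the name part is
// essentially unrestricted, so "#" would have to be disallowed in names before it
// could be given special meaning.
func parseID(id string) (componentType, name string) {
	if i := strings.Index(id, "/"); i >= 0 {
		return id[:i], id[i+1:]
	}
	return id, ""
}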

Another question I have is whether it makes sense to consider the changes in isolation? Would it be a net positive if we land the processors change but not the receivers/exporters/connectors one, or vice versa?

djaglowski commented 1 month ago

Re #1 syntax vs YAML anchors - I only meant to highlight this as a possible mechanism to minimize the changes needed for migration, i.e. it would allow us to document an extremely simple migration path which would preserve exactly the same functionality as before. All a user would need to do is append #1, #2, etc. to reused processors. That said, I generally agree with @yurishkuro's point that this is bespoke functionality and ultimately we'd be better off without it. I'm going to strike this from the proposal so we can focus on the main ideas.

djaglowski commented 1 month ago

> Another question I have is whether it makes sense to consider the changes in isolation? Would it be a net positive if we land the processors change but not the receivers/exporters/connectors one, or vice versa?

This is a good question. It's prompted me to reframe how these changes could be described. I think we could look at it more generally as "instancing by pipeline" and "instancing by data type".

Processors are unique in that they are instanced by pipeline, but this effectively means they are also instanced by data type because every pipeline has a data type. Therefore, I think we could "step down" processors to "instancing by data type". If nothing else, this would have the benefit of reducing the overall complexity of the instancing discussion because all pipeline components would instance by data type only.

Then, we can look separately at whether instancing by data type is still beneficial.

mx-psi commented 1 month ago

After discussion on the 2024-07-11 Collector Stability working group, I am going to add this to self-observability milestone and the Collector 1.0 project board.

codeboten commented 1 month ago

@djaglowski I agree that the instancing question is quite complex. I wonder how instancing processors per data type would work across all the existing processors. Thinking through this, I think for most processors it wouldn't make much of a difference.

Would this be a problem for the tailsampling processor? An example that comes to mind where it might be a problem: rate limiting configuration applied on a per-signal basis instead of per-pipeline may cause unexpected behaviour changes.

mwear commented 1 month ago

I'm glad to see this as part of Collector: v1. Component instancing was a big challenge when implementing component health, and will be a challenge for users trying to understand the health of their collectors. I'm not sure what the ideal solution looks like, but I'm definitely supportive of this work. I'd be very interested in how we might be able to clean up the shared component complications.

djaglowski commented 1 month ago

> Would this be a problem for the tailsampling processor? An example that comes to mind where it might be a problem: rate limiting configuration applied on a per-signal basis instead of per-pipeline may cause unexpected behaviour changes.

Yes, this would certainly be a breaking change. I think my original example with the batch processor demonstrates the effect. Basically, if you have two logs pipelines that share an instance of the batch processor, then you are batching together logs from both pipelines. The same pattern would apply to rate limiting w/ the tailsampling processor.

I assume that if we implement any part of this proposal we will have to message it loud and clear. I'm confident there is a reasonable way to advise users on how to preserve the same behavior we have today, e.g. "If you want the same behavior as before, just declare an instance of the processor for each pipeline."

processors:
  tailsampling/1: ...
  # tailsampling/2: ...

service:
  pipelines:
    traces/1:
      receivers: ...
      processors: [ tailsampling/1 ]
      exporters: ...
    traces/2:
      receivers: ...
      processors: [ tailsampling/1 ] # shared with traces/1
      # processors: [ tailsampling/2 ] # distinct from traces/1
      exporters: ...

To expand on this, here's a simplified version of my original example:

    traces/1:
      receivers: [ in/1 ]
      processors: [ foo ]
      exporters: [ out/1 ]
    traces/2:
      receivers: [ in/2 ]
      processors: [ foo ]
      exporters: [ out/2 ]

Here, data flows from in/1 to foo, then fans out to both out/1 and out/2. Likewise, data from in/2 flows to foo, then fans out to both out/1 and out/2. So this is the same as having one pipeline which contains both receivers, the processor, and both exporters. Again, it's a breaking change, but it's simple enough to declare foo/1 and foo/2 if you want isolated pipelines.

There are also situations where sharing processors between pipelines is structurally problematic. Iterating on the above example a bit:

    traces/1:
      receivers: [ in/1 ]
      processors: [ foo, bar ]
      exporters: [ out/1 ]
    traces/2:
      receivers: [ in/2 ]
      processors: [ bar, foo ]
      exporters: [ out/2 ]

What happens here? I think the answer would have to be that there is a cycle, so it is an invalid configuration. But again, if we can say clearly that "each Component Configuration is instantiated exactly once (or once per data type)", then the solution is simply to explicitly declare your instances, e.g. use foo/1 and bar/1 in one pipeline, and bar/2 and foo/2 in the other.

All of this is to say that this proposal is explicitly a breaking change, but if we settle on a model of explicit instancing, it's much easier to reason about precisely what is shared and what is not shared.

djaglowski commented 1 month ago

> I'd be very interested in how we might be able to clean up the shared component complications.

If we move to a model where each Component Configuration is instantiated exactly once, then we no longer would need shared component at all. All components would just always work that way.

However, we need to determine how factories change, because instead of NewMetricsProcessor and NewLogsProcessor, we probably need a way to ask for NewProcessor(WithLogs(...), WithMetrics(...)), which changes component implementations. This certainly requires more consideration.

mwear commented 1 month ago

This makes sense. From the component health perspective, we will be interested in knowing what pipelines a component belongs to. I'm sure there are solutions to this problem and it's one that should be on our list of requirements.

mx-psi commented 1 month ago

Is there any way we can make this backwards compatible, or at least minimize user impact and breakage? I think this is a good thing to do, but I would like to avoid major changes in the configuration format at this point (minor changes or edge cases should be fine with a migration plan).

djaglowski commented 1 month ago

> Is there any way we can make this backwards compatible, or at least minimize user impact and breakage? I think this is a good thing to do, but I would like to avoid major changes in the configuration format at this point (minor changes or edge cases should be fine with a migration plan).

I think there would have to be a breaking change but we can do a lot to mitigate the pain.

First, I think it's likely that many users are not affected at all. Still, I would expect us to do at least the following:

  1. A slow-moving feature gate would control whether the current or new behavior is used. While the current behavior is in use, check for cases where processors are reused and print a warning with a link to guidance about how to migrate. (A sketch of registering such a gate follows this list.)
  2. A detailed blog post that explains the motivation for the change, along with diagrams and configuration examples to help users understand what is required to migrate.
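As a minimal sketch of the first item, using the collector's featuregate package (the gate ID, stage, and description below are placeholders, not an actual gate):

package example

import "go.opentelemetry.io/collector/featuregate"

// componentInstancingGate is a hypothetical slow-moving gate; it would start in Alpha
// and graduate over several releases.
var componentInstancingGate = featuregate.GlobalRegistry().MustRegister(
	"service.singleInstancePerConfiguration", // placeholder ID
	featuregate.StageAlpha,
	featuregate.WithRegisterDescription("When enabled, each Component Configuration is instantiated exactly once."),
	featuregate.WithRegisterReferenceURL("https://github.com/open-telemetry/opentelemetry-collector/issues/10534"),
)

// Elsewhere, the service would branch on componentInstancingGate.IsEnabled() to choose
// between the current and proposed behavior, and warn about reused processor
// configurations while the gate is disabled.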

As mentioned above, these are the situations where processor configuration reuse would need to be migrated:

  1. Reuse of stateful processor configuration. Since state would be shared across pipelines, it is likely that data streams will interact with each other.
  2. Serial use of the same processor configuration, e.g. otlp -> filter -> forward -> filter -> otlp. In this case, filter causes a cycle.
  3. Parallel use of the same processor configuration, e.g. otlp/in/1 -> filter -> otlp/out/1 and otlp/in/2 -> filter -> otlp/out/2. In this case, what comes out of filter flows to both exporters.

The examples needed to introduce users to YAML anchors can be quite verbose. Given the following configuration:

receivers:
  filelog/1: # Because this is a receiver, this represents one *instance*.
    include: local/anchors/1.log
  filelog/2:
    include: local/anchors/2.log

processors:
  # Currently, because this is a processor, this is *not* an instance, but a *reusable configuration*.
  batch: 
    send_batch_size: 10000
    timeout: 10s

exporters:
  file/1: # Because this is an exporter, this represents one *instance*.
    path: local/anchors/out/1.log
  file/2:
    path: local/anchors/out/2.log

service:
  pipelines:
    logs/1:
      receivers: [ filelog/1 ]
      processors: [ batch ] # Use the processor configuration once.
      exporters: [ file/1 ]
    logs/2:
      receivers: [ filelog/2 ]
      processors: [ batch ] # Use the same processor configuration again.
      exporters: [ file/2 ]

With the changed behavior, processors are instanced the same way as receivers and exporters. This means that the entries you define in the processors section are no longer reusable configurations but named instances, just as with receivers and exporters.

However, if you want to reuse a configuration, you can do so with YAML anchors. To migrate the above example, we can declare two separate instances of the batch processor. A YAML anchor allows us to reuse the configuration.

...

processors:
  # Declare a processor instance named 'batch/1' AND a configuration named 'myReusableBatchConfig'.
  batch/1: &myReusableBatchConfig 
    send_batch_size: 10000
    timeout: 10s

  # Declare another instance and reference the 'myReusableBatchConfig' configuration.
  batch/2: *myReusableBatchConfig 

...

service:
  pipelines:
    logs/1:
      receivers: [ filelog/1 ]
      processors: [ batch/1 ] # updated from 'batch' to 'batch/1'
      exporters: [ file/1 ]
    logs/2:
      receivers: [ filelog/2 ]
      processors: [ batch/2 ] # updated from 'batch' to 'batch/2'
      exporters: [ file/2 ]

djaglowski commented 1 month ago

I've opened https://github.com/open-telemetry/opentelemetry-collector/pull/10664 to demonstrate what instancing processors per data type could look like.