Introduce TargetAllocator CRD

swiatekm commented 7 months ago

Component(s)

target allocator

Describe the issue you're reporting

The Target Allocator subresource of the OpenTelemetryCollector CR has been growing in size over time, in particular due to various infrastructure-related attributes that users rightfully want to set for it. It’s time to split it into its own CRD, improving separation of concerns and reducing the size of the Collector CRD.

Design

The TargetAllocator CRD should contain all the fields the Target Allocator subresource currently does, to begin with. We will later continue adding fields to it, while possibly deprecating some fields from the subresource. The subresource should always only contain a subset of the CRD fields. Internally, the operator will create a CRD based on the subresource, which is how we’ll maintain backwards compatibility.

apiVersion: opentelemetry.io/v1alpha1
kind: TargetAllocator
metadata:
  name: example
spec:
  scrapeConfigs:
    - job_name: otel-collector
      scrape_interval: 10s
      static_configs:
      - targets:
        - 0.0.0.0:8888
        - 0.0.0.0:9999
  prometheusCR:
    enabled: true
  ..

Relation to the Collector CR

A Collector can be configured to use a given target allocator by setting a label on the collector CR:

apiVersion: opentelemetry.io/v1beta1
kind: TargetAllocator
metadata:
  name: example
  labels:
    opentelemetry.io/target-allocator: example
spec:
  config:
    receivers:
      prometheus:
  ...

Collectors and Target Allocators are independent, neither owns the other, and each can in principle run on their own. By establishing a connection between them, certain aspects of their configuration will change under the hood.

Rollout

We should be able to roll out most of the necessary changes without any user-facing change:

Add the type definitions for the TargetAllocator CRD
Generate the TargetAllocator CR from the subresource
Use the generated TargetAllocator CR to generate target allocator manifests
Add webhooks for TargetAllocator without enabling them
Enable the webhooks and reconciliation for the TargetAllocator CR

jaronoff97 commented 7 months ago

some thoughts:

I think this should only be in v2 i.e. apiVersion: opentelemetry.io/v1alpha2 This will let us take advantage of the work for #901 (embedding a common fields struct)
Having this in v2 only would let us begin with some fields in the TA spec deprecated or removed
I think there will need to be a reference from a target allocator to a collector. We should improve this not only for the TA but also for instrumentation by introducing a new type or field called "collectorRef" or something like that which the operator can validate and automatically set up the networking
For the ownership of the CRD, I think we could set a controller reference from the collector to properly cascade a delete, what would happen if someone deletes their TA?
I don't think the relation needs to be bidirectional, it should be enough to infer the relationship from the TA to the collector by doing the validation i mentioned above. i.e. we get a request to reconcile a TA CRD at which point we validate the collector ref and then update the collector manifests (read: config) appropriately
for the rollout plan, i blieve we would need to enable webhooks and reconciliation before we start generating the TA CR from the subresource, unless we wanted to support both the subresource and the CRD. I think we should only support one at a time so I think the rollout could go like this:

Rollout

Add the type definitions for the TargetAllocator CRD
Use the generated TargetAllocator CR to generate target allocator manifests
Add webhooks for TargetAllocator without enabling them
Enable the webhooks and reconciliation for the TargetAllocator CR
Generate the TargetAllocator CR from the subresource

swiatekm commented 7 months ago

I think this should only be in v2 i.e. apiVersion: opentelemetry.io/v1alpha2 This will let us take advantage of the work for https://github.com/open-telemetry/opentelemetry-operator/issues/901 (embedding a common fields struct) Having this in v2 only would let us begin with some fields in the TA spec deprecated or removed

Agreed, although I'm not sure I want it to tie enabling it to v2 release. We can always release v2 collector and enable the TA CRD in a later version.

I think there will need to be a reference from a target allocator to a collector. We should improve this not only for the TA but also for instrumentation by introducing a new type or field called "collectorRef" or something like that which the operator can validate and automatically set up the networking For the ownership of the CRD, I think we could set a controller reference from the collector to properly cascade a delete, what would happen if someone deletes their TA? I don't think the relation needs to be bidirectional, it should be enough to infer the relationship from the TA to the collector by doing the validation i mentioned above. i.e. we get a request to reconcile a TA CRD at which point we validate the collector ref and then update the collector manifests (read: config) appropriately

Wouldn't it be simpler to have an optional targetAllocatorRef field on the Collector CR, and set the Collector CR as the owner of the TargetAllocator CR? Then you get the cascading delete and can know which Collector owns the TA from ownerReferences. Then we could update Collector config in Collector reconciliation, which sounds much simpler.

From a usability POV, I think it's important to be able to tell just from the Collector CR whether the Collector as a TA associated.

for the rollout plan, i blieve we would need to enable webhooks and reconciliation before we start generating the TA CR from the subresource, unless we wanted to support both the subresource and the CRD. I think we should only support one at a time so I think the rollout could go like this:

I meant that we can generate the TA CR, but not actually apply it, but instead add it to the context and use it to generate TA manifests. This way our manifest generation would already be using the same logic as for "real" TA CRs, and the only change would be to actually apply the CR once we're ready to enable it. Does that make sense? I'll put up a draft PR showing what I have in mind later.

This way, it'd be easy to support the CR and the subresource simultaneously, which we should do in my opinion. It might even make sense to support the subresource indefinitely for users who just need a simple configuration and don't want the complexity of dealing with multiple CRs.

pavolloffay commented 7 months ago

Wouldn't it be simpler to have an optional targetAllocatorRef field on the Collector CR

+1

swiatekm commented 6 months ago

I've thought about this a bit more, and I think I've arrived at a solution to the two-way binding problem. Our requirements are:

The TargetAllocator CR should have some way of referring to a set of collector Pods
The TargetAllocator should run even in the absence of said collectors, doing nothing
It should be easy to have a TargetAllocator CR target Pods from a particular Collector CR

I propose that we use the standard label matching mechanism for this, and also provide a shortcut via an annotation.

To directly set a collector selector:

apiVersion: opentelemetry.io/v1alpha2
kind: TargetAllocator
metadata:
  name: example
spec:
  collectorSelector:
    matchLabels:
      app.kubernetes.io/instance: example-collector
  ...

Or alternatively, via an annotation:

apiVersion: opentelemetry.io/v1alpha2
kind: TargetAllocator
metadata:
  name: example
  annotations:
    targetallocator.opentelemetry.io/collector: "example-collector"

Then the resolution flow for the Collector CR with a TargetAllocator reference becomes:

Get the TargetAllocator resource based on the ref
Set the annotation on it, pointing to Collector CR
TargetAllocator reconciler picks up the change and sets the right selector on TargetAllocator CR
TargetAllocator is ready to allocate targets once Collector Pods become ready

The Collector CR depends on the TargetAllocator CR in this flow, but not vice versa.

jaronoff97 commented 6 months ago

Confirming I support this idea! After our discussion in the SIG, I think this is a good path forward. For now, I think we should just begin with setting the collector setter and open a separate issue for the annotation to be used as a convenience after some mileage with the new types.

swiatekm commented 1 month ago

I'm a bit stuck on the shortcut mechanism for connecting OpenTelemetryCollector and TargetAllocator CRs. The approach from https://github.com/open-telemetry/opentelemetry-operator/issues/2422#issuecomment-1862635932 works, but I don't like its ergonomics.

The requirements for this connection are the following:

TargetAllocator needs a selector for collector Pods to assign targets to.
TargetAllocator needs to know the scrape configs the collector has in its prometheus receiver configuration.
Collector needs to know which endpoint it should get targets from.

So in principle, the collector needs to know which target allocator it's connected to, and the target allocator also needs to know which collectors it's assigning targets to.

If we use an annotation or label for this, we need to decide what the source of truth is. What happens if a user sets an annotation, and also sets their own collector selector in the TargetAllocator? Which setting wins, or should we just error and refuse to reconcile?

The other question is where the annotation or label should be placed. On both CRs would be most explicit, but also a bit redundant and probably a bit more annoying for users. Putting it on only one of the CRs is sufficient, especially if it's a label, but then the other CR has its behaviour altered without any change to the CR itself. Maybe this is fine? This is how EndpointSlices are assigned to Services, for example.

swiatekm commented 1 month ago

Discussed the above during the SIG meeting. The conclusions were:

Manage the connection via a label on the OpenTelemetryCollector CR. It is then easy for the Target Allocator reconciliation to track changes to the CR.
Add the collectors connected to the TargetAllocator CR to the status of that CR. This makes the connection visible without requiring the user to specify it twice.
Drop the collectorSelector field and only use the label. This supports basically all the required use cases.
In case of conflicts in Scrape Configs, just use all of them. It's up to the user to ensure there's no duplicates.

swiatekm commented 3 weeks ago

I realized that the above results in some awkwardness with the embedded TargetAllocator, as we'd need to add a label to the Collector CR while reconciling it, despite the Collector CR owning the TargetAllocator CR. But I think we can just use owner references instead of the label in that case.

open-telemetry / opentelemetry-operator