open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

set() in Transform processor appears to be getting values from other logs #32080

Open mut3 opened 5 months ago

mut3 commented 5 months ago

Component(s)

pkg/ottl, pkg/stanza, processor/transform, receiver/filelog

What happened?

Description

We have a pipeline that uses a filelog receiver with some operators that set certain fields in attributes. We then have a transform processor that reads those fields to set() some resource.attributes. The values that get set in the resource attributes of the log appear to be from a different log, as if the transform is reading from one log and writing to another.

Steps to Reproduce

I do not currently have an MVC for this issue, but I will include a stripped and sanitized config that contains the core logic/entities around the bug we are seeing.

Configuration

```yaml
receivers:
  filelog:
    start_at: beginning
    operators:
      # route and parse log based on formatting
      - id: get-format
        type: router
        routes:
          - expr: body matches "^\\{"
            output: parser-docker
          - expr: body matches "^[^ Z]+ "
            output: parser-crio
          - expr: body matches "^[^ Z]+Z"
            output: parser-containerd
      - id: parser-crio
        type: regex_parser
        regex: ^(?P
```
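Our full config is elided above; a minimal sketch of the kind of transform statements involved (attribute names taken from the Expected Result below, so treat this as an approximation rather than our exact statements):

```yaml
processors:
  transform:
    log_statements:
      - context: log
        statements:
          # copy per-record attributes onto the (shared) resource
          - set(resource.attributes["k8s.pod.name"], attributes["pod_name"])
          - set(resource.attributes["k8s.namespace.name"], attributes["namespace"])
```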

Expected Result

Logs where attributes["pod_name"] == resource.attributes["k8s.pod.name"] and resource.attributes["k8s.namespace.name"] == attributes["namespace"]

This is the case for most of the logs emitted from our pipeline

Actual Result

For some small percentage of our logs, the values of resource.attributes["k8s.pod.name"] and resource.attributes["k8s.namespace.name"] do not match their attribute counterparts, but instead appear to come from a different log event. The fields appear to be internally consistent, as if they all came from the same other log event. For example, if resource.attributes["k8s.pod.name"] is wrong and holds the name of some other pod, then resource.attributes["k8s.namespace.name"] will have the namespace of that other pod.

Here is a sanitized example log JSON pulled from our Elasticsearch:

Sanitized incorrect log

```json
{
  "_index": ".ds-otel-logs-stg-my-app-2024.03.07-000022",
  "_id": "Pz1km44B8ceE1n7Gbbpq",
  "_version": 1,
  "_score": 0,
  "_source": {
    "@timestamp": "2024-04-01T20:40:10.983479227Z",
    "Attributes": {
      "OTEL_collector_type": "daemonset",
      "canonical": { "owner": { "team_name": "unknown" } },
      "cloud": "aws",
      "cluster": "ci-stg",
      "container_name": "cilium-agent",
      "elasticsearch": { "index": { "suffix": "my-app" } },
      "environment": "stg",
      "log": {
        "file": { "path": "/var/log/pods/kube-system_cilium-hlkmw_fb7fae7c-d3ac-4308-885d-5d1426c96bcc/cilium-agent/0.log" },
        "iostream": "stderr"
      },
      "logtag": "F",
      "namespace": "kube-system",
      "pod_name": "cilium-hlkmw",
      "restart_count": "0",
      "stream": "stderr",
      "time": "2024-04-01T20:40:10.983479227Z",
      "uid": "fb7fae7c-d3ac-4308-885d-5d1426c96bcc"
    },
    "Body": "level=info msg=\"Delete endpoint request\" containerID=593b7f5fe7 subsys=daemon",
    "Resource": {
      "k8s": {
        "container": { "name": "my-app", "restart_count": "0" },
        "namespace": { "name": "my-app-namespace" },
        "pod": { "name": "my-app-65b4cf8769-bb66g", "uid": "33b38675-15a7-4f64-a3f7-f38a97455ce6" }
      }
    },
    "Scope": { "name": "", "version": "" },
    "SeverityNumber": 0,
    "TraceFlags": 0
  },
  "fields": {
    "Attributes.uid": ["fb7fae7c-d3ac-4308-885d-5d1426c96bcc"],
    "Attributes.stream": ["stderr"],
    "Attributes.environment": ["stg"],
    "Attributes.canonical.owner.team_name": ["unknown"],
    "Resource.k8s.container.restart_count": ["0"],
    "Attributes.log.iostream": ["stderr"],
    "Attributes.restart_count": ["0"],
    "Attributes.pod_name": ["cilium-hlkmw"],
    "Resource.k8s.namespace.name": ["my-app-namespace"],
    "TraceFlags": [0],
    "Resource.k8s.pod.name": ["my-app-65b4cf8769-bb66g"],
    "Scope.name": [""],
    "Attributes.logtag": ["F"],
    "Resource.k8s.pod.uid": ["33b38675-15a7-4f64-a3f7-f38a97455ce6"],
    "Attributes.elasticsearch.index.suffix": ["my-app"],
    "Attributes.cloud": ["aws"],
    "SeverityNumber": [0],
    "Attributes.log.file.path": ["/var/log/pods/kube-system_cilium-hlkmw_fb7fae7c-d3ac-4308-885d-5d1426c96bcc/cilium-agent/0.log"],
    "Body": ["level=info msg=\"Delete endpoint request\" containerID=593b7f5fe7 subsys=daemon"],
    "Attributes.time": ["2024-04-01T20:40:10.983Z"],
    "Attributes.namespace": ["kube-system"],
    "Resource.k8s.container.name": ["my-app"],
    "@timestamp": ["2024-04-01T20:40:10.983Z"],
    "Attributes.container_name": ["cilium-agent"]
  }
}
```

Collector version

v0.96.0

Environment information

Environment

OS: Amazon Linux 2023 (Amazon's v1.25.16-eks AMI)
Compiler (if manually compiled): Collector Helm Chart v0.84.0

OpenTelemetry Collector configuration

See Steps to Reproduce

Log output

No response

Additional context

We never experience this issue when these same moves to resource.attributes are done via operators on the filelog receiver:

All operator solution

```yaml
receivers:
  filelog:
    start_at: beginning
    operators:
      - id: get-format
        routes:
          - expr: body matches "^\\{"
            output: parser-docker
          - expr: body matches "^[^ Z]+ "
            output: parser-crio
          - expr: body matches "^[^ Z]+Z"
            output: parser-containerd
        type: router
      - id: parser-crio
        regex: ^(?P
```
github-actions[bot] commented 5 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

djaglowski commented 5 months ago

At a glance, I believe the issue is caused by the fact that OTTL operates on a hierarchical data model which groups multiple log records by their shared resource attributes. In my opinion, when setting a value on the resource that comes from an individual log, this should produce a new resource with the updated value and move the log record into that resource.

An example with much unnecessary detail removed:

# input
logs:
  - resource:
    attributes:
      foo: bar
    records:
      - attributes:
          some.key: left
        body: Hello one
      - attributes:
          some.key: right
        body: Hello two

# Current result if we run set(resource.attributes["num"], attributes["some.key"])
logs:
  - resource:
    attributes:
      foo: bar
      num: right # was "left" for one iteration then was overwritten
    records:
      - attributes:
          some.key: left
        body: Hello one
      - attributes:
          some.key: right
        body: Hello two

# Correct result (in my opinion)
logs:
  - resource:
    attributes:
      foo: bar
      num: left
    records:
      - attributes:
          some.key: left
        body: Hello one
  - resource:
    attributes:
      foo: bar
      num: right
    records:
      - attributes:
          some.key: right
        body: Hello two

In other words, the fact that our pdata packages represent data in a hierarchical manner should be an implementation detail rather than a guiding principle. User-directed modifications should generally work as if the data model is completely flat (one log record per scope and resource) and split hierarchical structures accordingly.

mut3 commented 5 months ago

Thank you for the explanation!

Based on my understanding, then, we should see far fewer issues like this if we do not try to use OTTL to modify resource attributes in processors. Would you expect that to work around this behavior caused by the pdata structure grouping logs by resource?

TylerHelmuth commented 5 months ago

@djaglowski is correct.

When using the log context in the transformprocessor, it cycles through each log record, executing each statement in order. When you use a value from a log to set a property on that log's scope or resource, all the logs for that scope/resource will try to do the same unless you add some clever Where clauses.

As @djaglowski showed, if two logs in a resource have different values, the last (or first, depending on Where clauses) will win; you can't get both values in the same resource attribute key unless you concatenate them.
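For example, reusing the keys from the example above (this is a sketch, not a statement from the original config), a Where clause pins which record's value lands on the shared resource:

```yaml
processors:
  transform:
    log_statements:
      - context: log
        statements:
          # only records whose some.key is "left" write to the shared resource
          - set(resource.attributes["num"], attributes["some.key"]) where attributes["some.key"] == "left"
```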

When setting a value on the resource that comes from an individual log, this should produce a new resource with the updated value and move the log record into that resource.

I'm not sure we want this to be the default behavior either. It is totally valid for all the spans/datapoints/log records to share an attribute that the user wants set on the resource. If we always created a new resource each time we set from a "lower" context, we could end up with duplicate resources across the slice.

I also believe that if set had the capability of creating new Resource*Slice entries in the payload, it would be overloaded. It would no longer only be setting an attribute, but also making the decision to create a new Resource*Slice. Having the function make that decision unilaterally feels incorrect for an OTTL function.

This issue has been known, but confusing/limiting, behavior since OTTL's inception. I think we'll need a more complex solution than to always split it out into its own Resource/Scope/Log_record combo. I see a couple options:

@mut3 since you're using the filelog receiver to collect this data, we highly recommend doing as much work during collection as you can. I recommend using the move operator to set the resource attributes instead of the transformprocessor, since you have the opportunity to do it via the receiver. In general, the fewer components in your pipeline, the better.
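A minimal sketch of that approach, assuming the attribute names from the original report (the exact fields will differ):

```yaml
receivers:
  filelog:
    operators:
      # move per-record attributes onto the resource at collection time
      - type: move
        from: attributes.pod_name
        to: resource["k8s.pod.name"]
      - type: move
        from: attributes.namespace
        to: resource["k8s.namespace.name"]
```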

If you must use the transformprocessor for this data, I believe you could hack together a solution like this if you know the exact keys of all your resource attributes:

  1. Use a transformprocessor to move all the resource attributes to the logs.
  2. Use a groupbyattrs processor to rearrange the resource/scope/log relationship using ALL the resource keys you merged in step 1 and ALL the attribute keys you were trying to set.
  3. Use another transformprocessor to delete the resource keys from the log records (a rough sketch of these steps follows below).
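A rough sketch of those three steps, with a single hypothetical resource key standing in for the real ones (the real config would need to list every resource and record key involved):

```yaml
processors:
  # step 1: copy existing resource attributes down onto each log record
  transform/copy_down:
    log_statements:
      - context: log
        statements:
          - set(attributes["k8s.container.name"], resource.attributes["k8s.container.name"])
  # step 2: regroup records using the copied-down resource keys plus the record keys being promoted
  groupbyattrs:
    keys:
      - k8s.container.name
      - pod_name
      - namespace
  # step 3: drop any copied-down resource keys still left on the records
  transform/cleanup:
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "k8s.container.name")
```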

I highly recommend using the filelog receiver instead of doing that approach lol

TylerHelmuth commented 5 months ago

/cc @evan-bradley

mut3 commented 5 months ago

Thank you!

We moved all our resource.attributes setters back to filelog operators and that appears to be working as expected.

Maybe the transformprocessor docs should include some warnings about modifying resource.attributes from the log context. As a new user, I was not at all aware of this danger from my reading of the transformprocessor and OTTL context documentation.

djaglowski commented 4 months ago

When setting a value on the resource that comes from an individual log, this should produce a new resource with the updated value and move the log record into that resource.

I'm not sure we want this to be the default behavior either. It is totally valid for all the spans/datapoints/log records to share an attribute that the user wants set on the resource. If we always created a new resource each time we set from a "lower" context, we could end up with duplicate resources across the slice.

That's fair. What I should have said is move the log to a resource with the updated value, meaning that it either finds or creates the appropriate resource.

This issue has been known, but confusing/limiting, behavior since OTTL's inception. I think we'll need a more complex solution than to always split it out into its own Resource/Scope/Log_record combo.

It's a known issue but I think it's fundamentally a valid one which should eventually be fixed. The hierarchical representation (grouping logs/metrics/traces by resource and scope) is a direct reflection of pdata but not necessarily prescribed by the project's data model. In my opinion OTTL should not be designed for pdata but rather around it. That is, when data is moved or copied between contexts, it should behave logically as if the lower level context has its own copy of the higher level context(s). Representing the end result is an implementation detail.

Put another way, OTTL is treating resources and scopes as references, but users need to work with them as values.

cdanis commented 3 months ago

It's a known issue but I think it's fundamentally a valid one which should eventually be fixed. The hierarchical representation (grouping logs/metrics/traces by resource and scope) is a direct reflection of pdata but not necessarily prescribed by the project's data model. In my opinion OTTL should not be designed for pdata but rather around it. That is, when data is moved or copied between contexts, it should behave logically as if the lower level context has its own copy of the higher level context(s). Representing the end result is an implementation detail.

Put another way, OTTL is treating resources and scopes as references, but users need to work with them as values.

Hi. I am pretty sure I just got bit by this issue when trying to copy Span attributes into Resource attributes.

There's more context in the Wikimedia bug tracker but the short version is:

I'll see if I can use the groupbyattrs processor to work around this.

But, this is really confusing, and entirely undocumented. The transformprocessor documentation even suggests that you can do something like this, in the Contexts section of its README.md:

This means that statements associated to a span/metric/log have access to the telemetry's instrumentation scope, and resource.

From an OTTL end user's perspective, that sentence is only true for very specific meanings of the words "associated" and "the" :sweat_smile:

TylerHelmuth commented 3 months ago

If all the spans in the resource don't have the same upstream_cluster.name and _k8s_namespace, you'll experience the behavior described in this issue. We are open to better documentation around this behavior when writing to "higher" fields in the TransformContext.

As for a change to OTTL's behavior, @djaglowski's idea of "move the log to a resource with the updated value, meaning that it either finds or creates the appropriate resource" would solve the issue, but it would be a large breaking change to OTTL, which is currently designed to act only within the TransformContext and not interact with anything outside of it. I am open to solutions to this problem, but I don't have time to solve it myself right now.

I'm also open to documenting this behavior. Since it is a write-only problem, I believe it only affects the transformprocessor.

djaglowski commented 3 months ago

which is currently designed to act only within the TransformContext and not interact with anything outside of it.

I think there is a contradiction here. The design cannot consider modifications to a resource or scope to be "only within the TransformContext" of a log record when we know those changes are directly modifying the TransformContexts of other log records as well.

TylerHelmuth commented 3 months ago

I think there is a contradiction here. The design cannot consider modifications to a resource or scope to be "only within the TransformContext" of a log record when we know those changes are directly modifying the TransformContexts of other log records as well.

I was implying that the framework is made to act on a singular TransformContext, inclusive of a resource/scope, so if a transformation needed to go find a different resource to act upon, it wouldn't have access to it.

I'm sure it is solvable, but I'd want to make sure that we don't lose:

  1. OTTL's performance
  2. OTLP hierarchical efficiency (I don't want to make a bunch of different ResourceLogs that have the exact same Resource)
  3. OTTL's reasonably simple syntax/language (we aren't trying to make OTTL Turing complete or anything)

If we can keep those things intact then we get the best of both sides: users can correctly transform fields from "higher" contexts in a fast and efficient way.

It is possible some ideas from https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30800 would help here.

djaglowski commented 3 months ago

Here's a possible solution which would leave the data corruption issue in place but would also provide a decent workaround for it.

In short, we could add a new TransformContext for the top level struct of each data type (e.g. plog.Logs) and within this context add Flatten and Unflatten functions. The user can then Flatten, transform with a lower level context, and then Unflatten.

Alternatively, perhaps Flatten / Unflatten can be automatically wrapped around the transformprocessor based on a flag in the config.

Example

Input

Say we have the following plog.Logs. I'm omitting scopes for the sake of simplicity, but I think the principles described below apply similarly.

resources:
  - attributes: { name: AAA }
    records:
      - attributes: { host: Red }
        body: First record
      - attributes: { host: Blue }
        body: Second record
      - attributes: { host: Red }
        body: Third record
      - attributes: { host: Blue }
        body: Fourth record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Red }
        body: First record
      - attributes: { host: Blue }
        body: Second record
      - attributes: { host: Red }
        body: Third record
      - attributes: { host: Blue }
        body: Fourth record

Flatten

A Flatten function would produce a plog.ResourceLogs per plog.LogRecord, so that each plog.ResourceLogs represents a non-overlapping TransformContext. Therefore, modifications can now be applied without concern for data corruption between records.

resources:
  - attributes: { name: AAA }
    records:
      - attributes: { host: Red }
        body: First record
  - attributes: { name: AAA }
    records:
      - attributes: { host: Blue }
        body: Second record
  - attributes: { name: AAA }
    records:
      - attributes: { host: Red }
        body: Third record
  - attributes: { name: AAA }
    records:
      - attributes: { host: Blue }
        body: Fourth record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Red }
        body: First record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Blue }
        body: Second record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Red }
        body: Third record
  - attributes:  { name: BBBBBB }
    records:
      - attributes: { host: Blue }
        body: Fourth record

Transform

The user can then apply a transformation, e.g. moving a log record attribute to a resource attribute.

resources:
  - attributes: { name: AAA, host: Red }
    records:
      - attributes: {}
        body: First record
  - attributes: { name: AAA, host: Blue }
    records:
      - attributes: {}
        body: Second record
  - attributes: { name: AAA, host: Red }
    records:
      - attributes: {}
        body: Third record
  - attributes: { name: AAA, host: Blue }
    records:
      - attributes: {}
        body: Fourth record
  - attributes: { name: BBBBBB, host: Red }
    records:
      - attributes: {}
        body: First record
  - attributes: { name: BBBBBB, host: Blue }
    records:
      - attributes: {}
        body: Second record
  - attributes: { name: BBBBBB, host: Red }
    records:
      - attributes: {}
        body: Third record
  - attributes: { name: BBBBBB, host: Blue }
    records:
      - attributes: {}
        body: Fourth record

Unflatten

Then an Unflatten function regroups the logs by resource.

resources:
  - attributes: { name: AAA, host: Red }
    records:
      - attributes: {}
        body: First record
      - attributes: {}
        body: Third record
  - attributes: { name: AAA, host: Blue }
    records:
      - attributes: {}
        body: Second record
      - attributes: {}
        body: Fourth record
  - attributes: { name: BBBBBB, host: Red }
    records:
      - attributes: {}
        body: First record
      - attributes: {}
        body: Third record
  - attributes: { name: BBBBBB, host: Blue }
    records:
      - attributes: {}
        body: Second record
      - attributes: {}
        body: Fourth record

Order & Reversibility

Since plog manages resources and records as elements within slices, e.g. plog.ResourceLogsSlice, Flatten and Unflatten can both preserve the order of log records. This means that Flatten and Unflatten are both reversible operations, i.e. X == Unflatten( Flatten( X ) ).

Additionally, for any transformation T we can say that Unflatten( T( Flatten( X ) ) ) is always at least deterministic, and often preserves order in limited ways. Finally, if a transformation T is itself reversible then the entire process is reversible as well.
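To make the proposal concrete, a purely hypothetical configuration might look like the sketch below; the top-level logs context and the Flatten/Unflatten functions do not exist today, so all names here are illustrative only:

```yaml
processors:
  transform:
    log_statements:
      # hypothetical top-level context and functions from the proposal above
      - context: logs
        statements:
          - Flatten()
      # existing log context; move a record attribute onto its (now private) resource
      - context: log
        statements:
          - set(resource.attributes["host"], attributes["host"])
      # regroup records that ended up with identical resources
      - context: logs
        statements:
          - Unflatten()
```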

evan-bradley commented 3 months ago

In short, we could add a new TransformContext for the top level struct of each data type (e.g. plog.Logs) and within this context add Flatten and Unflatten functions. The user can then Flatten, transform with a lower level context, and then Unflatten.

Alternatively, perhaps Flatten / Unflatten can be automatically wrapped around the transformprocessor based on a flag in the config.

I see how this would solve the issue at hand, but I'd like to avoid the extra steps if possible, unless flattening/unflattening achieves a goal beyond moving log records to a new resource object when modifying the resource from the log context.

If we're looking at creating new contexts, could we create contexts that handle the flattening/unflattening/copying/etc. operations for us? We could split the current resource path parser for each signal into separate parsers for each model, e.g. resource_rw for editing the resource without copies, and resource_ro for producing a new resource upon modification (happy to discuss other names like orig and copy or whatever is most obvious).

The ro context could have access to the parent signal objects on its interface and in its GetSetters could modify those objects when they are written to. So calling set(resource_ro["my_log_attr"], "attrvalue") inside the log context would move the log record to a new resource that has the new attribute/attribute value if a copy hasn't already been created, or would move it to an existing resource if one already exists with the same identity.

There would need to be a lot of internal machinery to support these operations, and we would need to decide upon and document what comprises a resource object's identity, but I think this would solve this issue in a way that's both backwards-compatible and feels native to OTTL.

github-actions[bot] commented 1 month ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth commented 2 weeks ago

The flatten_data config option and associated feature gate have been in place for a while now and I've seen it solve end-user problems on multiple occasions. I would be ok moving forward with this solution for all signals.
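For reference, a rough sketch of how the existing logs-only option is enabled (this assumes the current transformprocessor config layout; check the processor README for the exact feature gate name):

```yaml
processors:
  transform:
    flatten_data: true  # logs-only today; requires the associated feature gate
    log_statements:
      - context: log
        statements:
          - set(resource.attributes["k8s.pod.name"], attributes["pod_name"])
```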

For the sake of configuration simplicity, and compatibility with the current config, it would be nice if flatten_data could apply to all signals. For the sake of performance, though, would it be better to have signal-specific configuration? The counterargument would be to use different transformprocessor instances if a particular signal does not need flattening.

djaglowski commented 2 weeks ago

A related question here is exactly how flattening behaves for traces and metrics.

For traces, I could see a need for every span to have its own copy of resource & scope. Or, you may need every span event to have its own copy of resource, scope, and span.

For metrics, would it ever be required to give each metric its own copy of resource & scope? Is it ever needed for each data point?

Maybe there's a general rule that it is possible to flatten down to any context?

TylerHelmuth commented 2 weeks ago

Maybe there's a general rule that it is possible to flatten down to any context?

Ya this is what I had in mind. Given flatten_data is enabled, the data is flattened for the specified context. So for the span context, each span gets its own resource and scope. For the datapoint context each datapoint gets its own resource, scope, and metric.

And then like the current implementation, after transformations occur the data would be regrouped.

For spanevent and datapoint this flattening would be pretty inefficient. But also it would be an advanced, opt-in situation.

djaglowski commented 2 weeks ago

@TylerHelmuth, I have run into a few use cases lately that reminded me that Flatten and Unflatten as OTTL functions would be useful:

TylerHelmuth commented 2 weeks ago

@djaglowski I think there have been enough requests for these features that I am open to more fine-grained control via functions. We also have the proper testing framework in place to ensure they work as expected.

I am curious if Flatten and Unflatten should instead be editors that work directly on the data. The idea of using Flatten to set something in the cache or an attribute doesn't feel quite right.

djaglowski commented 2 weeks ago

I am curious if Flatten and Unflatten should instead be editors that work directly on the data. The idea of using Flatten to set something in the cache or an attribute doesn't feel quite right.

I agree with this.