Log flooded with `string to non-string with op EQ`

atoulme commented 2 years ago

What happened?

Description

Upgraded to 0.60.0. Using a transform processor with a clause such as:

    transform:
      metrics:
        queries:
          - set(attributes["foo"], "bar") where attributes["foo"] == "foobar"

Steps to Reproduce

Set up a pipeline using a transform processor that uses an eq operator.

Expected Result

The pipeline should perform correctly.

Actual Result

Logs are flooded with:

string to non-string with op EQ
string to non-string with op EQ
string to non-string with op EQ
string to non-string with op EQ
string to non-string with op EQ
string to non-string with op EQ
string to non-string with op EQ
string to non-string with op EQ

Collector version

0.60.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04") Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
    splunk_hec:

exporters:
    splunk_hec/metrics:
        # Splunk HTTP Event Collector token.
        token: "00000000-0000-0000-0000-0000000000000"
        # URL to a Splunk instance to send data to.
        endpoint: "https://splunk:8088/services/collector"
        # Optional Splunk source: https://docs.splunk.com/Splexicon:Source
        source: "app"
        # Optional Splunk source type: https://docs.splunk.com/Splexicon:Sourcetype
        sourcetype: "telegraf"
        # Splunk index, optional name of the Splunk index targeted.
        index: "metrics"
        # Maximum HTTP connections to use simultaneously when sending data. Defaults to 100.
        max_connections: 20
        # Whether to disable gzip compression over HTTP. Defaults to false.
        disable_compression: false
        # HTTP timeout when sending data. Defaults to 10s.
        timeout: 10s
        # Whether to skip checking the certificate of the HEC endpoint when sending data over HTTPS. Defaults to false.
        # For this demo, we use a self-signed certificate on the Splunk docker instance, so this flag is set to true.
        insecure_skip_verify: true

processors:
    batch:
    transform:
      metrics:
        queries:
          - set(attributes["foo"], "bar") where attributes["foo"] == "foobar"
extensions:
    health_check:
      endpoint: 0.0.0.0:13133
    pprof:
      endpoint: :1888
    zpages:
      endpoint: :55679

service:
    telemetry:
      logs:
        level: "debug"
    extensions: [pprof, zpages, health_check]
    pipelines:
      metrics:
        receivers: [splunk_hec]
        processors: [batch, transform]
        exporters: [splunk_hec/metrics]

Log output

No response

Additional context

No response

github-actions[bot] commented 2 years ago

Pinging code owners: @TylerHelmuth @kentquirk @bogdandrutu. See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth commented 2 years ago

haha in trying to help users we destroyed their logs. Should definitely be logger.Debug.

dmitryax commented 2 years ago

@atoulme please add more context about the incoming data. I assume the metrics just don't always have the "foo" attribute, right? I don't think the output is incorrect in that case. Setting debug log level for all the logs can also hide other potential issues that user should be aware of, so I'm not sure if that's the right solution.

I believe the expression should have an additional condition verifying that metrics have foo attribute. Something like this:

set(attributes["foo"], "bar") where "foo" in attributes && attributes["foo"] == "foobar"

@TylerHelmuth does OTTL currently support something like this?

atoulme commented 2 years ago

Yes, specifically set a pipeline that doesn't have this attribute set to reproduce the issue. At the very least, this is a breaking change behavior that wasn't documented as part of the release and produces surprising output for end users, flooding their logs. I have not identified a workaround to address this behavior, please help.

kentquirk commented 2 years ago

OTTL does short-circuit evaluation and so something like set(attributes["foo"], "bar") where attributes["foo"] != nil && attributes["foo"] == "foobar" should prevent the error for the time being.

TylerHelmuth commented 2 years ago

@dmitryax In this situation the invalidComparison func exists to ensure that a boolean always bubbles up based on the comparison rules. We didn't want to silently use the "default" return value of false in these invalid situations, which is where the print statement came from.

I was initially thinking this log, and others like it, would be Debug level so that customers can set their log level to debug when trying to troubleshoot issues and then see this problem. I thought Debug log level printed all log statements; does it exclude Error, Warning, and Info?

dmitryax commented 2 years ago

@dmitryax In this situation the invalidComparison func exists to ensure that a boolean always bubbles up based on the comparison rules. We didn't want to silently use the "default" return value of false in these invalid situations, which is where the print statement came from.

I was initially thinking this log, and others like it, would be Debug level so that customers can set their log level to debug when trying to troubleshoot issues and then see this problem.

Just using print is not the best option. I agree that this particular error should be of debug level. But we need to bring logger there as @atoulme did.

I thought Debug log level printed all log statements; does it exclude Error, Warning, and Info?

If a user sets telemetry log level to Debug in their config, logs of all levels that are equal to debug or higher will be printed, it's everything excluding trace level

TylerHelmuth commented 2 years ago

Just using print is not the best option. I agree that this particular error should be of debug level. But we need to bring logger there as @atoulme did.

Agreed, his PR is what we need. @kentquirk and I have already provided feedback for next steps with that PR (member functions)

atoulme commented 2 years ago

OTTL does short-circuit evaluation and so something like set(attributes["foo"], "bar") where attributes["foo"] != nil && attributes["foo"] == "foobar" should prevent the error for the time being.

We tried this - it appears && is not supported and should be replaced with and.

With set(attributes["foo"], "bar") where attributes["foo"] != nil and attributes["foo"] == "foobar", the workaround works properly. Thanks!

atoulme commented 2 years ago

An update further on the workaround. Upon using the workaround, the test case above complies and there are no longer any logs generated.

However, please see this example:

receivers:
    splunk_hec:

exporters:
    logging:

processors:
    batch:
    attributes/example:
      actions:
        - key: foo
          action: upsert
          value: "bar"
    transform:
      logs:
        queries:
          - set(attributes["foo"], "foobar") where attributes["foo"] != nil and attributes["foo"] == "bar"

extensions:
    health_check:
      endpoint: 0.0.0.0:13133
    pprof:
      endpoint: :1888
    zpages:
      endpoint: :55679

service:
    extensions: [pprof, zpages, health_check]
    pipelines:
      logs:
        receivers: [splunk_hec]
        processors: [batch, attributes/example, transform]
        exporters: [logging]

In this example, we make sure that the value of attributes["foo"] is not nil. This generates logs:

string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE
string to non-string with op NE

This is because the first check is now comparing a string and nil, which is considered a violation of comparison types.

dmitryax commented 2 years ago

Ok looks like != nil is incorrect way to handle this use case. We need a key in map or map has key functionality in OTTL.

TylerHelmuth commented 2 years ago

Ya or to relax on what we consider an invalid comparison. @kentquirk what do you think?

atoulme commented 2 years ago

This is now fixed in 0.61.0. Closing.

open-telemetry / opentelemetry-collector-contrib