grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0
1.43k stars 211 forks source link

Flow mode otelcol components do not shutdown gracefully #330

Open ptodev opened 1 year ago

ptodev commented 1 year ago

What's wrong?

During Agent shutdown, a send on closed channel panic occurs. It is seemingly because otelcol.processor.batch tries to send data to otelcol.processor.tail_sampling but the tail sampling component has stopped already:

^Cinterrupt received
ts=2023-10-19T11:24:17.989506Z level=error msg="failed to start reporter" err="context canceled"
ts=2023-10-19T11:24:17.989754Z level=info msg="component exited" component=otelcol.processor.tail_sampling.default
ts=2023-10-19T11:24:17.989828Z level=info msg="component exited" component=otelcol.auth.basic.grafana_cloud_tempo
panic: send on closed channel

I am not sure if this issue is limited to otelcol components or if it also applies to other ones.

Steps to reproduce

Run an Agent with the config file below, let some traces in, then press Ctrl+C. The error doesn't happen at every shutdown. I suppose it's only when a trace is sent to a downstream component, so there is some luck involved with reproducing it. But in my experience it happens half of the time I Ctrl+C an Agent locally.

System information

MacOS 14.0

Software version

Agent "main" branch, commit 1d06721c674b7dc8b510b75b5c

Configuration

otelcol.receiver.otlp "default" {
    grpc {
        endpoint = "0.0.0.0:4320"
    }

    output {
        traces  = [otelcol.processor.batch.default.input]
    }
}

otelcol.processor.batch "default" {
    timeout = "5s"
    send_batch_size = 100

    output {
        traces  = [otelcol.processor.tail_sampling.default.input]
    }
}

otelcol.processor.tail_sampling "default" {
  decision_wait               = "5s"
  num_traces                  = 50000
  expected_new_traces_per_sec = 0

  policy {
    name = "test-policy-1"
    type = "probabilistic"

    probabilistic {
      sampling_percentage = 10
    }
  }

  policy {
    name = "test-policy-2"
    type = "status_code"

    status_code {
      status_codes = ["ERROR"]
    }
  }

  output {
    traces = [otelcol.exporter.otlp.default.input]
  }
}

otelcol.exporter.otlp "default" {
    client {
        endpoint = "tempo-prod-06-prod-gb-south-0.grafana.net:443"
        auth     = otelcol.auth.basic.grafana_cloud_tempo.handler
    }
}

otelcol.auth.basic "grafana_cloud_tempo" {
    username = ""
    password = ""
}

Logs

└─▪ ./build/grafana-agent-flow run /Users/paulintodev/Desktop/otel_test.river
ts=2023-10-19T11:22:45.109611Z level=info "boringcrypto enabled"=false
ts=2023-10-19T11:22:45.114399Z level=info msg="starting complete graph evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8
ts=2023-10-19T11:22:45.114416Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=otel duration=625ns
ts=2023-10-19T11:22:45.113907Z level=info msg="running usage stats reporter"
ts=2023-10-19T11:22:45.115107Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=otelcol.auth.basic.grafana_cloud_tempo duration=686.209µs
ts=2023-10-19T11:22:45.116562Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=otelcol.exporter.otlp.default duration=1.442083ms
ts=2023-10-19T11:22:45.119437Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=otelcol.processor.tail_sampling.default duration=2.865458ms
ts=2023-10-19T11:22:45.120004Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=otelcol.processor.batch.default duration=558.166µs
ts=2023-10-19T11:22:45.120167Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=otelcol.receiver.otlp.default duration=157.083µs
ts=2023-10-19T11:22:45.120179Z level=info msg="applying non-TLS config to HTTP server" service=http
ts=2023-10-19T11:22:45.120182Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=http duration=9.625µs
ts=2023-10-19T11:22:45.120189Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=ui duration=3.084µs
ts=2023-10-19T11:22:45.120196Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=cluster duration=3µs
ts=2023-10-19T11:22:45.120206Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=tracing duration=5.834µs
ts=2023-10-19T11:22:45.120213Z level=info msg="finished node evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 node_id=logging duration=3µs
ts=2023-10-19T11:22:45.120217Z level=info msg="finished complete graph evaluation" controller_id="" trace_id=66c82249ab80a585f4673630c254fec8 duration=5.935833ms
ts=2023-10-19T11:22:45.120247Z level=info msg="scheduling loaded components and services"
ts=2023-10-19T11:22:45.121121Z level=info msg="starting cluster node" peers="" advertise_addr=127.0.0.1:12345
ts=2023-10-19T11:22:45.122351Z level=info msg="peers changed" new_peers=192.168.1.101
ts=2023-10-19T11:22:45.122798Z level=info msg="finished node evaluation" controller_id="" node_id=otelcol.processor.batch.default duration=2.505083ms
ts=2023-10-19T11:22:45.12287Z level=info msg="finished node evaluation" controller_id="" node_id=otelcol.processor.tail_sampling.default duration=2.591ms
ts=2023-10-19T11:22:45.122935Z level=info msg="finished node evaluation" controller_id="" node_id=otelcol.exporter.otlp.default duration=2.66275ms
ts=2023-10-19T11:22:45.122948Z level=info msg="finished node evaluation" controller_id="" node_id=otelcol.receiver.otlp.default duration=137.458µs
ts=2023-10-19T11:22:45.123691Z level=info msg="now listening for http traffic" service=http addr=127.0.0.1:12345
ts=2023-10-19T11:22:45.123889Z level=warn msg="Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks" component=otelcol.receiver.otlp.default documentation=https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks
ts=2023-10-19T11:22:45.124956Z level=info msg="Starting GRPC server" component=otelcol.receiver.otlp.default endpoint=0.0.0.0:4320
^Cinterrupt received
ts=2023-10-19T11:24:17.989506Z level=error msg="failed to start reporter" err="context canceled"
ts=2023-10-19T11:24:17.989754Z level=info msg="component exited" component=otelcol.processor.tail_sampling.default
ts=2023-10-19T11:24:17.989828Z level=info msg="component exited" component=otelcol.auth.basic.grafana_cloud_tempo
panic: send on closed channel

goroutine 47 [running]:
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/idbatcher.(*batcher).AddToCurrentBatch(0x14001b32fa0?, {0xe9, 0x44, 0xc4, 0xe8, 0xfc, 0xe4, 0x6d, 0x53, 0x93, ...})
    /Users/paulintodev/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.85.0/internal/idbatcher/id_batcher.go:107 +0x24
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).processTraces(0x140028b86e0, {0x104eee7ac?})
    /Users/paulintodev/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.85.0/processor.go:357 +0x274
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).ConsumeTraces(0x1194803f8?, {0x1125c4f18?, 0x18?}, {0x11079b0c0?})
    /Users/paulintodev/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor@v0.85.0/processor.go:305 +0x48
github.com/grafana/agent/component/otelcol/internal/lazyconsumer.(*Consumer).ConsumeTraces(0x14002880de0, {0x10c985e30, 0x140029a6120}, {0x3?})
    /Users/paulintodev/Documents/GitHub/agent/component/otelcol/internal/lazyconsumer/lazyconsumer.go:67 +0x130
go.opentelemetry.io/collector/processor/batchprocessor.(*batchTraces).export(0x140029a6150, {0x10c985e30, 0x140029a6120}, 0x104ee7c3c?, 0x0)
    /Users/paulintodev/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.85.0/batch_processor.go:407 +0x120
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems(0x14002898d00, 0x14002b23f2c?)
    /Users/paulintodev/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.85.0/batch_processor.go:256 +0x54
go.opentelemetry.io/collector/processor/batchprocessor.(*shard).start(0x14002898d00)
    /Users/paulintodev/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.85.0/batch_processor.go:208 +0x1e0
created by go.opentelemetry.io/collector/processor/batchprocessor.(*batchProcessor).newShard in goroutine 1
    /Users/paulintodev/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.85.0/batch_processor.go:160 +0x19c
rfratto commented 7 months ago

Hi there :wave:

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spirital successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)