open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Collector crashing with `invalid memory address or nil pointer dereference` on collector exposing a `GRPC` endpoint behind an AWS ALB #9296

Closed: tcoville-coveo closed this issue 8 months ago

tcoville-coveo commented 9 months ago

Describe the bug: The collector crashes continuously a few seconds after starting up, after upgrading to 0.92.0.

Steps to reproduce: I have several collectors running the same version, but only one has the issue. The only difference I can see is that it is used as a gateway behind an AWS ALB. The ALB gRPC health check polls the / path of the receiver, and the failures seem to correlate with the probe frequency. I also noted that if I remove the ingress the errors stop (along with the traffic, so I am not sure that proves anything). Finally, I tried downgrading the relay collector to 0.91.0 and sending the traces as-is to a second collector running 0.92.0 with the same config but without an ingress, and I don't get the error there, which tends to confirm the ALB probe theory.

What did you expect to see? The collector should at least survive the error and log a meaningful message

What did you see instead?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1235929]

goroutine 167 [running]:
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).handleRPC(0xc002a0e9c0, {0x902ac80, 0xc002aaa510}, {0x8ff2fd0?, 0xc0020fba88?})
    go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.46.1/stats_handler.go:144 +0xa9
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHandler).HandleRPC(0xc00232a340?, {0x902ac80?, 0xc002aaa510?}, {0x8ff2fd0?, 0xc0020fba88?})
    go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.46.1/stats_handler.go:88 +0x2a
google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0xc00232a340, 0xc002a0c480, 0xc002b3c340)
    google.golang.org/grpc@v1.60.1/internal/transport/http2_server.go:1071 +0xaf2
google.golang.org/grpc.(*Server).handleStream(0xc002858c00, {0x905cfe0, 0xc00232a340}, 0xc002a0c480)
    google.golang.org/grpc@v1.60.1/server.go:1749 +0x575
google.golang.org/grpc.(*Server).serveStreams.func2.1()
    google.golang.org/grpc@v1.60.1/server.go:1016 +0x59
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 166
    google.golang.org/grpc@v1.60.1/server.go:1027 +0x115

What version did you use? otel-operator: 0.91.0, collector image: 0.92.0

What config did you use?

collector:

"exporters":
  "logging": null
  "otlp/central":
    "endpoint": "https://xxxxxxxxxxxxxx"
"extensions":
  "health_check": {}
  "zpages": {}
"processors":
  "batch":
    "send_batch_max_size": 15000
    "send_batch_size": 7500
    "timeout": "200ms"
  "memory_limiter":
    "check_interval": "1s"
    "limit_percentage": 75
    "spike_limit_percentage": 15
  "transform":
    "metric_statements":
    - "context": "metric"
      "statements":
      - "set(resource.attributes[\"meta.collector\"], \"relay-collector\")"
    "trace_statements":
    - "context": "spanevent"
      "statements":
      - "set(resource.attributes[\"meta.collector\"], \"relay-collector\")"
    - "context": "span"
      "statements":
      - "set(resource.attributes[\"meta.collector\"], \"relay-collector\")"
"receivers":
  "otlp":
    "protocols":
      "grpc":
        "endpoint": "0.0.0.0:4317"
        "max_recv_msg_size_mib": 15
        "tls":
          "cert_file": "/certificate/tls.crt"
          "key_file": "/certificate/tls.key"
"service":
  "extensions":
  - "health_check"
  - "zpages"
  "pipelines":
    "traces":
      "exporters":
      - "otlp/central"
      "processors":
      - "memory_limiter"
      - "transform"
      - "batch"
      "receivers":
      - "otlp"
  "telemetry":
    "logs":
      "encoding": "json"
      "level": "info"
    "metrics":
      "address": ":8888"

ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: relay-collector
  namespace: '${config.namespace}'
  annotations:
    "alb.ingress.kubernetes.io/backend-protocol": "HTTPS"
    "alb.ingress.kubernetes.io/backend-protocol-version": "GRPC"
    "alb.ingress.kubernetes.io/target-type": "ip"
    "alb.ingress.kubernetes.io/healthcheck-path": "/"
    "alb.ingress.kubernetes.io/healthcheck-protocol": "HTTPS"
    "alb.ingress.kubernetes.io/healthcheck-interval-seconds": "5"
    "alb.ingress.kubernetes.io/healthcheck-timeout-seconds": "4"
    "alb.ingress.kubernetes.io/unhealthy-threshold-count": "3"
    "alb.ingress.kubernetes.io/success-codes": "0-99"
    "alb.ingress.kubernetes.io/ssl-policy": "ELBSecurityPolicy-TLS-1-2-2017-01"
    "alb.ingress.kubernetes.io/inbound-cidrs": "${config.inbound_cidrs}"
    "alb.ingress.kubernetes.io/certificate-arn": "${config.certificate}"
    "alb.ingress.kubernetes.io/listen-ports": "[{\"HTTPS\": 443 }]"
    "alb.ingress.kubernetes.io/scheme": "internal"
spec:
  ingressClassName: albv2
  rules:
    - host: ${config.hostname}
      http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: relay-collector
                port:
                  name: otlp-grpc

Environment: otel-contrib image

Additional context: I did not manage to reproduce this locally on my machine; the issue only seems to happen on the cluster.

neeej commented 9 months ago

I also have issues with 0.92.0 behind an AWS ALB, which worked fine in all previous versions I've run (including 0.91.0); mine just fails the health checks, and I don't understand why.

But I do seem to get the same error message as you, when I restart the service (via systemd).

Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xd37fa8]
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: goroutine 318 [running]:
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHa>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0x4002b22340, 0x4>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/internal/transport/http2_server.go:1071 +0x8dc
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc.(*Server).handleStream(0x4001fb7600, {0x852a3a0, 0x4002b22340},>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/server.go:1749 +0x474
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc.(*Server).serveStreams.func2.1()
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/server.go:1016 +0x5c
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 95
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/server.go:1027 +0x138
Jan 16 19:22:53 opentelemetry-collector-338 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

I'm running this on al2023 (but I have seen the same issue on amzn2 using x86 as well).

cat /etc/os-release

NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"

uname -a

Linux opentelemetry-collector-338 6.1.56-82.125.amzn2023.aarch64 #1 SMP Tue Oct 10 17:03:29 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]

dmitryax commented 9 months ago

Thanks for reporting! This might be a regression in the OTel Go gRPC instrumentation. Looking into it.

cc @open-telemetry/go-instrumentation-approvers

mx-psi commented 9 months ago

Some findings:

mx-psi commented 9 months ago

The diff between v1.59.0 and v1.60.0 includes two PRs that are potentially related to the instrumentation code: grpc/grpc-go/pull/6716 and grpc/grpc-go/pull/6750. 6716 looks more suspicious.

mx-psi commented 9 months ago

I don't know why the crash happens, but open-telemetry/opentelemetry-go-contrib/pull/4825 would prevent it, and it follows existing practice in grpc-go, so I think it would at least make the issue less severe.
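
To illustrate the pattern the nil check guards against (a sketch only, not the actual otelgrpc source; the type and field names below are made up): a stats handler that assumes TagRPC has already stored its per-RPC state in the context will hit exactly this kind of nil pointer dereference when HandleRPC is invoked with a context that is missing that state.

package sketch

import (
	"context"

	"google.golang.org/grpc/stats"
)

// rpcState stands in for the per-RPC bookkeeping that TagRPC would normally
// place in the context.
type rpcState struct{ events int64 }

type rpcStateKey struct{}

type sketchHandler struct{}

func (sketchHandler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return context.WithValue(ctx, rpcStateKey{}, &rpcState{})
}

func (sketchHandler) HandleRPC(ctx context.Context, rs stats.RPCStats) {
	state, _ := ctx.Value(rpcStateKey{}).(*rpcState) // nil if TagRPC never ran for this RPC
	switch rs.(type) {
	case *stats.OutTrailer, *stats.End: // reported from WriteStatus, as in the panic trace
		state.events++ // panics when state is nil; the fix is to return early on nil instead
	}
}

func (sketchHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context { return ctx }

func (sketchHandler) HandleConn(context.Context, stats.ConnStats) {}

// Compile-time check that the sketch satisfies grpc's stats.Handler interface.
var _ stats.Handler = sketchHandler{}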

codeboten commented 9 months ago

@tcoville-coveo @neeej can you confirm whether restarting the collector with the following feature gate prevents the issue from occurring:

--feature-gates -telemetry.useOtelForInternalMetrics

I'm wondering if this issue is caused by the change to using OTel for internal metrics, but it might not be.

tcoville-coveo commented 9 months ago

@codeboten I tried what you suggested, but I didn't notice any difference; the container is still crashing with the same error. I didn't see any mention of the flag in the logs, so how can I make sure it has actually been applied? I managed to reach zpages before the container restarted, and I can confirm the flag is disabled:

[screenshot: zpages feature gate page showing telemetry.useOtelForInternalMetrics disabled]

FYI, I deploy via the operator, so I added the following config in the CRD:

  args:
    feature-gates: "-telemetry.useOtelForInternalMetrics"

which results in the pod args generated as:

apiVersion: v1
kind: Pod
[...]
spec:
  containers:
  - name: otc-container
    args:
    - --config=/conf/collector.yaml
    - --feature-gates=-telemetry.useOtelForInternalMetrics
[...]
neeej commented 9 months ago

I also get the same error on restarts with that setting.

otelcol+    5870  2.5  3.3 1455156 133792 ?      Ssl  06:31   0:00 /usr/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelForInternalMetrics

mx-psi commented 9 months ago

Filed grpc/grpc-go/issues/6928

dmitryax commented 9 months ago

I think we should not close this issue until https://github.com/grpc/grpc-go/issues/6928 is resolved, or we should file another one to track it. It sounds potentially pretty impactful; maybe it's even worth adding a release-blocker label. @codeboten WDYT?

zasweq commented 9 months ago

Hello, I replied to the issue in gRPC-Go, but is there a way to reproduce this? I wrote a test locally and triaged this repo's stats handler code, and it all seems to be in good order (my test had no problem accessing context values in the server-side trailers case from WriteStatus). I'm here to help; hopefully we can get to the bottom of this.
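
For anyone following along, the wiring involved is roughly the following (a sketch assuming the collector attaches the otelgrpc server stats handler to its OTLP gRPC server, which is what the stack trace above points at; the function name and address are made up):

package sketch

import (
	"net"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"google.golang.org/grpc"
)

// serveWithOtelStatsHandler starts a bare gRPC server with the otelgrpc
// server stats handler attached, similar to what the collector's gRPC
// receiver does (versions as in the panic: otelgrpc v0.46.x, grpc-go v1.60.x).
func serveWithOtelStatsHandler(addr string) error {
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	srv := grpc.NewServer(grpc.StatsHandler(otelgrpc.NewServerHandler()))
	// No services are registered on purpose: the crash path runs before any
	// registered method is resolved, when the server writes the
	// "malformed method name" status for a POST to "/".
	return srv.Serve(lis)
}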

codeboten commented 9 months ago

@dmitryax I think it's fine to keep it open until it's been confirmed to be fixed.

codeboten commented 9 months ago

@tcoville-coveo @neeej the latest dev build of contrib is available; would you be able to test it and see if the original crash still occurs? No worries if not, we just haven't been able to reproduce it on our end. The Docker tag is:

otel/opentelemetry-collector-contrib-dev:bcc7163f7a8a7c2611fe74ba23f998a7fc0889a2

tcoville-coveo commented 9 months ago

@codeboten YES! The issue is fixed with the dev build 🥳 🎉 Thank you so much to all of you for prioritizing this and fixing it so quickly!

codeboten commented 9 months ago

Thanks @mx-psi for getting to the bottom of this so quickly! @dmitryax @mx-psi did you want to leave this open until the release or mark it closed?

mx-psi commented 9 months ago

> Thank you so much to all of you for prioritizing this and fixing it so quickly!

Thanks to you both for providing us with feedback and detailed information about this as well :)

While the issue is fixed at the Collector level, it would be great to also fix this in grpc-go. It would be extremely helpful if you could help us reproduce this locally, @tcoville-coveo @neeej. To this end, I think the following would be helpful:

Anything else you can think of that would help is also welcome :)

> did you want to leave this open until the release or mark it closed?

I think this issue can be marked as solved. I would still want to keep the discussion going to help the folks at grpc-go and prevent further similar issues.

neeej commented 9 months ago

I have an AWS ALB configured with a listener on HTTPS:4317 (and a certificate attached to it). The ALB forwards to an AWS target group (TG) defined with port HTTP:4317 and gRPC as the protocol.

The TG has a health check configured to accept gRPC codes 0-99. The AMIs launched for this via an ASG (connected to the TG) are prebuilt on al2023 with otelcol-contrib installed and are then configured at instance boot (this is so I can scale in/out).

I use a form of this as my AWS CF template, if that might help: https://pastebin.pl/view/15d2c892

I looked a bit further at my logs, and it does crash the same way as @tcoville-coveo's. It's just systemd that keeps it running, so it looks like it runs as it should, except that it restarts all the time.

So, I stripped my config to this, and I still see the same issue:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: "https://otlp.domain:4317"

service:
  extensions: []
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [otlp]
      processors: []

I also tried installing the otelcol RPM instead with the above config, and that gave me the same result.

I tried running otelcol-contrib once with each of the options below, and got the same result every time:

OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.disableHighCardinalityMetrics"
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelForInternalMetrics"
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelWithSDKConfigurationForInternalTelemetry"

Then I installed https://github.com/rmedvedev/grpcdump and got the below output from it:

TEST [root@opentelemetry-collector v338 eu-west-1: grpcdump]$ go run cmd/grpcdump/main.go -i ens5 -p 4317
INFO[0000] Starting sniff ethernet packets at interface ens5 on port 4317
172.27.182.41:16706 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.8:38552 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
ERRO[0003] Error in try to read http2: Error reading frame: unexpected EOF
172.27.182.41:16716 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
ERRO[0006] Error in try to read http2: Error reading frame: unexpected EOF
172.27.182.41:3594 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.110:4317 -> 172.27.182.41:3594  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
ERRO[0022] Error in try to read http2: Error reading frame: unexpected EOF

Compared to the output I get from an instance running 0.91.0 (with the config from my first post in this issue), it looks very different:

INFO[0000] Starting sniff ethernet packets at interface ens5 on port 4317
172.27.182.8:59638 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:59638  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:49472 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.41:49472  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.8:59640 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:59640  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:49488 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.41:49488  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.125:4317 -> 172.27.182.41:49494   map[:status:200 content-type:application/grpc grpc-message: grpc-status:0]
172.27.182.8:43914 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:43914  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:5922 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]

Let me know if I should test something more.
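
If it helps with reproducing outside AWS, something like the following should stand in for the ALB probe (a sketch I haven't run in this exact form; it only mimics the captured headers above and assumes a plaintext gRPC receiver on 127.0.0.1:4317, so a TLS-terminating receiver needs an adjusted dialer):

package main

// Send a raw HTTP/2 (h2c) POST to the path "/" with a gRPC content type,
// matching the health-check request seen in the grpcdump capture.
import (
	"bytes"
	"crypto/tls"
	"fmt"
	"net"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	tr := &http2.Transport{
		AllowHTTP: true, // allow "http://" URLs over HTTP/2
		DialTLS: func(network, addr string, _ *tls.Config) (net.Conn, error) {
			return net.Dial(network, addr) // plain TCP, no TLS (h2c)
		},
	}
	client := &http.Client{Transport: tr}

	// content-length:5 in the capture is an empty gRPC message frame:
	// 1 compression-flag byte + 4 length bytes, all zero.
	req, err := http.NewRequest(http.MethodPost, "http://127.0.0.1:4317/", bytes.NewReader(make([]byte, 5)))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/grpc")
	req.Header.Set("TE", "trailers")
	req.Header.Set("User-Agent", "ELB-HealthChecker/2.0")

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("http status:", resp.StatusCode)
	fmt.Println("grpc-status:", resp.Header.Get("Grpc-Status"), resp.Header.Get("Grpc-Message"))
}

On 0.91.0 I would expect the trailers-only grpc-status 12 response shown in the second capture; on a broken 0.92.0 build the server side should panic instead.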

mx-psi commented 9 months ago

Thanks @neeej, this is helpful! cc @zasweq in case the log messages give you any hint

tcoville-coveo commented 8 months ago

I confirm that the issue is fixed for me on 0.93.0 🎉

mx-psi commented 8 months ago

I am going to close this as done; further discussion can happen over at https://github.com/grpc/grpc-go/issues/6928. Thanks again, everybody :)