tcoville-coveo closed this issue 8 months ago.
I also have issues with 0.92.0 behind an AWS ALB, which worked fine in all previous versions I've run (including 0.91.0); mine just fails the health checks and I don't get why.
But I do seem to get the same error message as you when I restart the service (via systemd).
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xd37fa8]
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: goroutine 318 [running]:
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHa>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0x4002b22340, 0x4>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc@v1.60.1/internal/transport/http2_server.go:1071 +0x8dc
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc.(*Server).handleStream(0x4001fb7600, {0x852a3a0, 0x4002b22340},>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc@v1.60.1/server.go:1749 +0x474
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc.(*Server).serveStreams.func2.1()
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc@v1.60.1/server.go:1016 +0x5c
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 95
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc@v1.60.1/server.go:1027 +0x138
Jan 16 19:22:53 opentelemetry-collector-338 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
I'm running this on al2023 (but I have seen the same issue on amzn2 using x86 as well).
cat /etc/os-release
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"
uname -a
Linux opentelemetry-collector-338 6.1.56-82.125.amzn2023.aarch64 #1 SMP Tue Oct 10 17:03:29 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
config:
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s
extensions:
  health_check:
  zpages:
service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
Thanks for reporting! Might be a regression in the OTel Go gRPC instrumentation. Looking into it.
cc @open-telemetry/go-instrumentation-approvers
Some findings:
- telemetrygen traces
- The diff between v1.59.0 and v1.60.0 has these two PRs which are potentially related to the instrumentation code: grpc/grpc-go/pull/6716 and grpc/grpc-go/pull/6750. 6716 looks more suspicious.
I don't know why the crash happens, but open-telemetry/opentelemetry-go-contrib/pull/4825 would prevent it and it follows existing practice in grpc-go, so I think this would at least make the issue less serious.
@tcoville-coveo @neeej can you confirm that restarting the collector w/ the following feature gate prevents the issue from occurring:
--feature-gates -telemetry.useOtelForInternalMetrics
I'm wondering if this issue is caused by the change to using otel for internal metrics, but it might not be.
@codeboten I tried what you suggested but I didn't notice any difference; the container is still crashing with the same error.
I didn't see any mention of the flag in the logs. How can I make sure that it has actually been applied?
I managed to ping zpages before the container restart and I confirm the flag is disabled
FYI I deploy via the operator, so I added the following config in the CRD:
args:
  feature-gates: "-telemetry.useOtelForInternalMetrics"
which results in the pod args generated as:
apiVersion: v1
kind: Pod
[...]
spec:
  containers:
    - name: otc-container
      args:
        - --config=/conf/collector.yaml
        - --feature-gates=-telemetry.useOtelForInternalMetrics
[...]
I also get the same error on restarts with that setting.
otelcol+ 5870 2.5 3.3 1455156 133792 ? Ssl 06:31 0:00 /usr/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelForInternalMetrics
Filed grpc/grpc-go/issues/6928
I think we should not close this issue until https://github.com/grpc/grpc-go/issues/6928 is resolved, or file another one. It sounds potentially pretty impactful. Maybe even worth putting a release-blocker label on it. @codeboten WDYT?
Hello, I replied in the gRPC-Go issue, but is there a way to reproduce this? I wrote a test locally and triaged this repo's stats handler code, and it all seems to be in good order (my test had no problem accessing context values in the server-side trailers case from WriteStatus). Here to help; hopefully we can get to the bottom of this.
@dmitryax I think it's fine to keep it open until it's been confirmed to be fixed.
@tcoville-coveo @neeej the latest dev build of contrib is available, would you be able to test it out and see if the original crash occurs? No worries if not, we just haven't been able to repro on our end. The docker tag is:
otel/opentelemetry-collector-contrib-dev:bcc7163f7a8a7c2611fe74ba23f998a7fc0889a2
@codeboten YES! The issue is fixed with the dev build 🥳 🎉
Thank you so much to all of you for prioritizing this and fixing the issue so quickly!
Thanks @mx-psi for getting to the bottom of this so quickly! @dmitryax @mx-psi did you want to leave this open until the release or mark it closed?
Thank you so much to all of you for prioritizing this and fixing the issue so quickly!
Thanks to you both for providing us with feedback and detailed information about this as well :)
While the issue is fixed at the Collector level, it would be great to also fix this at grpc-go. It would be extremely helpful if you can help us reproduce this locally @tcoville-coveo @neeej. To this end, I think the following would be helpful:
- the telemetry::metrics section of your configuration
- any non-default gRPC settings (e.g. max_recv_msg_size_mib, ...)
Anything else you can think of that would help is also welcome :)
did you want to leave this open until the release or mark it closed?
I think this issue can be marked as solved. I would still want to keep the discussion going to help the folks at grpc-go and prevent further similar issues.
I have an AWS ALB configured which has a listener on HTTPS:4317 (and a certificate connected to it). That ALB forwards to an AWS TG which is defined with port HTTP:4317 and grpc as protocol.
The TG has a health check configured to check for valid grpc codes 0-99. The AMIs launched for this via an ASG (connected to the TG) are prebuilt on al2023 with otelcol-contrib installed, and are then configured on instance boot (this is so I can scale in/out).
I use a form of this as my AWS CF template, if that might help: https://pastebin.pl/view/15d2c892
I looked a bit further at my logs, and it does crash the same way as @tcoville-coveo's. It's just systemd that keeps restarting it, so it looks like it runs as it should, except it restarts all the time.
So, I stripped my config to this, and I still see the same issue:
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp:
    endpoint: "https://otlp.domain:4317"
service:
  extensions: []
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [otlp]
      processors: []
I also tried installing the otelcol rpm instead, used the above config, and that also gave me the same result.
I tried running otelcol-contrib once with each of the options below, and got the same result then as well:
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.disableHighCardinalityMetrics"
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelForInternalMetrics"
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelWithSDKConfigurationForInternalTelemetry"
Then I installed https://github.com/rmedvedev/grpcdump and got the below output from it:
[root@opentelemetry-collector v338 eu-west-1: grpcdump]$ go run cmd/grpcdump/main.go -i ens5 -p 4317
INFO[0000] Starting sniff ethernet packets at interface ens5 on port 4317
172.27.182.41:16706 -> 172.27.182.110:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.8:38552 -> 172.27.182.110:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
ERRO[0003] Error in try to read http2: Error reading frame: unexpected EOF
172.27.182.41:16716 -> 172.27.182.110:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
ERRO[0006] Error in try to read http2: Error reading frame: unexpected EOF
172.27.182.41:3594 -> 172.27.182.110:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.110:4317 -> 172.27.182.41:3594 <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
ERRO[0022] Error in try to read http2: Error reading frame: unexpected EOF
Compared to the output I get from an instance running 0.91.0 (with the config from my first post in this issue), it looks very different:
INFO[0000] Starting sniff ethernet packets at interface ens5 on port 4317
172.27.182.8:59638 -> 172.27.182.125:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:59638 <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:49472 -> 172.27.182.125:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.41:49472 <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.8:59640 -> 172.27.182.125:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:59640 <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:49488 -> 172.27.182.125:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.41:49488 <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.125:4317 -> 172.27.182.41:49494 map[:status:200 content-type:application/grpc grpc-message: grpc-status:0]
172.27.182.8:43914 -> 172.27.182.125:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:43914 <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:5922 -> 172.27.182.125:4317 / map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
Let me know if I should test something more.
Thanks @neeej, this is helpful! cc @zasweq in case the log messages give you any hint
I confirm that the issue is fixed for me on 0.93.0 🎉
I am going to close this as done, further discussion can happen over at https://github.com/grpc/grpc-go/issues/6928. Thanks again everybody :)
Describe the bug
Collector is crashing continuously a few seconds after starting up, after upgrading to 0.92.0.
Steps to reproduce
I have several collectors running with the same version but only one is having the issue. The only difference I can see is that it is used as a gateway behind an AWS ALB. The ALB health check in gRPC polls on the / path of the receiver, and the failures seem to correlate with the probe frequency. Also noted that if I remove the ingress the error stops (along with the traffic, so not sure if this proves anything...). I also tried downgrading the relay collector to 0.91.0 and sending the traces as-is to a second collector running on 0.92.0 with the same config but without ingress, and I don't get the error, so it would tend to confirm the ALB probe theory...
What did you expect to see?
The collector should at least survive the error and log a meaningful message.
What did you see instead?
What version did you use?
otel-operator: 0.91.0, collector image: 0.92.0
What config did you use?
collector:
ingress:
Environment
otel-contrib image
Additional context
I did not manage to repro locally on my machine; the issue only seems to happen on the cluster.