Closed: aabmass closed this issue 8 months ago.
Thanks for reporting this. I am away this week, but I will take a look at this on Monday. Do you know what ports are used for the connection? Are they by any chance above 32,000? I am asking because sometimes when we initially attach we see network events, without first seeing the accept or the connect. So we try to guess if it was client or server based on the ports used, assuming the client side ephemeral port is higher.
It should be only one event like this, but if you see more then it's a different bug.
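For illustration, here is a minimal sketch of that kind of port-based guess, assuming only the two port numbers are known (the helper name and code are hypothetical, not Beyla's actual implementation):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: when we only observe traffic on an already-established
 * connection (no accept/connect seen), guess the direction from the ports.
 * Client-side ephemeral ports are normally allocated from a high range
 * (e.g. 32768-60999 on Linux), so the side with the higher port is assumed
 * to be the client. */
static bool guess_is_client(uint16_t local_port, uint16_t remote_port) {
    return local_port > remote_port;
}
```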
Thanks for the quick reply
> I am asking because sometimes when we initially attach we see network events, without first seeing the accept or the connect. So we try to guess if it was client or server based on the ports used, assuming the client side ephemeral port is higher.
Ah interesting. The server is listening on 9080, which is correctly reported in the span as `server.port: Int(9080)`. I am seeing a few spans like this, but it does subside within a few minutes.
Thanks for making that great guide; it was very easy to follow. I have been unable to reproduce it, but I think I know what the issue is and I have a fix in mind. I'm going to work on this today, but if I don't get a fix in, I should have time to finish this on Wednesday.
I made a fix attempt. I hope this resolves the issue; if it does, we can then manually close this issue. If not, I'll keep on digging.
I figured it might be tough to reproduce 🙂
> I made a fix attempt. I hope this resolves the issue; if it does, we can then manually close this issue. If not, I'll keep on digging.
That's awesome, thank you! Let me try it out from the PR and get back to you
@grcevski I built a new docker image with just `docker build -t <tag> .` from the repo root of your PR branch. Beyla is crashing with logs like:
$ kubectl logs -n beyla-repro daemonsets/beyla-agent | head -100
Found 3 pods, using pod/beyla-agent-6fznz
time=2024-03-18T19:41:59.759Z level=INFO msg="Grafana Beyla" Version=v1.4.0-7-gd01d645d Revision=d01d645d "OpenTelemetry SDK Version"=1.23.1
time=2024-03-18T19:41:59.760Z level=INFO msg="starting Beyla in Application Observability mode"
time=2024-03-18T19:41:59.870Z level=INFO msg="using hostname" component=traces.ReadDecorator function=instance_ID_hostNamePIDDecorator hostname=beyla-agent-6fznz
time=2024-03-18T19:41:59.870Z level=WARN msg="No route match patterns configured. Without route definitions Beyla will not be able to generate a low cardinality route for trace span names. For optimal experience, please define your application HTTP route patterns or enable the route 'heuristic' mode. For more information please see the documentation at: https://grafana.com/docs/beyla/latest/configure/options/#routes-decorator . If your application is only using gRPC you can ignore this warning." component=RoutesProvider
time=2024-03-18T19:41:59.873Z level=INFO msg="Starting main node" component=beyla.Instrumenter
time=2024-03-18T19:42:00.535Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/beyla pid=3192191
time=2024-03-18T19:42:05.069Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/beyla pid=3192191 error="loading and assigning BPF objects: field KprobeTcpSendmsg: program kprobe_tcp_sendmsg: load program: argument list too long: 892: (79) r1 = *(u64 *)(r7 +312): R0_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R1_w=inv(id=0) R6=inv(id=24,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7=map_value(id=0,off=0,ks=44,vs=448,imm=0) R8=inv(id=0) R9_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R10=fp0 fp-16=mmmmmmmm fp-40=????mmmm fp-48=???????m fp-56=mmmmmmmm fp-64=????mmmm fp-72=mmmmmmmm fp-80=mmmmmmmm fp-88=mmmmmmmm fp-96=mmmmmmmm fp-104=mmmmmmmm fp-112=mmmmmmmm fp-120=mmmmmmmm fp-128=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=mmmmmmmm fp-224=mmmmmmmm fp-232=mmmmmmmm fp-240=00000000 fp-248=mmmmm (truncated, 734 line(s) omitted)"
Error Log:
; int BPF_KPROBE(kprobe_tcp_sendmsg, struct sock *sk, struct msghdr *msg, size_t size) {
0: (79) r9 = *(u64 *)(r1 +96)
1: (79) r7 = *(u64 *)(r1 +104)
2: (79) r8 = *(u64 *)(r1 +112)
; u64 id = bpf_get_current_pid_tgid();
3: (85) call bpf_get_current_pid_tgid#14
4: (bf) r6 = r0
; u64 id = bpf_get_current_pid_tgid();
5: (7b) *(u64 *)(r10 -184) = r6
; u32 host_pid = id >> 32;
6: (77) r6 >>= 32
; u32 host_pid = id >> 32;
7: (63) *(u32 *)(r10 -128) = r6
; if (!filter_pids) {
8: (18) r1 = 0xffff8e2ac552eb10
10: (61) r1 = *(u32 *)(r1 +0)
R0_w=inv(id=1) R1_w=map_value(id=0,off=0,ks=4,vs=24,imm=0) R6_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7_w=inv(id=0) R8_w=inv(id=0) R9_w=inv(id=0) R10=fp0 fp-128=inv fp-184_w=mmmmmmmm
; if (!filter_pids) {
11: (15) if r1 == 0x0 goto pc+175
last_idx 11 first_idx 0
regs=2 stack=0 before 10: (61) r1 = *(u32 *)(r1 +0)
12: (bf) r2 = r10
;
(can provide more logs if needed)
Am I missing a step to rebuild the BPF programs?
Hm, that's a strange error. What's the host OS and kernel version (if Linux) of the machine where you built the docker image? I can look more into it on Wednesday, but our tests seem to pass, so it must somehow be a problem with the local host toolchain.
It's an internal Debian-based distro, kernel version 6.5.13. I can try building on a fresh VM with a vanilla OS version. Is this the right way to build a new image?
IMG_REGISTRY=us-central1-docker.pkg.dev IMG_ORG=myorg IMG_NAME=beyla-foo VERSION=pr-688 make image-build-push
That looks right to me, and it's a similar kernel to the one I use. Let me look into it.
Thanks. I tried rebuilding on a new VM with
Linux <host> 6.1.0-18-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
and I'm seeing the same issue, so I think it's unrelated to the build environment. Some observations:
$ uname -a
Linux gke-cluster-1-pool-1-6a06028c-7xb3 5.15.133+ #1 SMP Sat Dec 30 13:01:38 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
It seems I had left a debug print, from an earlier commit, inside a loop that gets unrolled. This sometimes causes issues on older kernels. I pushed an update; can you let me know if this resolves the issue?
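To illustrate what that means (a hypothetical probe, not Beyla's actual code): on kernels that require loops to be fully unrolled, a debug print inside the loop body is duplicated once per iteration, so the verified program can grow past what older verifiers will accept:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/example")
int example_probe(void *ctx) {
    /* With #pragma unroll, clang emits 8 copies of the loop body, each
     * including the bpf_printk call, multiplying the instruction count
     * the verifier has to walk through. */
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        bpf_printk("frame %d", i);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```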
I'm still seeing the same error on 797ebdb2 🙁
```
time=2024-03-20T17:56:28.215Z level=INFO msg="Grafana Beyla" Version=v1.4.0-8-g797ebdb2 Revision=797ebdb2 "OpenTelemetry SDK Version"=1.23.1
time=2024-03-20T17:56:28.216Z level=INFO msg="starting Beyla in Application Observability mode"
time=2024-03-20T17:56:28.321Z level=INFO msg="using hostname" component=traces.ReadDecorator function=instance_ID_hostNamePIDDecorator hostname=beyla-agent-6fpwc
time=2024-03-20T17:56:28.321Z level=WARN msg="No route match patterns configured. Without route definitions Beyla will not be able to generate a low cardinality route for trace span names. For optimal experience, please define your application HTTP route patterns or enable the route 'heuristic' mode. For more information please see the documentation at: https://grafana.com/docs/beyla/latest/configure/options/#routes-decorator . If your application is only using gRPC you can ignore this warning." component=RoutesProvider
time=2024-03-20T17:56:28.340Z level=INFO msg="Starting main node" component=beyla.Instrumenter
time=2024-03-20T17:56:28.835Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/python3.7 pid=3005143
time=2024-03-20T17:56:28.836Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/ruby pid=3004755
time=2024-03-20T17:56:28.837Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/node pid=3004889
time=2024-03-20T17:56:28.838Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/usr/local/bin/python3.11 pid=3005753
time=2024-03-20T17:56:28.839Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/opt/java/openjdk/bin/java pid=3005017
time=2024-03-20T17:56:28.840Z level=INFO msg="new process for already instrumented executable" component=discover.TraceAttacher pid=3005143 child=[] exec=/usr/local/bin/python3.7
time=2024-03-20T17:56:28.841Z level=INFO msg="new process for already instrumented executable" component=discover.TraceAttacher pid=3005089 child=[] exec=/opt/java/openjdk/bin/java
time=2024-03-20T17:56:28.841Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/otelcol-contrib pid=3005755
time=2024-03-20T17:56:28.927Z level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/beyla pid=3675805
time=2024-03-20T17:56:36.278Z level=ERROR msg="couldn't trace process. Stopping process tracer" component=ebpf.ProcessTracer path=/usr/local/bin/python3.7 pid=3005143 error="loading and assigning BPF objects: field KprobeTcpSendmsg: program kprobe_tcp_sendmsg: load program: argument list too long: 892: (79) r1 = *(u64 *)(r7 +312): R0_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R1_w=inv(id=0) R6=inv(id=24,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R7=map_value(id=0,off=0,ks=44,vs=448,imm=0) R8=inv(id=0) R9_w=mem(id=0,ref_obj_id=27,off=0,imm=0) R10=fp0 fp-16=mmmmmmmm fp-40=????mmmm fp-48=???????m fp-56=mmmmmmmm fp-64=????mmmm fp-72=mmmmmmmm fp-80=mmmmmmmm fp-88=mmmmmmmm fp-96=mmmmmmmm fp-104=mmmmmmmm fp-112=mmmmmmmm fp-120=mmmmmmmm fp-128=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=mmmmmmmm fp-224=mmmmmmmm fp-232=mmmmmmmm fp-240=00000000 fp-248=mmmmm (truncated, 734 line(s) omitted)"
...
```
Do you mind attaching the full log for me? It usually has a bit of useful info at the end. Thanks a ton!
Yup no problem!
I just pushed another change; it looks like we were making the program too large. I cut the loop count in half for http2 to see if that helps, but I'll need to rework the code a bit more to get back to 8 iterations. I guess our http2 test might fail now, but it should be good enough to try out this other bug.
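As a rough sketch of the trade-off (hypothetical constant and function, not the actual Beyla code): a fully-unrolled frame-scanning loop grows linearly with its iteration cap, so halving the cap roughly halves that part of the program, at the cost of not reaching frames later in the buffer.

```c
#include <stddef.h>

#define MAX_HTTP2_FRAMES 4  /* hypothetically halved from 8 to shrink the program */

/* Walk up to MAX_HTTP2_FRAMES HTTP/2 frames in buf and return the offset
 * reached. Each HTTP/2 frame starts with a 9-byte header whose first
 * 3 bytes are the big-endian payload length. */
static size_t scan_http2_frames(const unsigned char *buf, size_t len) {
    size_t offset = 0;
    for (int i = 0; i < MAX_HTTP2_FRAMES; i++) {
        if (offset + 9 > len)
            break;
        size_t frame_len = ((size_t)buf[offset] << 16) |
                           ((size_t)buf[offset + 1] << 8) |
                           (size_t)buf[offset + 2];
        offset += 9 + frame_len;
    }
    return offset;
}
```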
@grcevski that fixed the crash, and I can confirm the PR change fixed the original issue. I tried restarting the Beyla DaemonSet a few times and no longer see any client IPs in the `server.address`.
Really appreciate you working with me to solve this 😄
Thanks a bunch for testing this out, I really appreciate it! We can merge the fix as-is, but I can try the image you have to see if I can bring the loop back up to 8 iterations for searching HTTP2 frames.
I'm seeing Beyla produce spans with SpanKind `SERVER` that have the client k8s Service's ClusterIP address as the `server.address`. The server span should have the server's pod IP as the `server.address`, and the client should be the calling pod's IP. Because the service's ClusterIP is present, it makes me think it should be a client span.

I am able to reproduce this issue with the BookInfo sample app from Istio (not using Istio itself) between the `reviews-v2` (Java) and `ratings` (NodeJS) apps. The bug occurs for the first few minutes of Beyla running and doesn't recur later.

For example, this OTLP is being produced:
OTLP ResourceSpan
```
Resource SchemaURL: https://opentelemetry.io/schemas/1.19.0
Resource attributes:
     -> k8s.deployment.name: Str(reviews-v2)
     -> k8s.namespace.name: Str(beyla-repro)
     -> k8s.node.name: Str(gke-cluster-1-pool-1-6a06028c-7xb3)
     -> k8s.pod.name: Str(reviews-v2-b7dcd98fb-nzgp5)
     -> k8s.pod.start_time: Str(2024-03-12 19:38:00 +0000 UTC)
     -> k8s.pod.uid: Str(ad795a70-093d-4d95-a74b-28cd85302177)
     -> k8s.replicaset.name: Str(reviews-v2-b7dcd98fb)
     -> service.instance.id: Str(beyla-agent-dcgx9-3039053)
     -> service.name: Str(reviews-v2)
     -> service.namespace: Str(beyla-repro)
     -> telemetry.sdk.language: Str(java)
     -> telemetry.sdk.name: Str(beyla)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope github.com/grafana/beyla
...
Span #8
    Trace ID       : ea949df52a54718de0bef345ee7346ba
    Parent ID      :
    ID             : 6aada77a343ebd5e
    Name           : GET /**
    Kind           : Server
    Start time     : 2024-03-12 21:22:43.552211672 +0000 UTC
    End time       : 2024-03-12 21:22:43.553105469 +0000 UTC
    Status code    : Unset
    Status message :
Attributes:
     -> http.request.method: Str(GET)
     -> http.response.status_code: Int(200)
     -> url.path: Str(/ratings/0)
     -> client.address: Str(10.4.0.122)
     -> server.address: Str(10.8.11.125)
     -> server.port: Int(9080)
     -> http.request.body.size: Int(252)
     -> http.route: Str(/**)
```

- The `SERVER` span is reported for pod `k8s.pod.name: reviews-v2-b7dcd98fb-nzgp5`
- `client.address: 10.4.0.122` is pod `reviews-v2-b7dcd98fb-nzgp5`, which should be the server since it's a server span
- `server.address: 10.8.11.125` is not a pod but the `ratings` service's ClusterIP

It seems like Beyla is misinterpreting an outgoing HTTP request made on the reviews-v2 pod to the ratings service as a server span.
Reproducer
See my GitHub branch https://github.com/grafana/beyla/compare/main...aabmass:beyla:spankind-bug-repro. I am seeing this consistently in GKE every time I restart the Beyla DaemonSet.