grafana / beyla

eBPF-based autoinstrumentation of web applications and network metrics
https://grafana.com/oss/beyla-ebpf/
Apache License 2.0

After 1 minute beyla doesn't generate webserver traces #573

Closed IPV4ISIPV5 closed 1 month ago

IPV4ISIPV5 commented 4 months ago

The problem is that Beyla stops generating traces after about a minute of operation. I have tried changing the configuration in every way, but similar problems occur practically every time. I am currently trying to auto-instrument the apache2 server as a standalone Linux process. Similar problems occurred with the test quickstart.py file. Is this a configuration problem, or rather a problem with the program itself?

OS: ubuntu 22.04.1 Linux kernel: 6.5.0-14-generic

cat /sys/kernel/security/lockdown [none] integrity confidentiality

config file

#open_port: 80
#service_name: apache2
#service_namespace: 4026531836
log_level: DEBUG
print_traces: true
prometheus_export:
  port: 8889
  path: /metrics
routes:
  patterns:
    # - /**
    - /art/**
    - /man/**
    - /woman/**
  unmatched: heuristic
internal_metrics:
  port: 8889
ebpf:
  wakeup_len: 10
  track_request_headers: false
attributes:
  instance_id:
    dns: false
    override_instance_id: apache2
discovery:
  services:
    - name: apache2
      open_ports: 80
      exe_path: /usr/sbin/apache2
  skip_go_specific_tracers: true
    # - namespace: 4026531836
otel_traces_export:
  reporters_cache_len: 1024
  sampler:
    name: "always_on"
otel_metrics_export:
  reporters_cache_len: 1024

debug log beyla_debug.log

grcevski commented 4 months ago

Thanks for reporting this @IPV4ISIPV5, this looks like a bug to me. I can't see anything obviously wrong in the Beyla output that explains why we stopped tracking the events.

I'm going to try to reproduce the problem with your configuration file here and see if anything shows up.

One thing that would help us investigate this issue is if you are able to share a log of what's happening on the eBPF side. To do this:

  1. In your Beyla configuration above, under the ebpf: section, please add bpf_debug: true (see the config sketch after this list). This will cause us to load the debug version of our BPF programs and generate logs.
  2. In another terminal window, run:
    sudo cat /sys/kernel/debug/tracing/trace_pipe

    If you can send us a section of that trace_pipe output log from after the instrumentation stops, we can see whether the eBPF code is still receiving events or something has failed. The more of the output log you send us, the better.
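
For reference, the ebpf section of the configuration file shared above would then look roughly like this (a sketch; only the bpf_debug line is new):

ebpf:
  wakeup_len: 10
  track_request_headers: false
  bpf_debug: true  # loads the debug build of the BPF programs, which log to trace_pipe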

Thanks so much in advance!

grcevski commented 4 months ago

I tried on a very similar system, Ubuntu 22.04 with a 6.5.0-14 kernel. I can't seem to reproduce the issue with a sample Apache2 server I made, nor with the Python quickstart. I ran a connection test for 10 minutes and it seems to work. I'll investigate a bit more.

I do have one question related to the requests Beyla reported in your logs. The size of the incoming requests is a bit unusual, larger than I'd have expected, e.g.:

(102.994436ms[102.994436ms]) 200 GET  [10.10.22.71]->[10.0.4.13:80] size:2057B svc=[apache2 generic] traceparent=[]
(252.548172ms[252.548172ms]) 200 GET /9-art [10.10.22.71]->[10.0.4.13:80] size:1828B svc=[apache2 generic] traceparent=[]

That output suggests the incoming GET requests had headers or a body of more than 1.5 KB. Is that expected? It's a bit large, and perhaps it's the key to reproducing this issue. My requests were tiny, with a total size of 50 B or so.

grcevski commented 4 months ago

I tried with larger request sizes and it seems to work consistently. Here's an example of my output with 300KB:

2024-01-25 12:58:21.125125821 (550.124µs[550.124µs]) 200 GET /header.html [172.17.0.1]->[172.17.0.2:80] size:343548B svc=[apache2] traceparent=[00-cfa362313250439d82b042a826191d50-0000000000000000-01]

At this point I think we'll need more information, so we'll need those BPF logs. Also, it would be good if you can verify that your apache2 process is still alive after that point.

IPV4ISIPV5 commented 4 months ago

Thank you for your reply.

Below are logs from apache2, eBPF, and Beyla: prestashop-access.log trace_pipe_part_1.log trace_pipe_part_2.log trace_pipe_part_3.log beyla_logs.txt

The logs provided are from the prestashop application with the default configuration, hence the high request sizes.

I also tried the version running in Docker; it collects traces for longer, but after a short time a similar problem occurs. docker_beyla.log

docker_beyla.yml

version: "3.8"

services:
  # Service to instrument. Change it to any
  # other container that you want to instrument.
  nextcloud:
    image: nextcloud
    container_name: nextcloud_test
    ports:
      # Exposes ports 8080 and 9000, forwarding them to container ports 80 and 8889
      - "8080:80"
      - "9000:8889"

  autoinstrumenter:
    image: grafana/beyla:latest
    pid: "service:nextcloud"
    privileged: true
    # If using the above capability fails to instrument your service, remove it
    # and uncomment the line below
    # privileged: true
    environment:
      BEYLA_PRINT_TRACES: true
      # BEYLA_OPEN_PORT: 80
      BEYLA_SYSTEM_WIDE: true
      BEYLA_BPF_WAKEUP_LEN: 1
      BEYLA_PROMETHEUS_PORT: 8889
      BEYLA_LOG_LEVEL: debug
      BEYLA_SKIP_GO_SPECIFIC_TRACERS: true
      BEYLA_DISCOVERY_POLL_INTERVAL: "500ms"

Below are logs from quickstart.py with the same problem: beyla_quickstart.log trace_pipe_quickstart.log quickstart.log

With the example Go server, Beyla worked without any problem. Unfortunately, I cannot find it on my server. Are there generally problems with programs written in languages other than Go?

grcevski commented 4 months ago

Thanks so much, this helps a ton actually.

First let me answer the question:

On the example golang server, the beyla worked without any problem. Unfortunately I cannot find it on my server. Are there generally problems with programs written in a language other than Go?

For Go we use different instrumentation, because the Go runtime is generally very deterministic and easy to instrument with eBPF. For the other languages we use Linux socket-based instrumentation, so it works differently.

I see two bugs from the logs:

  1. The quickstart example doesn't work because the HTTP client is using the older HTTP 1.0 protocol, not 1.1. We sort of assumed 1.1, and I'll make a PR to fix this.
  2. The second issue is something I thought I had fixed, but obviously not. In the Apache scenario the client is HTTP 1.1, but we somehow lose the information from the incoming connection. I have enough information to go on; I'll follow up with a fix soon, I hope.

grcevski commented 4 months ago

Hi again @IPV4ISIPV5,

I stared a bit more at the logs and I definitely see us missing events regularly, but it's not the issue I thought it was. This appears to be a VM environment; what does your setup look like? Is this VMware (I see vmtoolsd in the logs switching in and out)? I think I need to set up something similar to see what's happening.

The kernel trace logs report something like: CPU:3 [LOST 368 EVENTS]

So collection of data will be spotty, but I'm surprised it stops completely.

IPV4ISIPV5 commented 4 months ago

Hi @grcevski, yes, the virtual machine is running with compatibility for ESXi 7.0 U2 and later (VM version 19), and a second virtual machine with ESXi 6.5 and later (VM version 13). The firmware is set to BIOS.

If you need any more information, please ask; I will try to provide as much as possible. Do you also need information on the amount of CPU and memory?

grcevski commented 4 months ago

Hi @IPV4ISIPV5, I believe I tracked down what's happening and why we miss these events. It seems to have to do with the configured number of processors and whether those processors can actually be used effectively. I haven't been able to successfully reproduce it, but I think we can provide a fix.

Essentially, for instrumentation other than Go we use eBPF kprobe and kretprobe to track various events in the kernel. One such event is tcp_recvmsg, where we capture what's sent on the sockets. The kprobe on that function lets us capture the address of the buffer where the data is received, and finally on the kretprobe of tcp_recvmsg we can read what was received. So we need both events.
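
As an illustration of that pairing, here is a minimal, hypothetical libbpf-style C sketch (not Beyla's actual code; the map, the program names, and the simplified buffer handling are assumptions):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

// One in-flight tcp_recvmsg per thread: the key is pid_tgid (the "id" printed
// in the trace_pipe output below), the value is the msghdr address captured on entry.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u64);
    __type(value, __u64);
} active_recv SEC(".maps");

SEC("kprobe/tcp_recvmsg")
int BPF_KPROBE(tcp_recvmsg_enter, struct sock *sk, struct msghdr *msg)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 msg_addr = (__u64)msg;

    // Remember where the received data will land. A real tracer also has to
    // walk the msghdr iterator (iov vs. ubuf), which differs across kernels.
    bpf_map_update_elem(&active_recv, &id, &msg_addr, BPF_ANY);
    return 0;
}

SEC("kretprobe/tcp_recvmsg")
int BPF_KRETPROBE(tcp_recvmsg_exit, int copied_len)
{
    __u64 id = bpf_get_current_pid_tgid();
    __u64 *msg_addr = bpf_map_lookup_elem(&active_recv, &id);

    if (msg_addr && copied_len > 0) {
        // Here the buffer contents (e.g. "GET /qwerty ") would be read and
        // forwarded to userspace. If this kretprobe never fires (the missed
        // events discussed below), the entry stored by the kprobe is never
        // consumed and the request is lost.
    }
    bpf_map_delete_elem(&active_recv, &id);
    return 0;
}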

Now what's happening in your logs is something like this:

         === tcp_recvmsg id=107363 sock=ffff8ef1051b9200 === < kprobe event, id is the thread id
         msg iter type exists, read value 5
         msg type 0, iter type 5
         new kernel, iov doesn't exist
         ubuf base 557671e1f6f0
         === tcp_recvmsg id=107363 sock=ffff8ef1051b9200 === < kprobe event, id is the thread id
         msg iter type exists, read value 5
         msg type 0, iter type 5
         new kernel, iov doesn't exist
         ubuf base 557671e1f6f0

So we see two kprobe events on the same thread ID, but there's no kretprobe, so no reading of the buffer happens.

When it does work, we correctly see the retprobe, for example:

         === tcp_recvmsg id=107363 sock=ffff8ef11a60da00 === < kprobe
         msg iter type exists, read value 5
         msg type 0, iter type 5
         new kernel, iov doesn't exist
         ubuf base 557671e1f6f0
         === tcp_recvmsg ret id=107363 sock=ffff8ef11a60da00 copied_len 2129 === < kretprobe
         buf=[GET /qwerty ], pid=107363

After researching this a bit, I found a couple of threads discussing limitations on how many kretprobes can run at the same time. If there's high network pressure, kretprobes will be dropped; there's a maxactive count, which is auto-configured to the number of CPUs.

First link that described the exact issue we are seeing: https://github.com/iovisor/bpftrace/issues/835

The maxactive count doesn't seem to be configurable with a kernel option; the only ways to affect this behaviour through configuration are to increase the number of CPUs available to the virtual machine or to use a faster network. Holding up tcp_recvmsg for too long will result in the kretprobes being cancelled.

Fortunately, the library from Cilium that we use to install the BPF programs has added support for configuring the number of allowed active kretprobes, and it's now merged: https://github.com/cilium/ebpf/pull/755.

I'm going to see if I can make use of that setting, and if it works I'll make a PR; we'll probably make it a configurable option and document that it might be needed for applications with high network volume.

I also have an alternative solution in mind. We used to track some of the events with a socket filter, but we removed that since tracking TCP events was easier with tcp_sendmsg and tcp_recvmsg. I can bring back that code, which can serve as a fallback for missing kretprobes on tcp_recvmsg.
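
For context, a rough sketch of what such a socket-filter fallback could look like (purely illustrative; it assumes a plain Ethernet/IPv4/TCP packet with no IP options and omits how the payload would be sent to userspace):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

char LICENSE[] SEC("license") = "GPL";

// Attached to a dedicated raw socket, so the return value only controls what
// this observing socket sees; it does not affect the application's traffic.
SEC("socket")
int http_observer(struct __sk_buff *skb)
{
    struct ethhdr eth;
    struct iphdr ip;
    struct tcphdr tcp;
    char payload[8] = {};

    if (bpf_skb_load_bytes(skb, 0, &eth, sizeof(eth)) < 0 ||
        eth.h_proto != bpf_htons(ETH_P_IP))
        return 0;
    if (bpf_skb_load_bytes(skb, sizeof(eth), &ip, sizeof(ip)) < 0 ||
        ip.protocol != IPPROTO_TCP)
        return 0;

    __u32 tcp_off = sizeof(eth) + ip.ihl * 4;
    if (bpf_skb_load_bytes(skb, tcp_off, &tcp, sizeof(tcp)) < 0)
        return 0;

    // Peek at the first payload bytes; an HTTP request starts with "GET ",
    // "POST", etc. Packets observed here do not depend on the tcp_recvmsg
    // kretprobe firing, which is what makes this usable as a fallback.
    __u32 payload_off = tcp_off + tcp.doff * 4;
    bpf_skb_load_bytes(skb, payload_off, payload, sizeof(payload));

    return 0;
}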

grcevski commented 4 months ago

I opened a PR to reintroduce the socket filter as a fallback. I hope we'll be able to merge this PR on Monday. I'm still looking at increasing the number of active kretprobes.

IPV4ISIPV5 commented 4 months ago

Hi @grcevski, thank you for the extensive explanation of the cause of the problem. This is invaluable knowledge for me. If the pull request is merged today, does that mean there will already be a new Beyla release with this fix?

grcevski commented 4 months ago

Hi @IPV4ISIPV5, we just merged the attempted fix for the issue you are experiencing. Is it possible for you to verify whether this resolves the problem? We haven't cut an official release with this fix yet, because I want to make sure it actually resolves the problem. To test this, you'd need to make sure you are using beyla:main as a container, or build the executables manually. Just make sure you delete the previously cached beyla:main image so that we are sure it's the latest.

If using beyla:main is not possible, please let us know and I'll see about making a pre-release version.

IPV4ISIPV5 commented 4 months ago

At first glance, with the Docker image I could see that traces were being generated for a long time. Unfortunately, I have no experience with building an executable from source. If possible, I would ask for an already compiled version; then I can test it tomorrow and provide any information, including logs.

grcevski commented 4 months ago

Great, thanks for letting us know. I think running for a long time on Docker is enough to confirm the issue is fixed; no need to try a manually built executable, I was just providing the options. Please let me know if it stops for some reason. We'll plan to include this fix in the next release.

IPV4ISIPV5 commented 4 months ago

Hi @grcevski, the revised version is working properly for longer. However, it is still possible to notice the loss of some traces. I do not know to what extent this is intentional and to what extent it is the result of a bug. I am attaching the logs and the excerpt that I found most interesting. Is this related to the problem?

autoinstrumenter-1 | time=2024-01-30T15:06:41.957Z level=DEBUG msg="submitting traces on timeout" component=httpfltr.Tracer len=1

docker_beyla.log

grcevski commented 4 months ago

Great, thanks for sharing with us again! I'm going to check this new log and see if I notice another issue. We hit a problem with merging the second fix, which extends the number of allowed in-flight requests, because it seems to require mounting tracefs in containers, so it got complicated. Our approach for now is to provide a metric that shows if events are missing, and to log a warning. I'm hoping that we'll come up with an elegant solution to the other issue as well and make the fix more reliable.

grcevski commented 4 months ago

One immediate suggestion: perhaps switch away from using system_wide (it looks like it's enabled) and point Beyla at the server port only, 80. It will have to monitor far fewer events.
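
For example, the environment section of the docker_beyla.yml above might be adjusted roughly like this (a sketch; the rest of the service definition stays unchanged):

    environment:
      BEYLA_PRINT_TRACES: true
      BEYLA_OPEN_PORT: 80            # monitor only the instrumented server port
      # BEYLA_SYSTEM_WIDE: true      # disabled: system-wide mode processes far more events
      BEYLA_BPF_WAKEUP_LEN: 1
      BEYLA_PROMETHEUS_PORT: 8889
      BEYLA_LOG_LEVEL: debug
      BEYLA_SKIP_GO_SPECIFIC_TRACERS: true
      BEYLA_DISCOVERY_POLL_INTERVAL: "500ms"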

IPV4ISIPV5 commented 4 months ago

Hi @grcevski, after several tests I notice a significant improvement. There are still times when apache2 generates quite a lot of traffic logs while Beyla generates only a fraction of that. There was a strange error in the docker_beyla_first_attempt.txt file:

autoinstrumenter-1 | time=2024-01-31T08:48:46.136Z level=DEBUG msg="Error looking up namespace for removing PID" component=ebpfCommon.CommonPIDsFilter pid=44 error="failed to open(/proc/44/ns/pid): open /proc/44/ns/pid: no such file or directory"
autoinstrumenter-1 | time=2024-01-31T08:48:46.136Z level=DEBUG msg="Error looking up namespace for removing PID" component=ebpfCommon.CommonPIDsFilter pid=44 error="failed to open(/proc/44/ns/pid): open /proc/44/ns/pid: no such file or directory"

In the second attempt I did not find this error. My plan is to hook this Docker instance up to Tempo and Mimir to see whether there are gaps in the long run and whether this affects the statistics.

docker_beyla_first_attempt.txt docker_beyla.log

mariomac commented 1 month ago

Closing as seems to be solved. @IPV4ISIPV5 feel free to reopen it if you get still the same error.