open-telemetry / opentelemetry-network

eBPF Collector
https://opentelemetry.io
Apache License 2.0

Failed to compile eBPF code for the Linux distro 'debian' running kernel version 6.5.0-1018-aws. #264

Open ccoqueiro opened 4 months ago

ccoqueiro commented 4 months ago

What happened?

Description

After installing the eBPF collector, the kernel collector pod stays in the Running state but emits the following error:

2024-04-25 17:47:50.398732+00:00 debug [p:28721 t:28721] TCPChannel::connect: Connecting to intake @ opentelemetry-ebpf-reducer:7000
In file included from ../../../src/collector/kernel/bpf_src/render_bpf.c:39:
In file included from include/net/tcp.h:35:
In file included from include/net/sock_reuseport.h:5:
In file included from include/linux/filter.h:9:
include/linux/bpf.h:321:10: error: invalid application of 'sizeof' to an incomplete type 'struct bpf_rb_root'
                return sizeof(struct bpf_rb_root);
                       ^     ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:321:24: note: forward declaration of 'struct bpf_rb_root'
                return sizeof(struct bpf_rb_root);
                                     ^
include/linux/bpf.h:323:10: error: invalid application of 'sizeof' to an incomplete type 'struct bpf_rb_node'
                return sizeof(struct bpf_rb_node);

Note that the following command was run before installation:

sudo apt-get install --yes linux-headers-$(uname -r)

Kernel version: Linux show-no-config-i-05bbcdabc7509e781 6.5.0-1018-aws #18~22.04.1-Ubuntu SMP Fri Apr 5 17:44:33 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Steps to Reproduce

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update open-telemetry
helm install my-opentelemetry-ebpf -f ./otel-ebpf-values.yaml open-telemetry/opentelemetry-ebpf
check logs of kernel collector pod

Expected Result

transmission of metrics

Actual Result

Errors in data collection.

eBPF Collector version

latest

Environment information

Environment

Kernel version: Linux show-no-config-i-05bbcdabc7509e781 6.5.0-1018-aws #18~22.04.1-Ubuntu SMP Fri Apr 5 17:44:33 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

eBPF Collector configuration

# Default values for opentelemetry-ebpf.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

nameOverride: ""
fullnameOverride: ""
clusterName: "demohebnpm"

image:
  tag: ""
  registry: otel
  pullPolicy: IfNotPresent

imagePullSecrets: []

resources: {}

# OTLP gRPC endpoint to send the collected metrics
endpoint:
  address: "0.0.0.0"
  port: 4317

log:
  console: true
  # possible values: { error | warning | info | debug | trace }
  level: debug

debug:
  enabled: true
  storeMinidump: false
  sendUnplannedExitMetric: false

kernelCollector:
  enabled: true
  serviceAccount:
    create: true
    name: ""
  image:
    registry: ""
    tag: ""
    name: opentelemetry-ebpf-kernel-collector

  nodeSelector: {}
  disableHttpMetrics: false

  tolerations:
    - operator: "Exists"
      effect: "NoExecute"
    - operator: "Exists"
      effect: "NoSchedule"

  affinity: {}
  resources: {}

  # uncomment the line below to disable automatic kernel headers fetching
  fetchKernelHeaders: true

  # uncomment to enable enrichment using Docker metadata
  useDockerMetadata: true

  # uncomment to enable enrichment using Nomad metadata (https://www.nomadproject.io/)
  collectNomadMetadata: true

cloudCollector:
  enabled: false
  image:
    registry: ""
    tag: ""
    name: opentelemetry-ebpf-cloud-collector

  serviceAccount:
    create: true
    name: ""
    annotations: {}
      ## eks.amazonaws.com/role-arn: "role-arn-name"

  tolerations: []
  affinity: {}

k8sCollector:
  enabled: true
  serviceAccount:
    create: true
    name: ""
  relay:
    image:
      registry: ""
      tag: ""
      name: opentelemetry-ebpf-k8s-relay
  watcher:
    image:
      registry: ""
      tag: ""
      name: opentelemetry-ebpf-k8s-watcher

  tolerations: []
  affinity: {}

reducer:
  image:
    registry: ""
    tag: ""
    name: opentelemetry-ebpf-reducer
  extraArgs: {}
  ingestShards: 1
  matchingShards: 1
  aggregationShards: 1
  disableInternalMetrics: true
  disableMetrics: []
    ### to disable an entire metric category: ###
    # - tcp.all
    # - udp.all
    # - dns.all
    # - http.all
    ### to disable an individual metric: ###
    ### tcp ###
    # - tcp.bytes
    # - tcp.rtt.num_measurements
    # - tcp.active
    # - tcp.rtt.average
    # - tcp.packets
    # - tcp.retrans
    # - tcp.syn_timeouts
    # - tcp.new_sockets
    # - tcp.resets
    ### udp ###
    # - udp.bytes
    # - udp.packets
    # - udp.active
    # - udp.drops
    ### dns ###
    # - dns.client.duration.average
    # - dns.server.duration.average
    # - dns.active_sockets
    # - dns.responses
    # - dns.timeouts
    ### http ##
    # - http.client.duration.average
    # - http.server.duration.average
    # - http.active_sockets
    # - http.status_code
    ### ebpf_net ##
    # - ebpf_net.span_utilization_fraction
    # - ebpf_net.pipeline_metric_bytes_discarded
    # - ebpf_net.codetiming_min_ns
    # - ebpf_net.entrypoint_info
    # - ebpf_net.otlp_grpc.requests_sent
    # - ebpf_net.connections
    # - ebpf_net.rpc_queue_elem_utilization_fraction
    # - ebpf_net.disconnects
    # - ebpf_net.codetiming_avg_ns
    # - ebpf_net.client_handle_pool
    # - ebpf_net.otlp_grpc.successful_requests
    # - ebpf_net.span_utilization
    # - ebpf_net.up
    # - ebpf_net.rpc_queue_buf_utilization_fraction
    # - ebpf_net.collector_log_count
    # - ebpf_net.time_since_last_message_ns
    # - ebpf_net.bpf_log
    # - ebpf_net.codetiming_count
    # - ebpf_net.message
    # - ebpf_net.otlp_grpc.bytes_sent
    # - ebpf_net.pipeline_message_error
    # - ebpf_net.pipeline_metric_bytes_written
    # - ebpf_net.codetiming_max_ns
    # - ebpf_net.codetiming_sum_ns
    # - ebpf_net.otlp_grpc.failed_requests
    # - ebpf_net.rpc_queue_buf_utilization
    ### to enable all metrics (including metrics turned off by default): ###
    # - none
  enableMetrics: []
    ### Disable metrics flag is evaluated first and only then enable metric flag is evaluated. ###
    ### to enable an entire metric category: ###
    # - tcp.all
    # - udp.all
    # - dns.all
    # - http.all
    # - ebpf_net.all
    ### to enable an individual metric: ###
    ### tcp ###
    # - tcp.bytes
    # - tcp.rtt.num_measurements
    # - tcp.active
    # - tcp.rtt.average
    # - tcp.packets
    # - tcp.retrans
    # - tcp.syn_timeouts
    # - tcp.new_sockets
    # - tcp.resets
    ### udp ###
    # - udp.bytes
    # - udp.packets
    # - udp.active
    # - udp.drops
    ### dns ###
    # - dns.client.duration.average
    # - dns.server.duration.average
    # - dns.active_sockets
    # - dns.responses
    # - dns.timeouts
    ### http ###
    # - http.client.duration.average
    # - http.server.duration.average
    # - http.active_sockets
    # - http.status_code
    ### ebpf_net ###
    # - ebpf_net.span_utilization_fraction
    # - ebpf_net.pipeline_metric_bytes_discarded
    # - ebpf_net.codetiming_min_ns
    # - ebpf_net.entrypoint_info
    # - ebpf_net.otlp_grpc.requests_sent
    # - ebpf_net.connections
    # - ebpf_net.rpc_queue_elem_utilization_fraction
    # - ebpf_net.disconnects
    # - ebpf_net.codetiming_avg_ns
    # - ebpf_net.client_handle_pool
    # - ebpf_net.otlp_grpc.successful_requests
    # - ebpf_net.span_utilization
    # - ebpf_net.up
    # - ebpf_net.rpc_queue_buf_utilization_fraction
    # - ebpf_net.collector_log_count
    # - ebpf_net.time_since_last_message_ns
    # - ebpf_net.bpf_log
    # - ebpf_net.codetiming_count
    # - ebpf_net.message
    # - ebpf_net.otlp_grpc.bytes_sent
    # - ebpf_net.pipeline_message_error
    # - ebpf_net.pipeline_metric_bytes_written
    # - ebpf_net.codetiming_max_ns
    # - ebpf_net.span_utilization_max
    # - ebpf_net.client_handle_pool_fraction
    # - ebpf_net.span_utilization_fraction
    # - ebpf_net.rpc_latency_ns
    # - ebpf_net.agg_root_truncation
    # - ebpf_net.clock_offset_ns
    # - ebpf_net.otlp_grpc.metrics_sent
    # - ebpf_net.otlp_grpc.unknown_response_tags
    # - ebpf_net.collector_health
    # - ebpf_net.codetiming_sum_ns
    # - ebpf_net.otlp_grpc.failed_requests
    # - ebpf_net.rpc_queue_buf_utilization

  resources: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  service:
    type: ClusterIP
    ports:
      telemetry:
        enabled: true
        servicePort: 7000
        containerPort: 7000
        targetPort: 7000
        protocol: TCP
        appProtocol: http
      stats:
        enabled: true
        servicePort: 7001
        containerPort: 7001
        targetPort: 7001
        protocol: TCP
        appProtocol: http

rbac:
  create: true

Log output

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/srv
SHLVL=0
SSL_CERT_DIR=/etc/ssl/certs
_=/usr/bin/env
===========================================================
resolving kernel headers...
cleaning up stale kprobes...
launching kernel collector...
+ exec /srv/kernel-collector --host-distro debian --kernel-headers-source pre_installed --config-file=/etc/network-explorer/config.yaml --force-docker-metadata --log-console --debug
2024-04-25 17:47:41.000682+00:00 debug [p:28721 t:28721] setting up breakpad...
2024-04-25 17:47:41.000794+00:00 debug [p:28721 t:28721] setting up breakpad...
2024-04-25 17:47:41.000909+00:00 info [p:28721 t:28721] Starting Kernel Collector version 0.10.0 (release)
2024-04-25 17:47:41.000921+00:00 info [p:28721 t:28721] Kernel Collector agent ID is FAIDIN4D2V25Q0YAXWQK8F1QSLM5FG688BQB
2024-04-25 17:47:41.000925+00:00 info [p:28721 t:28721] Running on:
   sysname: Linux
  nodename: show-no-config-i-05bbcdabc7509e781
   release: 6.5.0-1018-aws
   version: #18~22.04.1-Ubuntu SMP Fri Apr  5 17:44:33 UTC 2024
   machine: x86_64
2024-04-25 17:47:41.000947+00:00 info [p:28721 t:28721] HTTP Metrics: Enabled
2024-04-25 17:47:41.000949+00:00 info [p:28721 t:28721] Socket stats interval in seconds: 10
2024-04-25 17:47:41.000950+00:00 info [p:28721 t:28721] Userland TCP: Disabled
2024-04-25 17:47:41.007377+00:00 debug [p:28721 t:28721] Unable to fetch AWS metadata: no metadata returned by AWS
2024-04-25 17:47:41.019944+00:00 debug [p:28721 t:28721] Unable to fetch GCP metadata: error while fetching Google Cloud Platform instance metadata: Could not resolve host: metadata.google.internal
2024-04-25 17:47:41.019960+00:00 debug [p:28721 t:28721] Unable to fetch Nomad metadata - environment variables not found
2024-04-25 17:47:41.019970+00:00 info [p:28721 t:28721] Kernel Collector version 0.10.0 (release) started on host show-no-config-i-05bbcdabc7509e781
2024-04-25 17:47:41.020086+00:00 info [p:28721 t:28721] Node label has been set in config: 'environment':'demohebnpm'
2024-04-25 17:47:41.047126+00:00 debug [p:28721 t:28721] intake record file: ``
2024-04-25 17:47:41.047191+00:00 debug [p:28721 t:28721] starting event loop...
2024-04-25 17:47:50.398714+00:00 info [p:28721 t:28721] connecting to opentelemetry-ebpf-reducer:7000 (binary)...
2024-04-25 17:47:50.398732+00:00 debug [p:28721 t:28721] TCPChannel::connect: Connecting to intake @ opentelemetry-ebpf-reducer:7000
In file included from ../../../src/collector/kernel/bpf_src/render_bpf.c:39:
In file included from include/net/tcp.h:35:
In file included from include/net/sock_reuseport.h:5:
In file included from include/linux/filter.h:9:
include/linux/bpf.h:321:10: error: invalid application of 'sizeof' to an incomplete type 'struct bpf_rb_root'
                return sizeof(struct bpf_rb_root);
                       ^     ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:321:24: note: forward declaration of 'struct bpf_rb_root'
                return sizeof(struct bpf_rb_root);
                                     ^
include/linux/bpf.h:323:10: error: invalid application of 'sizeof' to an incomplete type 'struct bpf_rb_node'
                return sizeof(struct bpf_rb_node);
                       ^     ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:323:24: note: forward declaration of 'struct bpf_rb_node'
                return sizeof(struct bpf_rb_node);
                                     ^
include/linux/bpf.h:325:10: error: invalid application of 'sizeof' to an incomplete type 'struct bpf_refcount'
                return sizeof(struct bpf_refcount);
                       ^     ~~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:325:24: note: forward declaration of 'struct bpf_refcount'
                return sizeof(struct bpf_refcount);
                                     ^
include/linux/bpf.h:347:10: error: invalid application of '__alignof' to an incomplete type 'struct bpf_rb_root'
                return __alignof__(struct bpf_rb_root);
                       ^          ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:347:29: note: forward declaration of 'struct bpf_rb_root'
                return __alignof__(struct bpf_rb_root);
                                          ^
include/linux/bpf.h:349:10: error: invalid application of '__alignof' to an incomplete type 'struct bpf_rb_node'
                return __alignof__(struct bpf_rb_node);
                       ^          ~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:349:29: note: forward declaration of 'struct bpf_rb_node'
                return __alignof__(struct bpf_rb_node);
                                          ^
include/linux/bpf.h:351:10: error: invalid application of '__alignof' to an incomplete type 'struct bpf_refcount'
                return __alignof__(struct bpf_refcount);
                       ^          ~~~~~~~~~~~~~~~~~~~~~
include/linux/bpf.h:351:29: note: forward declaration of 'struct bpf_refcount'
                return __alignof__(struct bpf_refcount);
                                          ^
../../../src/collector/kernel/bpf_src/tcp-processor/bpf_tcp_send_recv.h:184:53: error: no member named 'iov' in 'struct iov_iter'
  bpf_probe_read(&iov, sizeof(iov), &(msg->msg_iter.iov));
                                      ~~~~~~~~~~~~~ ^
../../../src/collector/kernel/bpf_src/tcp-processor/bpf_tcp_send_recv.h:393:53: error: no member named 'iov' in 'struct iov_iter'
  bpf_probe_read(&iov, sizeof(iov), &(msg->msg_iter.iov));
                                      ~~~~~~~~~~~~~ ^
8 errors generated.
2024-04-25 17:47:56.205695+00:00 error [p:28721 t:28721] Cannot initialize BPF program, res=-1

Failed to compile eBPF code for the Linux distro 'debian' running kernel version 6.5.0-1018-aws.

troubleshoot item bpf_compilation_failed (os=Linux,flavor=debian,headers_src=pre_installed,kernel=6.5.0-1018-aws): ProbeHandler couldn't load BPFModule: Success

This usually means that kernel headers weren't installed correctly.

Please reach out to support and include this log in its entirety so we can diagnose and fix
the problem.

In the meantime, please install kernel headers manually on each host before running
the Kernel Collector.

To manually install kernel headers, follow the instructions below:

  - for Debian/Ubuntu based distros, run:

      sudo apt-get install --yes "linux-headers-`uname -r`"

  - for RedHat based distros like CentOS and Amazon Linux, run:

      sudo yum install -y "kernel-devel-`uname -r`"

Additional context

No response

yonch commented 4 months ago

The first set of errors (include/linux/bpf.h), at first glance, could be due to some internal inconsistency in the kernel headers. For example take the first error:

so there should be a full definition -- curious.
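To illustrate what the compiler is complaining about, here is a minimal sketch of the C rule involved, not the kernel's actual code. The struct `demo_node` and the helper `demo_node_size` are hypothetical names used only for this example. `sizeof` needs the complete definition of a type; a forward declaration alone is not enough, which is what the `bpf.h` errors suggest the compiler saw.

```c
#include <stddef.h>

/* Forward declaration only: at this point the type is incomplete. */
struct demo_node;

/* Uncommenting the next line reproduces the class of error in the log:
 *   invalid application of 'sizeof' to an incomplete type 'struct demo_node'
 */
/* static const size_t too_early = sizeof(struct demo_node); */

/* Pointers to an incomplete type are fine, because their size is known. */
static struct demo_node *demo_head;

/* The full definition completes the type (hypothetical layout). */
struct demo_node {
    void *left;
    void *right;
};

/* From here on sizeof works, because the definition is visible. */
static size_t demo_node_size(void)
{
    (void)demo_head;
    return sizeof(struct demo_node);
}
```

So if the headers the compiler picks up only forward-declare `struct bpf_rb_root`, the `sizeof` in `include/linux/bpf.h` fails even though a full definition may exist in another header that was not included or was shadowed.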

@ccoqueiro would the package repository used to install the packages contain recent versions of the headers? Is the kernel on that machine a recent release in the distro?

yonch commented 4 months ago

The two errors in bpf_tcp_send_recv.h:

So we'd want to figure out what iter_iov() does and handle the modified structure with an #if LINUX_VERSION_CODE < KERNEL_VERSION(6, 4, 0) guard (edit: the #if branch would contain the old code, and the #else branch the new).
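A minimal sketch of such a guard, written so that it builds outside a kernel tree. Everything prefixed `mock_`, the macro `MSG_ITER_IOV_PTR`, and the helper `load_iov` are hypothetical stand-ins and not names from the collector. The fallback `LINUX_VERSION_CODE` of 6.5.0 is assumed to match the kernel in this report. The renamed field `__iov` is my reading of the 6.4+ headers and should be verified against `include/linux/uio.h` before relying on it. `memcpy` stands in for `bpf_probe_read`.

```c
#include <stddef.h>
#include <string.h>

/* Stand-ins so the sketch compiles in userspace. In the BPF source these
 * would come from <linux/version.h>. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(6, 5, 0) /* assumed target kernel */
#endif

/* Hypothetical, heavily reduced mirror of struct iovec. */
struct mock_iovec {
    void *iov_base;
    size_t iov_len;
};

#if LINUX_VERSION_CODE < KERNEL_VERSION(6, 4, 0)
/* Old layout: the field is called iov, as the collector code expects. */
struct mock_iov_iter {
    const struct mock_iovec *iov;
};
#define MSG_ITER_IOV_PTR(iter) (&(iter)->iov)
#else
/* New layout (assumption): the field was renamed and is meant to be
 * reached through the iter_iov() accessor. */
struct mock_iov_iter {
    const struct mock_iovec *__iov;
};
#define MSG_ITER_IOV_PTR(iter) (&(iter)->__iov)
#endif

/* Copies the iovec pointer out of the iterator, the way the collector
 * does with bpf_probe_read(&iov, sizeof(iov), &(msg->msg_iter.iov)). */
static const struct mock_iovec *load_iov(const struct mock_iov_iter *iter)
{
    const struct mock_iovec *iov = NULL;
    memcpy(&iov, MSG_ITER_IOV_PTR(iter), sizeof(iov));
    return iov;
}
```

Hiding the field name behind one macro keeps the two call sites in bpf_tcp_send_recv.h (lines 184 and 393 in the log) identical, so only the macro definition differs between kernel versions.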

ccoqueiro commented 4 months ago

Hello @yonch, yes, as far as I understand. I'm using the opentelemetry-ebpf Helm chart: https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-ebpf

yonch commented 4 months ago

@ccoqueiro I'm wondering if the header package might somehow be old or broken. Is one of these true in your case:

  1. The package repo for the distro (used by apt) is not standard
  2. The kernel header package was installed a long time ago and not updated
  3. The machine is running a bleeding edge kernel for the distro (so the header packaging might be work-in-progress)

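and if the answer is no, a couple of things to try:

  - updating the packages on the system (apt-get upgrade), to see if that fixes the headers
  - running on a machine that does not have headers (e.g., without first running sudo apt-get install --yes linux-headers-$(uname -r)), so letting the network collector fetch its own headers

Note that these will probably only fix the first set of errors. The second set requires modifications in the eBPF code. Are you in a position to pursue those, or should we search for community contributors?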

ccoqueiro commented 4 months ago

Hello @yonch

Answering questions:

  1. "The package repo for the distro (used by apt) is not standard": the distro I used is an Ubuntu 22.04 image provided by AWS, so I understand it's standard.
  2. "The kernel header package was installed a long time ago and not updated": the kernel header package was not installed beforehand; I installed it as a prerequisite for installing otel ebpf.
  3. "The machine is running a bleeding edge kernel for the distro (so the header packaging might be work-in-progress)": I can't answer this question. How could we check this?
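One rough way to check, sketched in C under stated assumptions: read the running release with `uname(2)`, turn it into a comparable number, and compare it with the kernel the distro originally shipped. The names `KVER`, `parse_release`, and `running_kernel_code` are hypothetical helpers for this example, and the baseline you compare against (for instance 5.15 for Ubuntu 22.04) is an assumption to confirm in the distro's release notes.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Same packing scheme as the kernel's KERNEL_VERSION macro. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Parses a release string such as "6.5.0-1018-aws" into a KVER code.
 * Returns -1 if at least major.minor cannot be read. */
static long parse_release(const char *release)
{
    unsigned major = 0, minor = 0, patch = 0;
    if (sscanf(release, "%u.%u.%u", &major, &minor, &patch) < 2)
        return -1;
    return (long)KVER(major, minor, patch);
}

/* Returns the KVER code of the running kernel, or -1 on failure. */
static long running_kernel_code(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return -1;
    return parse_release(u.release);
}
```

If `running_kernel_code()` is well above the baseline, the machine is on a newer kernel series than the distro's original one. For this report, `parse_release("6.5.0-1018-aws")` is at least `KVER(6, 4, 0)`, so the `iov_iter` change discussed above applies to it.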

"updating the packages on the system (apt-get upgrade), see if that fixes the headers": done, but it did not fix the headers.

"running on a machine that does not have headers (e.g., without first running sudo apt-get install --yes linux-headers-$(uname -r)), so letting the network collector fetch its own headers": I ran this command, installing the header package before installing otel ebpf, but it didn't help; it kept giving the same error.

"The second set requires modifications in the eBPF code. Are you in a position to pursue those, or should we search for community contributors?": to be quite honest with you, I have no idea how I would do this.

yonch commented 4 months ago

Got it @ccoqueiro, I marked this issue with "help wanted" and will direct contributors here if asked. I'm sorry I don't have anything more immediate for you. If you find anyone who would like to tackle this, I'm happy to work with them!