grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Pyroscope Java profiling not working after following documentation (missing Linux capabilities) #1616

Closed caspar-ds closed 3 weeks ago

caspar-ds commented 1 month ago

What's wrong?

After following the documentation here, profiling of Java processes results in the following errors for all processes:

{"ts":"2024-09-04T20:14:23.991554124Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/1/exe: permission denied","pid":1}
{"ts":"2024-09-04T20:14:23.991644175Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/2/exe: permission denied","pid":2}
{"ts":"2024-09-04T20:14:23.991677615Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/3/exe: permission denied","pid":3}
{"ts":"2024-09-04T20:14:23.991708775Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/4/exe: permission denied","pid":4}
{"ts":"2024-09-04T20:14:23.991749485Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/6/exe: permission denied","pid":6}
{"ts":"2024-09-04T20:14:23.991779286Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/8/exe: permission denied","pid":8}
{"ts":"2024-09-04T20:14:23.991807106Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/9/exe: permission denied","pid":9}
...
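
For anyone wanting to reproduce the failure outside Alloy, the errors above come from a `readlink` on `/proc/<pid>/exe`. The following is a minimal diagnostic sketch, not Alloy's actual implementation: the function name `probe_exe_links` is hypothetical, and it simply attempts the same system call for every PID visible to the container and reports how many are denied.

```python
import os


def probe_exe_links(proc_root="/proc"):
    """Try readlink on <proc_root>/<pid>/exe for every numeric entry.

    Returns (readable, denied): readable maps pid -> exe path, and denied
    is a sorted list of pids for which readlink raised PermissionError.
    """
    readable, denied = {}, []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo" or "self"
        pid = int(entry)
        try:
            readable[pid] = os.readlink(os.path.join(proc_root, entry, "exe"))
        except PermissionError:
            denied.append(pid)
        except OSError:
            # The process exited mid-scan, or it has no exe link at all.
            continue
    return readable, sorted(denied)


if __name__ == "__main__":
    readable, denied = probe_exe_links()
    print(f"readable: {len(readable)}, permission denied: {len(denied)}")
    for pid, exe in sorted(readable.items()):
        if exe.endswith("/java"):
            print(f"java process: pid={pid} exe={exe}")
```

Running this inside the Alloy container (assuming a Python interpreter is available there, which the stock image may not provide) should show a large "permission denied" count when the required capabilities are missing.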

Helm values:

alloy:
  configMap:
    create: false
    name: alloy-config
    key: config.alloy
  stabilityLevel: "generally-available"
  enableReporting: false
  securityContext:
    runAsUser: 0

controller:
  type: daemonset
  hostPID: true

Alloy config:

logging {
    level  = "info"
    format = "json"
}

discovery.kubernetes "local_pods" {
  selectors {
    field = "spec.nodeName=" + env("HOSTNAME")
    role = "pod"
  }
  role = "pod"
}

discovery.relabel "java_pods" {
  targets = discovery.kubernetes.local_pods.targets
  // Filter only java processes
  rule {
    source_labels = ["__meta_process_exe"]
    action = "keep"
    regex = ".*/java$"
  }
  rule {
    action = "drop"
    regex = "Succeeded|Failed|Completed"
    source_labels = ["__meta_kubernetes_pod_phase"]
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_namespace"]
    target_label = "namespace"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label = "pod"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label = "node"
  }
  rule {
    action = "replace"
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label = "container"
  }
  // Provide arbitrary service_name label, otherwise it will be inferred from discovery labels automatically
  rule {
    action = "replace"
    regex = "(.*)@(.*)"
    replacement = "java/${1}/${2}"
    separator = "@"
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_container_name"]
    target_label = "service_name"
  }
}

pyroscope.java "java" {
  forward_to = [pyroscope.write.pyroscope_write.receiver]
  targets = discovery.relabel.java_pods.output
}

pyroscope.write "pyroscope_write" {
    endpoint {
        url = "http://pyroscope.pyroscope.svc.cluster.local:4040"
    }
}

Steps to reproduce

Install Alloy in a Kubernetes cluster using the Helm values and Alloy configuration above.

System information

Linux version 5.10.223-212.873.amzn2.x86_64

Software version

Grafana Alloy v1.3.1

Logs

{"ts":"2024-09-04T20:14:23.783384097Z","level":"info","boringcrypto enabled":false}
{"ts":"2024-09-04T20:14:23.783438837Z","level":"info","msg":"starting complete graph evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f"}
{"ts":"2024-09-04T20:14:23.783469298Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"logging","duration":119411}
{"ts":"2024-09-04T20:14:23.783507508Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"labelstore","duration":9270}
{"ts":"2024-09-04T20:14:23.783551438Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"tracing","duration":8990}
{"ts":"2024-09-04T20:14:23.783574438Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"otel","duration":1810}
{"ts":"2024-09-04T20:14:23.78361252Z","level":"info","msg":"applying non-TLS config to HTTP server","service":"http"}
{"ts":"2024-09-04T20:14:23.78362498Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"http","duration":32892}
{"ts":"2024-09-04T20:14:23.783952992Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"cluster","duration":2580}
{"ts":"2024-09-04T20:14:23.784420467Z","level":"info","msg":"Using pod service account via in-cluster config","component_path":"/","component_id":"discovery.kubernetes.local_pods"}
{"ts":"2024-09-04T20:14:23.784886451Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"discovery.kubernetes.local_pods","duration":877908}
{"ts":"2024-09-04T20:14:23.785026342Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"discovery.process.all","duration":90800}
{"ts":"2024-09-04T20:14:23.785502667Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"discovery.relabel.java_pods","duration":403204}
{"ts":"2024-09-04T20:14:23.786070292Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"pyroscope.write.pyroscope_write","duration":518875}
{"ts":"2024-09-04T20:14:23.929998101Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"pyroscope.java.java","duration":143866429}
{"ts":"2024-09-04T20:14:23.930232423Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"remotecfg","duration":132201}
{"ts":"2024-09-04T20:14:23.930302133Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"livedebugging","duration":25870}
{"ts":"2024-09-04T20:14:23.930370644Z","level":"info","msg":"finished node evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","node_id":"ui","duration":14340}
{"ts":"2024-09-04T20:14:23.930394904Z","level":"info","msg":"finished complete graph evaluation","controller_path":"/","controller_id":"","trace_id":"c1709a86c9849a4bc42d9586b31a673f","duration":147237459}
{"ts":"2024-09-04T20:14:23.931033431Z","level":"info","msg":"scheduling loaded components and services"}
{"ts":"2024-09-04T20:14:23.931588126Z","level":"info","msg":"starting cluster node","service":"cluster","peers_count":0,"peers":"","advertise_addr":"127.0.0.1:12345"}
{"ts":"2024-09-04T20:14:23.932640515Z","level":"info","msg":"peers changed","service":"cluster","peers_count":1,"peers":"grafana-alloy-4pt7h"}
{"ts":"2024-09-04T20:14:23.933006819Z","level":"info","msg":"now listening for http traffic","service":"http","addr":"0.0.0.0:12345"}
{"ts":"2024-09-04T20:14:23.991554124Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/1/exe: permission denied","pid":1}
{"ts":"2024-09-04T20:14:23.991644175Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/2/exe: permission denied","pid":2}
{"ts":"2024-09-04T20:14:23.991677615Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/3/exe: permission denied","pid":3}
{"ts":"2024-09-04T20:14:23.991708775Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/4/exe: permission denied","pid":4}
{"ts":"2024-09-04T20:14:23.991749485Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/6/exe: permission denied","pid":6}
{"ts":"2024-09-04T20:14:23.991779286Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/8/exe: permission denied","pid":8}
{"ts":"2024-09-04T20:14:23.991807106Z","level":"error","msg":"failed to get process info","component_path":"/","component_id":"discovery.process.all","err":"readlink /proc/9/exe: permission denied","pid":9}

caspar-ds commented 1 month ago

Adding the following allows the container to read what it needs:

alloy:
  # ...
  securityContext:
    runAsUser: 0
    runAsNonRoot: false
    capabilities:
      add:
        - all

Is it documented anywhere which capabilities are required for Alloy to function?

Thanks

korniltsev commented 1 month ago

Yes, it is documented https://grafana.com/docs/alloy/latest/reference/components/pyroscope/pyroscope.java/#pyroscopejava

caspar-ds commented 1 month ago

> Yes, it is documented https://grafana.com/docs/alloy/latest/reference/components/pyroscope/pyroscope.java/#pyroscopejava

Hi @korniltsev! The only thing I can see in that documentation is a note about requiring root and running inside the host PID namespace, but that is not necessarily sufficient for things to work if Linux capability restrictions are enforced (which they usually are in any well-configured production environment).

It took a little trial and error, but we found that the following was sufficient for our use case (using Grafana Alloy only to scrape data for Pyroscope):

alloy:
  # ...
  securityContext:
    runAsUser: 0
    runAsNonRoot: false
    capabilities:
      add:
        - PERFMON
        - SYS_PTRACE
        - SYS_RESOURCE
        - SYS_ADMIN

Hopefully this issue will help anyone else running into the same problem.
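
To confirm that the capabilities listed above actually reached the container, one option is to decode the `CapEff` bitmask from `/proc/self/status`. This is a rough sketch under stated assumptions: the helper names `effective_caps` and `missing_caps` are hypothetical, the bit numbers are taken from the kernel's `linux/capability.h`, and the `CAP_` prefix is the kernel's spelling of the names that Kubernetes lists without it.

```python
# Capability bit numbers as defined in linux/capability.h.
REQUIRED = {
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
    "CAP_SYS_RESOURCE": 24,
    "CAP_PERFMON": 38,
}


def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def missing_caps(mask, required=REQUIRED):
    """Return the sorted names of required capabilities absent from mask."""
    return sorted(name for name, bit in required.items() if not mask & (1 << bit))


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        mask = effective_caps(f.read())
    print(f"CapEff: {mask:016x}")
    print("missing:", missing_caps(mask) or "none")
```

An empty "missing" list only shows that the four capabilities from this thread are present; it does not prove they are the minimal set for every Alloy component.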

korniltsev commented 1 month ago

We usually run it as "privileged" root. I agree we need to update the docs.