prometheus / blackbox_exporter

Blackbox prober exporter
https://prometheus.io
Apache License 2.0

rootless ICMP ping #147

Closed: darkk closed this issue 4 years ago

darkk commented 7 years ago

x/net/icmp supports root-less operation for ICMP pings on Linux and MacOSX, but blackbox_exporter requires elevated privileges for that.

Are there any non-obvious blockers for using rootless ping sockets? I've looked at the code and I've not noticed any.
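For context, a minimal sketch of what rootless operation looks like with x/net/icmp on Linux (the target 8.8.8.8 and the payload are arbitrary; this assumes the process's GID falls within net.ipv4.ping_group_range):

package main

import (
	"log"
	"net"
	"os"
	"time"

	"golang.org/x/net/icmp"
	"golang.org/x/net/ipv4"
)

func main() {
	// "udp4" requests the non-privileged datagram-oriented ICMP endpoint;
	// no root or CAP_NET_RAW is needed.
	c, err := icmp.ListenPacket("udp4", "0.0.0.0")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	msg := icmp.Message{
		Type: ipv4.ICMPTypeEcho, Code: 0,
		Body: &icmp.Echo{ID: os.Getpid() & 0xffff, Seq: 1, Data: []byte("ping")},
	}
	wb, err := msg.Marshal(nil)
	if err != nil {
		log.Fatal(err)
	}
	// With "udp4" the destination is a *net.UDPAddr, not a *net.IPAddr.
	if _, err := c.WriteTo(wb, &net.UDPAddr{IP: net.ParseIP("8.8.8.8")}); err != nil {
		log.Fatal(err)
	}

	rb := make([]byte, 1500)
	_ = c.SetReadDeadline(time.Now().Add(3 * time.Second))
	n, peer, err := c.ReadFrom(rb)
	if err != nil {
		log.Fatal(err)
	}
	rm, err := icmp.ParseMessage(1, rb[:n]) // 1 = IANA protocol number for ICMPv4
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("got %v from %v", rm.Type, peer)
}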

brian-brazil commented 7 years ago

Those docs seem to be confusing UDP and ICMP.

darkk commented 7 years ago

Maybe the doc is wrong, or maybe it says "udp" because the IPPROTO_ICMP socket is created with SOCK_DGRAM as the second argument (the way a UDP socket is, though a UDP socket uses 0 instead of IPPROTO_ICMP as the third argument). Either way, golang/go#9166 explicitly states that the feature is implemented.
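For concreteness, the distinction darkk is drawing maps to the socket arguments roughly like this (a sketch using x/sys/unix, not code from blackbox_exporter):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Datagram-oriented ICMP socket: SOCK_DGRAM (like UDP) as the second
	// argument, but IPPROTO_ICMP as the third. Succeeds without CAP_NET_RAW
	// when the caller's GID is within net.ipv4.ping_group_range.
	fd, err := unix.Socket(unix.AF_INET, unix.SOCK_DGRAM, unix.IPPROTO_ICMP)
	if err != nil {
		fmt.Println("unprivileged ICMP socket unavailable:", err)
		return
	}
	defer unix.Close(fd)
	fmt.Println("opened unprivileged ICMP datagram socket:", fd)
}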

brian-brazil commented 7 years ago

The docs are definitely wrong; the only reference to ICMP is for privileged sockets, which means root.

I think we need more clarity here, and to know which kernels support this.

darkk commented 7 years ago

> which kernels support this

AFAIK, it has been mainlined since v2.6.39.

I have not figured out the earliest MacOSX version supporting socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP), but it was already supported when the patch was merged into the Linux mainline.

brian-brazil commented 7 years ago

So introduced just under 6 years ago, that's relatively new as kernel features go.

darkk commented 7 years ago

> The docs are definitely wrong

The docs match test code at https://github.com/golang/net/blob/master/icmp/ping_test.go#L61

brian-brazil commented 7 years ago

The docs indicate you need privileged access to use this feature: "For privileged raw ICMP endpoints, network must be "ip4" or "ip6" followed by a colon and an ICMP protocol number or name."

darkk commented 7 years ago

The part of the doc that is relevant to the ticket is located a couple of paragraphs above:

> For non-privileged datagram-oriented ICMP endpoints, network must be "udp4" or "udp6". The endpoint allows to read, write a few limited ICMP messages such as echo request and echo reply. Currently only Darwin and Linux support this.

It also needs some privileges: the running process should be in a group within net.ipv4.ping_group_range. But that is a much smaller amount of privilege than the cap_net_raw capability.

Please take a look at icmp/ping_test.go and icmp/listen_posix.go before repeating that the docs confuse UDP and ICMP.

brian-brazil commented 7 years ago

> For non-privileged datagram-oriented ICMP endpoints, network must be "udp4" or "udp6".

This line as written confuses ICMP and UDP, and the following example only mentions UDP. My interpretation is that this is probably a typo and that ICMP should be UDP.

> Please take a look at icmp/ping_test.go and icmp/listen_posix.go before repeating that the docs confuse UDP and ICMP.

If I need to read source code to see what the docs actually mean, then the docs are confusing and/or wrong.

> It also needs some privileges: the running process should be in a group within net.ipv4.ping_group_range. But that is a much smaller amount of privilege than the cap_net_raw capability.

That's not of much use then. A feature that only works on newer kernels and requires additional setup doesn't win over SUID or NET_RAW, which work ~everywhere.

SuperQ commented 7 years ago

We should document both the setcap CAP_NET_RAW and sysctl net.ipv4.ping_group_range methods of giving the exporter access to the privileges needed.

The only major distribution with a kernel older than 2.6.39 is RHEL6. RHEL7 has been out since 2014.
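For reference, the two setups look roughly like this (the binary path and GID range below are illustrative, not prescribed by the project):

# Grant the raw-socket capability to the binary (the classic privileged approach):
sudo setcap cap_net_raw+ep /usr/local/bin/blackbox_exporter

# Or allow unprivileged ICMP (ping) sockets for the group(s) the exporter runs as:
sudo sysctl -w net.ipv4.ping_group_range="0 2147483647"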

brian-brazil commented 7 years ago

> and sysctl net.ipv4.ping_group_range methods of giving the exporter access to the privileges needed.

This would also require code changes; it's a different API.

SuperQ commented 7 years ago

Right, but we could handle that with fallback detection.

brian-brazil commented 7 years ago

> The only major distribution with a kernel older than 2.6.39 is RHEL6.

CentOS 6.9 was released last month, is supported until 2020, and ships with kernel 2.6.32.

We have users on older systems (including at least one that can't even run Go out of the box because their kernel is so old). I'm wary of adding features that require tweaking sysctls and don't work for everyone, as that's a non-trivial amount of cognitive overhead. Users will want this to just work out of the box, and I suspect this will also be Fun with containers.

We already have two documented ways to make this work on Linux, why should we add a third that doesn't work for everyone?

SuperQ commented 7 years ago

Users already have to tweak setcap or setuid, so adding a third option isn't any different. The newer variation on the syscall for sending ICMP packets is safer, as raw socket access is not required. The goal here is to reduce the attack surface.

SuperQ commented 7 years ago

@brian-brazil So, after reading the docs and the source, I think I understand your confusion.

There are two modes of operation for ICMP ListenPacket: unprivileged and privileged. The "u" in "udp4" doesn't stand for UDP; it stands for "unprivileged".

The docs aren't wrong, they're just a little confusing, since they don't spell out what the connection-type strings stand for.

Either way, that's just a distraction from the real issue. We should attempt to use unprivileged ListenPacket and fall back to privileged automatically.
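A minimal sketch of that fallback (assuming IPv4 and x/net/icmp; this is not the actual blackbox_exporter implementation):

package main

import (
	"log"

	"golang.org/x/net/icmp"
)

func listenICMP() (*icmp.PacketConn, bool, error) {
	// Try the non-privileged datagram-oriented endpoint first.
	if c, err := icmp.ListenPacket("udp4", "0.0.0.0"); err == nil {
		return c, true, nil
	}
	// Fall back to a privileged raw ICMP socket (needs root or CAP_NET_RAW).
	c, err := icmp.ListenPacket("ip4:icmp", "0.0.0.0")
	return c, false, err
}

func main() {
	c, unprivileged, err := listenICMP()
	if err != nil {
		log.Fatalf("neither ICMP mode available: %v", err)
	}
	defer c.Close()
	log.Printf("listening for ICMP, unprivileged=%v", unprivileged)
}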

sasah commented 5 years ago

+1

jcollie commented 3 years ago

FWIW, I was finally able to get non-root ICMP pings working with blackbox-exporter. The key was setting net.ipv4.ping_group_range as part of the pod security context. No other combination (adding NET_RAW, extra groups, or custom containers with setcap cap_net_raw+ep on the binary) worked, except running as root.

net.ipv4.ping_group_range is namespaced, so changing it as part of the pod won't affect other parts of the system.

My testing was done with blackbox exporter 0.19.0, CentOS 7, kernel 3.10.0-1160.31.1.el7.x86_64, and Kubernetes 1.21.2.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-exporter
spec:
  replicas: 6
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - blackbox-exporter
                topologyKey: "kubernetes.io/hostname"
      securityContext:
        sysctls:
          - name: net.ipv4.ping_group_range
            value: "0 2147483647"
      containers:
        - name: blackbox-exporter
          image: docker.io/prom/blackbox-exporter:v0.19.0
          ports:
            - name: metrics
              containerPort: 9115
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /etc/blackbox_exporter
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          securityContext:
            runAsUser: 49172
            runAsGroup: 49172
            runAsNonRoot: true
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
      volumes:
        - name: config
          configMap:
            name: blackbox-exporter

abctaylor commented 6 months ago

A working values.yaml is below (plus an ingress). I appreciate it's long, but it should work in 2024...

global:
  ## Global image registry to use if it needs to be overridden for some specific use cases (e.g. local registries, custom images, ...)
  ##
  imageRegistry: ""

restartPolicy: Always

kind: Deployment

## Override the namespace
##
namespaceOverride: ""

# Override Kubernetes version if your distribution does not follow semver v2
kubeVersionOverride: ""

## set to true to add the release label so scraping of the servicemonitor with kube-prometheus-stack works out of the box
releaseLabel: false

podDisruptionBudget: {}
  # maxUnavailable: 0

## Allow automount the serviceaccount token for sidecar container (eg: oauthproxy)
automountServiceAccountToken: false

## Additional blackbox-exporter container environment variables
## For instance to add a http_proxy
##
## extraEnv:
##   HTTP_PROXY: "http://superproxy.com:3128/"
##   NO_PROXY: "localhost,127.0.0.1"
extraEnv: {}

## Additional blackbox-exporter container environment variables for secret
## extraEnvFromSecret:
##   - secretOne
##   - secretTwo
extraEnvFromSecret: ""

extraVolumes: []
  # - name: secret-blackbox-oauth-htpasswd
  #   secret:
  #     defaultMode: 420
  #     secretName: blackbox-oauth-htpasswd
  # - name: storage-volume
  #   persistentVolumeClaim:
  #     claimName: example

## Additional volume mounts that will be attached to the blackbox-exporter container
extraVolumeMounts: []
  # - name: ca-certs
  #   mountPath: /etc/ssl/certs/ca-certificates.crt

## Additional InitContainers to initialize the pod
## This supports either a structured array or a templatable string
extraInitContainers: []

## This supports either a structured array or a templatable string

# Array mode
extraContainers: []
  # - name: oAuth2-proxy
  #   args:
  #     - -https-address=:9116
  #     - -upstream=http://localhost:9115
  #     - -skip-auth-regex=^/metrics
  #     - -openshift-delegate-urls={"/":{"group":"monitoring.coreos.com","resource":"prometheuses","verb":"get"}}
  #   image: openshift/oauth-proxy:v1.1.0
  #   ports:
  #       - containerPort: 9116
  #         name: proxy
  #   resources:
  #       limits:
  #         memory: 16Mi
  #       requests:
  #         memory: 4Mi
  #         cpu: 20m
  #   volumeMounts:
  #     - mountPath: /etc/prometheus/secrets/blackbox-tls
  #       name: secret-blackbox-tls

# String mode
# extraContainers: |-
#   - name: oAuth2-proxy
#     args:
#       - -https-address=:9116
#       - -upstream=http://localhost:9115
#       - -skip-auth-regex=^/metrics
#       - -openshift-delegate-urls={"/":{"group":"monitoring.coreos.com","resource":"prometheuses","verb":"get"}}
#     image: {{ .Values.global.imageRegistry }}/openshift/oauth-proxy:v1.1.0

## Enable pod security policy
pspEnabled: true

hostNetwork: false

strategy:
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
  type: RollingUpdate

image:
  registry: quay.io
  repository: prometheus/blackbox-exporter
  # Overrides the image tag whose default is {{ printf "v%s" .Chart.AppVersion }}
  tag: ""
  pullPolicy: IfNotPresent
  digest: ""

  ## Optionally specify an array of imagePullSecrets.
  ## Secrets must be manually created in the namespace.
  ## ref: https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
  ##
  # pullSecrets:
  #   - myRegistrKeySecretName

podSecurityContext:
  sysctls:
    - name: net.ipv4.ping_group_range
      value: "0 2147483647"
  # fsGroup: 1000

## User and Group to run blackbox-exporter container as
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    # Add NET_RAW to enable ICMP
    add: ["NET_RAW"]

livenessProbe:
  httpGet:
    path: /-/healthy
    port: http
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /-/healthy
    port: http

nodeSelector: {}
tolerations: []
affinity: {}

## Topology spread constraints rely on node labels to identify the topology domain(s) that each Node is in.
## Ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/
topologySpreadConstraints: []
  # - maxSkew: 1
  #   topologyKey: failure-domain.beta.kubernetes.io/zone
  #   whenUnsatisfiable: DoNotSchedule
  #   labelSelector:
  #     matchLabels:
  #       app.kubernetes.io/instance: jiralert

# if the configuration is managed as secret outside the chart, using SealedSecret for example,
# provide the name of the secret here. If secretConfig is set to true, configExistingSecretName will be ignored
# in favor of the config value.
configExistingSecretName: ""
# Store the configuration as a `Secret` instead of a `ConfigMap`, useful in case it contains sensitive data
secretConfig: false
config:
  modules:
    http_2xx:
      prober: http
      timeout: 5s
      http:
        valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
        follow_redirects: true
        preferred_ip_protocol: "ip4"
    icmp:
      prober: icmp
      icmp:
        preferred_ip_protocol: ip4
allowIcmp: true
# Set a custom config path, other than the default /config/blackbox.yaml. If left empty, the path will be "/config/blackbox.yaml"
# configPath: "/foo/bar"

extraConfigmapMounts: []
  # - name: certs-configmap
  #   mountPath: /etc/secrets/ssl/
  #   subPath: certificates.crt # (optional)
  #   configMap: certs-configmap
  #   readOnly: true
  #   defaultMode: 420

## Additional secret mounts
# Defines additional mounts with secrets. Secrets must be manually created in the namespace.
extraSecretMounts: []
  # - name: secret-files
  #   mountPath: /etc/secrets
  #   secretName: blackbox-secret-files
  #   readOnly: true
  #   defaultMode: 420

resources: {}
  # limits:
  #   memory: 300Mi
  # requests:
  #   memory: 50Mi

priorityClassName: ""

service:
  annotations: {}
  labels: {}
  type: ClusterIP
  port: 9115
  ipDualStack:
    enabled: false
    ipFamilies: ["IPv6", "IPv4"]
    ipFamilyPolicy: "PreferDualStack"

# Only changes container port. Application port can be changed with extraArgs (--web.listen-address=:9115)
# https://github.com/prometheus/blackbox_exporter/blob/998037b5b40c1de5fee348ffdea8820509d85171/main.go#L55
containerPort: 9115

# Number of port to expose on the host. If specified, this must be a valid port number, 0 < x < 65536. If zero, no port is exposed.
# This is useful for communicating with Daemon Pods when kind is DaemonSet.
hostPort: 0

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  annotations: {}

## An Ingress resource can provide name-based virtual hosting and TLS
## termination among other things for CouchDB deployments which are accessed
## from outside the Kubernetes cluster.
## ref: https://kubernetes.io/docs/concepts/services-networking/ingress/
ingress:
  enabled: true
  className: "nginx"
  labels: {}
  annotations:
    kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    ## The host property on hosts and tls is passed through helm tpl function.
    ## ref: https://helm.sh/docs/developing_charts/#using-the-tpl-function
    - host: blackbox-exporter.core.example.net
      paths:
        - path: /
          pathType: ImplementationSpecific
    - host: blackbox-exporter
      paths:
        - path: /
          pathType: ImplementationSpecific
  tls:
   - secretName: cert-blackbox-exporter.core.example.net
     hosts:
       - blackbox-exporter.core.example.net
       - blackbox-exporter
podAnnotations: {}

# Annotations for the Deployment
deploymentAnnotations: {}

# Annotations for the Secret
secretAnnotations: {}

# hostAliases allows adding additional DNS entries to be injected directly into pods.
# These take precedence over your implemented DNS solution
hostAliases: []
#  - ip: 192.168.1.1
#    hostNames:
#      - test.example.com
#      - another.example.net

pod:
  labels: {}

extraArgs: []
  # - --history.limit=1000

replicas: 1

serviceMonitor:
  ## If true, a ServiceMonitor CRD is created for a prometheus operator
  ## https://github.com/coreos/prometheus-operator for blackbox-exporter itself
  ##
  selfMonitor:
    enabled: false
    additionalMetricsRelabels: {}
    additionalRelabeling: []
    labels: {}
    path: /metrics
    scheme: http
    tlsConfig: {}
    interval: 30s
    scrapeTimeout: 30s
    ## Port can be defined by assigning a value for the port key below
    ## port:

  ## If true, a ServiceMonitor CRD is created for a prometheus operator
  ## https://github.com/coreos/prometheus-operator for each target
  ##
  enabled: false

  # Default values that will be used for all ServiceMonitors created by `targets`
  defaults:
    additionalMetricsRelabels: {}
    additionalRelabeling: []
    labels: {}
    interval: 30s
    scrapeTimeout: 30s
    module: http_2xx
  ## scheme: HTTP scheme to use for scraping. Can be used with `tlsConfig` for example if using istio mTLS.
  scheme: http
  ## path: HTTP path. Needs to be adjusted, if web.route-prefix is set
  path: "/probe"
  ## tlsConfig: TLS configuration to use when scraping the endpoint. For example if using istio mTLS.
  ## Of type: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#tlsconfig
  tlsConfig: {}
  bearerTokenFile:

  targets:
#    - name: example                    # Human readable URL that will appear in Prometheus / AlertManager
#      url: http://example.com/healthz  # The URL that blackbox will scrape
#      hostname: example.com            # HTTP probes can accept an additional `hostname` parameter that will set `Host` header and TLS SNI
#      labels: {}                       # Map of labels for ServiceMonitor. Overrides value set in `defaults`
#      interval: 60s                    # Scraping interval. Overrides value set in `defaults`
#      scrapeTimeout: 60s               # Scrape timeout. Overrides value set in `defaults`
#      module: http_2xx                 # Module used for scraping. Overrides value set in `defaults`
#      additionalMetricsRelabels: {}    # Map of metric labels and values to add
#      additionalRelabeling: []         # List of metric relabeling actions to run

## Custom PrometheusRules to be defined
## ref: https://github.com/coreos/prometheus-operator#customresourcedefinitions
prometheusRule:
  enabled: false
  additionalLabels: {}
  namespace: ""
  rules: []

podMonitoring:
  ## If true, a PodMonitoring CR is created for google managed prometheus
  ## https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gmp-pod-monitoring for blackbox-exporter itself
  ##
  selfMonitor:
    enabled: false
    additionalMetricsRelabels: {}
    labels: {}
    path: /metrics
    interval: 30s
    scrapeTimeout: 30s

  ## If true, a PodMonitoring CR is created for a google managed prometheus
  ## https://cloud.google.com/stackdriver/docs/managed-prometheus/setup-managed#gmp-pod-monitoring for each target
  ##
  enabled: false

  ## Default values that will be used for all PodMonitoring created by `targets`
  ## Following PodMonitoring API specs https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/doc/api.md#scrapeendpoint
  defaults:
    additionalMetricsRelabels: {}
    labels: {}
    interval: 30s
    scrapeTimeout: 30s
    module: http_2xx
  ## scheme: Protocol scheme to use to scrape.
  scheme: http
  ## path: HTTP path. Needs to be adjusted, if web.route-prefix is set
  path: "/probe"
  ## tlsConfig: TLS configuration to use when scraping the endpoint. For example if using istio mTLS.
  ## Of type: https://github.com/coreos/prometheus-operator/blob/master/Documentation/api.md#tlsconfig
  tlsConfig: {}

  targets:
#    - name: example                    # Human readable URL that will appear in Google Managed Prometheus / AlertManager
#      url: http://example.com/healthz  # The URL that blackbox will scrape
#      hostname: example.com            # HTTP probes can accept an additional `hostname` parameter that will set `Host` header and TLS SNI
#      labels: {}                       # Map of labels for PodMonitoring. Overrides value set in `defaults`
#      interval: 60s                    # Scraping interval. Overrides value set in `defaults`
#      scrapeTimeout: 60s               # Scrape timeout. Overrides value set in `defaults`
#      module: http_2xx                 # Module used for scraping. Overrides value set in `defaults`
#      additionalMetricsRelabels: {}    # Map of metric labels and values to add

## Network policy for chart
networkPolicy:
  # Enable network policy and allow access from anywhere
  enabled: false
  # Limit access only from monitoring namespace
  # Before setting this value to true, you must add the name=monitoring label to the monitoring namespace
  # Network Policy uses label filtering
  allowMonitoringNamespace: false

## dnsPolicy and dnsConfig for Deployments and Daemonsets if you want non-default settings.
## These will be passed directly to the PodSpec of same.
dnsPolicy:
dnsConfig:

# Extra manifests to deploy as an array
extraManifests: []
  # - apiVersion: v1
  #   kind: ConfigMap
  #   metadata:
  #   labels:
  #     name: prometheus-extra
  #   data:
  #     extra-data: "value"

# global common labels, applied to all resources
commonLabels: {}

# Enable vertical pod autoscaler support for prometheus-blackbox-exporter
verticalPodAutoscaler:
  enabled: false

  # Recommender responsible for generating recommendation for the object.
  # List should be empty (then the default recommender will generate the recommendation)
  # or contain exactly one recommender.
  # recommenders:
  # - name: custom-recommender-performance

  # List of resources that the vertical pod autoscaler can control. Defaults to cpu and memory
  controlledResources: []
  # Specifies which resource values should be controlled: RequestsOnly or RequestsAndLimits.
  # controlledValues: RequestsAndLimits

  # Define the max allowed resources for the pod
  maxAllowed: {}
  # cpu: 200m
  # memory: 100Mi
  # Define the min allowed resources for the pod
  minAllowed: {}
  # cpu: 200m
  # memory: 100Mi

  updatePolicy:
    # Specifies minimal number of replicas which need to be alive for VPA Updater to attempt pod eviction
    # minReplicas: 1
    # Specifies whether recommended updates are applied when a Pod is started and whether recommended updates
    # are applied during the life of a Pod. Possible values are "Off", "Initial", "Recreate", and "Auto".
    updateMode: Auto

configReloader:
  enabled: false
  containerPort: 8080
  config:
    logFormat: logfmt
    logLevel: info
    watchInterval: 1m
  image:
    registry: quay.io
    repository: prometheus-operator/prometheus-config-reloader
    tag: "v0.71.2"
    pullPolicy: IfNotPresent
    digest: ""
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    readOnlyRootFilesystem: true
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    capabilities:
      drop: ["ALL"]
  resources:
    limits:
      memory: 50Mi
    requests:
      cpu: 10m
      memory: 20Mi
  livenessProbe:
    httpGet:
      path: /healthz
      port: reloader-web
      scheme: HTTP
  readinessProbe:
    httpGet:
      path: /healthz
      port: reloader-web
      scheme: HTTP
  service:
    port: 8080
  serviceMonitor:
    selfMonitor:
      additionalMetricsRelabels: {}
      additionalRelabeling: []
      path: /metrics
      scheme: http
      tlsConfig: {}
      interval: 30s
      scrapeTimeout: 30s

vdksystem commented 4 months ago

Looks like it doesn't work in 2024; I had to go with root to make it work.