open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

root_path unexpected behavior #19761

Closed · HansHabraken closed this issue 11 months ago

HansHabraken commented 1 year ago

Component(s)

receiver/hostmetrics

What happened?

Description

Hi, we are currently running the Splunk distribution of the OpenTelemetry Collector inside a Docker container. To get the correct metrics from the host, we mount the host's entire filesystem inside the container and set the root_path configuration option as described in the documentation. Running the collector container results in errors from the filesystem scraper, and it looks like root_path and the mount point are sometimes concatenated so that the /hostfs prefix appears twice. I think I have traced this back to this code. Is this a bug, or is there something missing in the documentation?
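
To illustrate what I think is happening (just a rough sketch in Go, not the actual scraper code): if the mount points the scraper sees already start with /hostfs, because the host filesystem is mounted there inside the container, then prepending root_path again produces the doubled paths from the error log below.

package main

import (
    "fmt"
    "path/filepath"
)

// prependRootPath mirrors, in spirit, joining the configured root_path with a
// scraped mount point before reading its usage. Illustrative only.
func prependRootPath(rootPath, mountPoint string) string {
    return filepath.Join(rootPath, mountPoint)
}

func main() {
    rootPath := "/hostfs"

    // A mount point that already carries the /hostfs prefix ends up doubled.
    fmt.Println(prependRootPath(rootPath, "/hostfs/etc/kubernetes")) // /hostfs/hostfs/etc/kubernetes

    // A mount point without the prefix is joined as intended.
    fmt.Println(prependRootPath(rootPath, "/var/lib")) // /hostfs/var/lib
}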

Steps to Reproduce

Expected Result

The collector should run without any errors.

Actual Result

The filesystem scraper logs the following errors:

docker-otel-collector-1  | 2023-03-17T10:24:45.434Z     error   scraperhelper/scrapercontroller.go:212  Error scraping metrics  {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "failed to read usage at /hostfs/hostfs/etc/cni/net.d: no such file or directory; failed to read usage at /hostfs/hostfs/etc/kubernetes: no such file or directory; failed to read usage at /hostfs/hostfs/usr/libexec/kubernetes/kubelet-plugins: no such file or directory; failed to read usage at /hostfs/hostfs/var/lib: no such file or directory; failed to read usage at /hostfs/etc/hostname: no such file or directory", "scraper": "filesystem"}
docker-otel-collector-1  | go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
docker-otel-collector-1  |      go.opentelemetry.io/collector@v0.72.0/receiver/scraperhelper/scrapercontroller.go:212
docker-otel-collector-1  | go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
docker-otel-collector-1  |      go.opentelemetry.io/collector@v0.72.0/receiver/scraperhelper/scrapercontroller.go:191

Collector version

v0.72.0

Environment information

Environment

Container image: quay.io/signalfx/splunk-otel-collector:latest
Host OS: macOS Ventura 13.1 (M1)

OpenTelemetry Collector configuration


extensions:
  health_check:
    endpoint: 0.0.0.0:13133

  memory_ballast:
    size_mib: ${SPLUNK_BALLAST_SIZE_MIB}

receivers:
  hostmetrics:
    collection_interval: 10s
    root_path: /hostfs
    scrapers:
      cpu:
      disk:
      filesystem:
      memory:
      network:
      # System load average metrics https://en.wikipedia.org/wiki/Load_(computing)
      load:
      # Paging/Swap space utilization and I/O metrics
      paging:
      # Aggregated system process count metrics
      processes:
      # System processes metrics, disabled by default
      # process:

  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # This section is used to collect the OpenTelemetry Collector metrics
  # Even if just a Splunk APM customer, these metrics are included
  prometheus/internal:
    config:
      scrape_configs:
      - job_name: 'otel-collector'
        scrape_interval: 10s
        static_configs:
        - targets: ['0.0.0.0:8888']
        metric_relabel_configs:
          - source_labels: [ __name__ ]
            regex: '.*grpc_io.*'
            action: drop

  smartagent/signalfx-forwarder:
    type: signalfx-forwarder
    listenAddress: 0.0.0.0:9080

  smartagent/processlist:
    type: processlist

  signalfx:
    endpoint: 0.0.0.0:9943

processors:
  batch:

  memory_limiter:
    check_interval: 2s
    limit_mib: ${SPLUNK_MEMORY_LIMIT_MIB}

  resourcedetection:
    detectors: [gcp, ecs, ec2, azure, system]
    override: true

  resource/dimensions:
    attributes:
      - key: environment
        action: insert
        value: "${ENV}"
      - key: deployment.environment
        action: insert
        value: "${ENV}"
      - key: env
        action: insert
        value: "${ENV}"
      - key: sf_environment
        action: insert
        value: "${ENV}"
      - key: aws_region
        value: "${AWS_REGION}"
        action: insert
      - key: aws_account_id
        value: "${AWS_ACCOUNT_ID}"
        action: insert
      - key: app_group
        value: "${APP_GROUP}"
        action: insert
      - key: app
        value: "${APP}"
        action: insert
      - key: sf_service
        value: "${APP}"
        action: insert

  resource/tags:
    attributes:
      - key: environment
        action: insert
        value: "${ENV}"
      - key: deployment.environment
        action: insert
        value: "${ENV}"
      - key: env
        action: insert
        value: "${ENV}"
      - key: sf_environment
        action: insert
        value: "${ENV}"
      - key: aws_region
        value: "${AWS_REGION}"
        action: insert
      - key: aws_account_id
        value: "${AWS_ACCOUNT_ID}"
        action: insert
      - key: app_group
        value: "${APP_GROUP}"
        action: insert
      - key: app
        value: "${APP}"
        action: insert
      - key: sf_service
        value: "${APP}"
        action: insert

exporters:
  # Traces
  sapm:
    access_token: "${SPLUNK_TRACE_ACCESS_TOKEN}"
    endpoint: "${SPLUNK_TRACE_URL}"

  # Metrics + Events
  signalfx:
    access_token: "${SPLUNK_METRIC_ACCESS_TOKEN}"
    api_url: "${SPLUNK_API_URL}"
    ingest_url: "${SPLUNK_INGEST_URL}"
    sync_host_metadata: true
    correlation:

  # Can be used as exporter to send metrics, traces and logs to stdout
  logging:
    loglevel: info

service:
  extensions: [health_check, memory_ballast]
  pipelines:
    traces:
      receivers: [otlp, smartagent/signalfx-forwarder]
      processors: [memory_limiter, batch, resourcedetection, resource/tags]
      exporters: [sapm, signalfx]

    metrics:
      receivers: [hostmetrics, otlp, signalfx, smartagent/signalfx-forwarder]
      processors: [memory_limiter, batch, resourcedetection, resource/dimensions]
      exporters: [signalfx]

    metrics/internal:
      receivers: [prometheus/internal]
      processors: [memory_limiter, batch, resourcedetection]
      exporters: [signalfx]

Log output

No response

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dmitryax commented 1 year ago

Hi @HansHabraken. Thanks for reporting the issue.

I cannot reproduce it yet. I don't get the doubled /hostfs prefix. Do you run it locally with docker run, or in another way? Could you please provide more details to help reproduce it?

HansHabraken commented 1 year ago

Hi @dmitryax, yes, sure. We use Docker Compose to run our application locally together with the collector. Here is our docker-compose.yaml:

services:
    application:
        image: applicationImage:latest
        ports:
            - 1234:1234
            - ...
        environment:
            OTEL_JAVAAGENT_DEBUG: true
            OTEL_SERVICE_NAME: application-name
            SPLUNK_METRICS_ENABLED: false
            OTEL_RESOURCE_ATTRIBUTES: service.name=application-name,deployment.environment=test
            ...
    otel-collector:
        image: quay.io/signalfx/splunk-otel-collector:latest
        command: [--config=/etc/splunk-otel-collector-config.yaml]
        volumes:
            - /:/hostfs
            - /tmp/splunk-otel-collector-config.yaml:/etc/splunk-otel-collector-config.yaml
        network_mode: "service:application"
        environment:
            SPLUNK_TRACE_ACCESS_TOKEN: ${SPLUNK_TRACE_ACCESS_TOKEN}
            SPLUNK_METRIC_ACCESS_TOKEN: ${SPLUNK_METRIC_ACCESS_TOKEN}
            ...
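
In case it helps with reproducing: a quick way to check the theory is to look at the mount table inside the collector container and see whether mount points already start with /hostfs. A minimal Go sketch (illustrative only, just reading /proc/mounts) could be:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// Print every mount point from the container's own mount table that already
// starts with /hostfs. If such entries exist, prepending root_path to them
// would produce the /hostfs/hostfs/... paths from the error log.
func main() {
    f, err := os.Open("/proc/mounts")
    if err != nil {
        fmt.Fprintln(os.Stderr, "cannot read /proc/mounts:", err)
        os.Exit(1)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // /proc/mounts format: device mountpoint fstype options dump pass
        fields := strings.Fields(scanner.Text())
        if len(fields) >= 2 && strings.HasPrefix(fields[1], "/hostfs") {
            fmt.Println(fields[1])
        }
    }
}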

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 11 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.