vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
16.93k stars 1.46k forks

Host_metrics throwing errors #18916

Open adolsalamanca opened 8 months ago

adolsalamanca commented 8 months ago

Problem

We're running Vector as a DaemonSet in our k8s cluster. All of a sudden we started getting errors from host_metrics; we're only using the cgroups collector on it.

2023-10-24T07:12:05.350924Z ERROR source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::internal_events::host_metrics:
Failed to load cgroups children. error=No such file or directory (os error 2) error_type="reader_failed" stage="receiving" internal_log_rate_limit=true

Configuration

---
data_dir: /vector-data-dir
api:
  enabled: true
  address: 0.0.0.0:8686
  playground: false
expire_metrics_secs: 86400 # 1 day in seconds
sources:
  vector_internal_metrics:
    type: internal_metrics

  cgroup_metrics:
    type: host_metrics
    collectors:
      - cgroups
    scrape_interval_secs: 1

  storage_metrics:
    type: host_metrics
    collectors:
      - filesystem
    filesystem:
      mountpoints:
        includes:
          - "/var/lib/kubelet/pods/*pvc*"
    scrape_interval_secs: 1

transforms:
  cgroup_metric_to_log:
    type: metric_to_log
    inputs:
      - cgroup_metrics

  cgroup_remap:
    type: remap
    inputs:
      - cgroup_metric_to_log
    source: |
      .container_id = "null"
      data, err = parse_regex(.tags.cgroup, r'.*/cri-containerd-(?P<container_id>.*)\.scope')
      if err == null {
        .container_id = data.container_id
      }

      if exists(.counter) {
        .metrics.type = "counter"
        .metrics.value = .counter.value
        del(.counter)
      }
      if exists(.gauge) {
        .metrics.type = "gauge"
        .metrics.value = .gauge.value
        del(.gauge)
      }

      .cgroup_type = "standard"

  cgroup_filter:
    type: filter
    inputs:
      - cgroup_remap
    condition: .container_id != "null"

  cgroup_router:
    type: route
    inputs:
      - cgroup_filter
    route:
      standard: .cgroup_type == "standard"

  storage_metric_to_log:
    type: metric_to_log
    inputs:
      - storage_metrics

  storage_filter:
    type: filter
    inputs:
      - storage_metric_to_log
    condition: .name == "filesystem_free_bytes" || .name == "filesystem_total_bytes" || .name == "filesystem_used_bytes"

  storage_remap_1:
    type: remap
    inputs:
      - storage_filter
    source: |
      # Parse the PVC ID.
      data, err = parse_regex(.tags.mountpoint, r'.+(?P<pvc_id>pvc-.+)/mount')
      assert!(err == null, message: "unable to parse PVC ID")
      .pvc_id = data.pvc_id
      # Grab tags we know are on the event.
      .filesystem = .tags.device
      .mounted_on = .tags.mountpoint
      # Get the available, used, and total storage.
      # The zero values will be overridden by a reduce since the default strategy is to sum.
      .available = 0
      .used = 0
      .total = 0
      if .name == "filesystem_free_bytes" {
        .available = to_int!(.gauge.value)
      }
      if .name == "filesystem_total_bytes" {
        .total = to_int!(.gauge.value)
      }
      if .name == "filesystem_used_bytes" {
        .used = to_int!(.gauge.value)
      }
      # Delete unneeded fields.
      del(.namespace)
      del(.kind)
      del(.gauge)
      del(.tags)
      del(.name)

  # We reduce the various storage metrics so the metric values are merged together.
  storage_reduce:
    type: reduce
    inputs:
      - storage_remap_1
    group_by:
      - timestamp
      - pvc_id
    expire_after_ms: 1000

sinks:
  prometheus_exporter:
    type: prometheus_exporter
    inputs:
      - vector_internal_metrics
    address: 0.0.0.0:9090

  cgroups:
    type: http
    inputs:
      - cgroup_router.standard
    uri: http://svc.cluster.local:8080/api/cgroups
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 268435488
      when_full: block
    batch:
      max_bytes: 15000000
      max_events: 250
      timeout_secs: 1.5
    healthcheck:
      enabled: true
      uri: http://svc.cluster.local:8080/api/healthcheck

  storage:
    type: http
    inputs:
      - storage_remap_2
    uri: http://svc.cluster.local:8080/api/storage
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 268435488
      when_full: block
    batch:
      max_bytes: 15000000
      max_events: 250
      timeout_secs: 1.5
    healthcheck:
      enabled: true
      uri: http://svc.cluster.local:8080/api/healthcheck

Version

timberio/vector:0.27.X-alpine

Debug Output

I couldn't get debug output: once I applied the changes after the errors appeared, no additional info was displayed.

Example Data

No response

Additional Context

No response

References

No response

StephenWakely commented 8 months ago

I can't reproduce the issue if I just run timberio/vector:0.27.X-alpine in Docker.

By the sounds of it the source isn't able to access /sys/fs/cgroup. It's possible Kubernetes is doing something funky with this. Are you able to run a shell inside the pod and ls /sys/fs/cgroup?
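For reference, that check could look something like the following (the pod name is hypothetical; pick one from the DaemonSet showing the error). Note that if the container sets SYSFS_ROOT, the tree to inspect lives under that root instead:

```shell
# Open a shell in one of the DaemonSet pods (hypothetical pod name)
# and list the cgroup tree the host_metrics source reads from.
kubectl exec -it cgroups-metrics-2dsbl -- ls /sys/fs/cgroup

# If SYSFS_ROOT is set (e.g. to /host/sys), check under that root instead:
kubectl exec -it cgroups-metrics-2dsbl -- ls /host/sys/fs/cgroup
```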

adolsalamanca commented 8 months ago

Thanks for your response @StephenWakely. Yeah, it's not super common; it has only happened a few times since we started scraping host metrics. I'll try to run this if the error appears again, and keep you posted.

bruceg commented 7 months ago

I wonder if it is an issue of the cgroups structure changing while we are trying to read the files. That would introduce a race condition that would produce those "file not found" errors. Do you know if this happens more frequently when the contents of the pods are changing, @adolsalamanca?

adolsalamanca commented 7 months ago

Thanks for your message @bruceg

Sounds like a reasonable assumption, but I don't have much data to assess it right now. All I can do is wait for the problem to appear again, then check and compare the IOPS of that node with others.

It wasn't the first time, but it hasn't happened again since we reported it 😞

bruceg commented 7 months ago

To be clear, it's not so much a matter of IOPS as of pods changing, that is, containers being started or stopped. Comparing your error to what is in the source code, I can definitely see how this could happen; I'm just curious whether that is what is happening to you.

adolsalamanca commented 7 months ago

Thanks for your response @bruceg. It doesn't seem to happen that often; below are the results of listing the agents in two of our clusters:

kubectl get pods -l app.kubernetes.io/instance=cgroups-metrics
NAME                    READY   STATUS    RESTARTS      AGE
cgroups-metrics-2dsbl   1/1     Running   0             66d
cgroups-metrics-2kvzl   1/1     Running   0             66d
cgroups-metrics-2mpb5   1/1     Running   0             66d
cgroups-metrics-42r7p   1/1     Running   0             66d
cgroups-metrics-4cpd6   1/1     Running   0             66d
cgroups-metrics-4hrrb   1/1     Running   0             16d
cgroups-metrics-4ktmm   1/1     Running   0             66d
cgroups-metrics-4ns5r   1/1     Running   0             50d
cgroups-metrics-4p88m   1/1     Running   0             66d
cgroups-metrics-4pk8t   1/1     Running   0             66d
cgroups-metrics-5d7vv   1/1     Running   0             66d
cgroups-metrics-5nbg4   1/1     Running   0             66d
cgroups-metrics-64zcj   1/1     Running   0             66d
cgroups-metrics-6756r   1/1     Running   0             66d
cgroups-metrics-6h5bv   1/1     Running   0             66d
cgroups-metrics-6tbb2   1/1     Running   0             66d
cgroups-metrics-6zfv5   1/1     Running   0             66d
cgroups-metrics-78ftx   1/1     Running   0             45d
cgroups-metrics-8vkml   1/1     Running   0             66d
cgroups-metrics-9569m   1/1     Running   0             66d
cgroups-metrics-9kxht   1/1     Running   0             66d
cgroups-metrics-9mjxb   1/1     Running   0             66d
cgroups-metrics-9tnsw   1/1     Running   0             66d
cgroups-metrics-b4m64   1/1     Running   0             66d
cgroups-metrics-bgdxb   1/1     Running   0             66d
cgroups-metrics-bqttv   1/1     Running   0             43d
cgroups-metrics-br9n2   1/1     Running   0             8d
cgroups-metrics-brd98   1/1     Running   0             66d
cgroups-metrics-c25pp   1/1     Running   0             66d
cgroups-metrics-c4mzd   1/1     Running   0             66d
cgroups-metrics-cftfh   1/1     Running   0             66d
cgroups-metrics-cqv4x   1/1     Running   0             66d
cgroups-metrics-crcqg   1/1     Running   0             66d
cgroups-metrics-csrhd   1/1     Running   0             66d
cgroups-metrics-ctft9   1/1     Running   0             66d
cgroups-metrics-djhs2   1/1     Running   0             66d
cgroups-metrics-dkdd8   1/1     Running   0             66d
cgroups-metrics-fr4b2   1/1     Running   0             66d
cgroups-metrics-fz44n   1/1     Running   0             66d
cgroups-metrics-gczlh   1/1     Running   0             66d
cgroups-metrics-ghzvq   1/1     Running   0             66d
cgroups-metrics-gnjv7   1/1     Running   0             66d
cgroups-metrics-gp4zk   1/1     Running   0             14d
cgroups-metrics-h2v4p   1/1     Running   0             43d
cgroups-metrics-hc6tt   1/1     Running   0             26d
cgroups-metrics-hgl6r   1/1     Running   0             66d
cgroups-metrics-hjx2p   1/1     Running   0             66d
cgroups-metrics-hzmh9   1/1     Running   0             42d
cgroups-metrics-j9zqs   1/1     Running   0             66d
cgroups-metrics-jccp4   1/1     Running   0             66d
cgroups-metrics-jxc4x   1/1     Running   0             45d
cgroups-metrics-k4g8r   1/1     Running   0             6d3h
cgroups-metrics-k9d78   1/1     Running   0             66d
cgroups-metrics-kddk6   1/1     Running   0             66d
cgroups-metrics-kvnnm   1/1     Running   0             66d
cgroups-metrics-l2tql   1/1     Running   0             66d
cgroups-metrics-l8clx   1/1     Running   0             66d
cgroups-metrics-lh8hf   1/1     Running   0             6d3h
cgroups-metrics-lk842   1/1     Running   0             24d
cgroups-metrics-ls8qd   1/1     Running   0             66d
cgroups-metrics-lv7f6   1/1     Running   0             66d
cgroups-metrics-m8jzh   1/1     Running   0             66d
cgroups-metrics-mmsh7   1/1     Running   0             66d
cgroups-metrics-mnxtp   1/1     Running   0             43d
cgroups-metrics-mtz52   1/1     Running   0             23d
cgroups-metrics-mvkdj   1/1     Running   0             66d
cgroups-metrics-n9qnt   1/1     Running   0             66d
cgroups-metrics-npfd5   1/1     Running   0             66d
cgroups-metrics-nxwvn   1/1     Running   0             66d
cgroups-metrics-p4zxj   1/1     Running   0             66d
cgroups-metrics-p5grq   1/1     Running   0             66d
cgroups-metrics-p7bjf   1/1     Running   0             66d
cgroups-metrics-p7tk2   1/1     Running   0             66d
cgroups-metrics-ptgkl   1/1     Running   0             66d
cgroups-metrics-pvf4d   1/1     Running   0             66d
cgroups-metrics-q2xgq   1/1     Running   0             66d
cgroups-metrics-q69ws   1/1     Running   0             66d
cgroups-metrics-qdggd   1/1     Running   0             45d
cgroups-metrics-qq2wf   1/1     Running   0             66d
cgroups-metrics-r76c2   1/1     Running   0             66d
cgroups-metrics-rxkll   1/1     Running   0             66d
cgroups-metrics-s28jf   1/1     Running   1 (21d ago)   66d
cgroups-metrics-sb46m   1/1     Running   0             66d
cgroups-metrics-sjpb2   1/1     Running   0             66d
cgroups-metrics-sqnkg   1/1     Running   0             66d
cgroups-metrics-sxzgh   1/1     Running   0             64d
cgroups-metrics-t5c29   1/1     Running   0             66d
cgroups-metrics-v6dqk   1/1     Running   0             42d
cgroups-metrics-v9r45   1/1     Running   0             66d
cgroups-metrics-vdphv   1/1     Running   0             66d
cgroups-metrics-vgt2l   1/1     Running   0             66d
cgroups-metrics-vj2ml   1/1     Running   0             66d
cgroups-metrics-vnqgc   1/1     Running   0             66d
cgroups-metrics-vzxnq   1/1     Running   0             63d
cgroups-metrics-w75rg   1/1     Running   0             66d
cgroups-metrics-w9sj5   1/1     Running   0             36d
cgroups-metrics-wmxpz   1/1     Running   0             66d
cgroups-metrics-wrx4v   1/1     Running   0             66d
cgroups-metrics-wtx5q   1/1     Running   0             66d
cgroups-metrics-wxwch   1/1     Running   0             66d
cgroups-metrics-x2tjl   1/1     Running   0             43d
cgroups-metrics-xk7tm   1/1     Running   0             66d
cgroups-metrics-z7sp2   1/1     Running   0             66d
cgroups-metrics-zccqd   1/1     Running   0             66d
kubectl -n $sav get pods -l app.kubernetes.io/instance=cgroups-metrics
NAME                    READY   STATUS    RESTARTS        AGE
cgroups-metrics-2kbg2   1/1     Running   0               31d
cgroups-metrics-2zb52   1/1     Running   0               2d11h
cgroups-metrics-444vm   1/1     Running   0               45h
cgroups-metrics-4n4kr   1/1     Running   0               31d
cgroups-metrics-4n4td   1/1     Running   0               31d
cgroups-metrics-4rdwh   1/1     Running   0               31d
cgroups-metrics-5l59j   1/1     Running   0               31d
cgroups-metrics-6bdlv   1/1     Running   0               31d
cgroups-metrics-7vglz   1/1     Running   0               31d
cgroups-metrics-8jgkn   1/1     Running   0               31d
cgroups-metrics-9fcpp   1/1     Running   0               31d
cgroups-metrics-9kl2n   1/1     Running   0               31d
cgroups-metrics-9n6wr   1/1     Running   0               31d
cgroups-metrics-9pps2   1/1     Running   2 (3d5h ago)    31d
cgroups-metrics-9w27x   1/1     Running   0               31d
cgroups-metrics-b6lgq   1/1     Running   0               43h
cgroups-metrics-b7bl5   1/1     Running   0               31d
cgroups-metrics-bf8cl   1/1     Running   0               31d
cgroups-metrics-bnspl   1/1     Running   0               13d
cgroups-metrics-cct2c   1/1     Running   0               2d4h
cgroups-metrics-d7dt4   1/1     Running   0               31d
cgroups-metrics-dbjmj   1/1     Running   0               10d
cgroups-metrics-g4nr5   1/1     Running   0               31d
cgroups-metrics-gbxzl   1/1     Running   0               2d15h
cgroups-metrics-hbhdk   1/1     Running   0               31d
cgroups-metrics-jldfk   1/1     Running   1 (13d ago)     31d
cgroups-metrics-jzlwq   1/1     Running   0               15d
cgroups-metrics-kg29s   1/1     Running   0               31d
cgroups-metrics-kk6x2   1/1     Running   0               31d
cgroups-metrics-kx4tz   1/1     Running   0               31d
cgroups-metrics-lrxbt   1/1     Running   0               31d
cgroups-metrics-mgqwb   1/1     Running   0               31d
cgroups-metrics-nc4mp   1/1     Running   0               31d
cgroups-metrics-pnr2q   1/1     Running   0               31d
cgroups-metrics-pr8vm   1/1     Running   0               31d
cgroups-metrics-q4gxl   1/1     Running   0               31d
cgroups-metrics-qb6dp   1/1     Running   0               31d
cgroups-metrics-qsqcb   1/1     Running   0               31d
cgroups-metrics-rltft   1/1     Running   0               31d
cgroups-metrics-rn68q   1/1     Running   0               31d
cgroups-metrics-rxnmg   1/1     Running   0               31d
cgroups-metrics-srshk   1/1     Running   0               31d
cgroups-metrics-sxr2s   1/1     Running   0               34h
cgroups-metrics-t642x   1/1     Running   0               31d
cgroups-metrics-t8fst   1/1     Running   0               31d
cgroups-metrics-tnpdv   1/1     Running   0               31d
cgroups-metrics-tnsbv   1/1     Running   0               31d
cgroups-metrics-tqh6k   1/1     Running   0               14d
cgroups-metrics-v2r4g   1/1     Running   0               2d17h
cgroups-metrics-wk6nn   1/1     Running   0               31d
cgroups-metrics-x275s   1/1     Running   0               31d
cgroups-metrics-x8jkt   1/1     Running   0               31d
cgroups-metrics-xh2f5   1/1     Running   0               31d
cgroups-metrics-xrxvf   1/1     Running   0               31d
cgroups-metrics-z28t4   1/1     Running   0               31d
cgroups-metrics-z8crf   1/1     Running   0               31d
cgroups-metrics-zk66t   1/1     Running   1 (3d14h ago)   29d
cgroups-metrics-zxln4   1/1     Running   0               10d
bruceg commented 7 months ago

Right, I would expect it to be fairly rare. It would happen when the list of cgroups changed between listing the directory and reading the contents of the files in it. It would probably be more likely if Vector was busy at the same time and ended up scheduling away from the host metrics task internally.
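That race can be sketched outside of Vector. A minimal illustration (hypothetical directory names, Python standard library only, not Vector's actual reader) of listing a directory's children and having one vanish before it is read, which produces the same "No such file or directory (os error 2)":

```python
import errno
import os
import tempfile

# Hypothetical cgroup-like layout: one child directory under a root.
root = tempfile.mkdtemp()
scope = os.path.join(root, "cri-containerd-demo.scope")
os.mkdir(scope)

children = os.listdir(root)  # step 1: enumerate the cgroup children
os.rmdir(scope)              # a container stops in between the two steps

caught = None
try:
    # step 2: read a child that has since vanished
    os.listdir(os.path.join(root, children[0]))
except OSError as e:
    caught = e  # FileNotFoundError with errno 2 (ENOENT)

os.rmdir(root)
```

The window between the two steps is tiny, which matches how rarely the error shows up; anything that delays the reader (such as the source task being scheduled away) widens it.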

adolsalamanca commented 5 months ago

Another occurrence; leaving the logs below:

cgroups-metrics-z64mh vector 2024-01-31T13:03:44.710012Z  INFO vector::app: Internal log rate limit configured. internal_log_rate_secs=10
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.712114Z  INFO vector::app: Log level is enabled. level="info"
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.713223Z  INFO vector::app: Loading configs. paths=["/etc/vector"]
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.789217Z  WARN vector::config::loading: Transform "cgroup_router._unmatched" has no consumers
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.796030Z  INFO source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::sources::host_metrics: PROCFS_ROOT is set in envvars. Using custom for procfs. custom="/host/proc"
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.796043Z  INFO source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::sources::host_metrics: SYSFS_ROOT is set in envvars. Using custom for sysfs. custom="/host/sys"
cgroups-metrics-z64mh vector 2024-01-31T13:03:44.815354Z ERROR vector_buffers::variants::disk_v2::writer: Last written record was unable to be deserialized. Corruption likely. reason="invalid data: check failed for struct member payload: pointer out of bounds: base 0x7fbf8cb39ff8 offset -1377203812 not in range 0x7fbf91ca8000..0x7fbf91cdd000"
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.243927Z  INFO vector::topology::running: Running healthchecks.
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.244036Z  INFO vector::topology::builder: Healthcheck passed.
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.244222Z  INFO vector: Vector has started. debug="false" version="0.27.1" arch="x86_64" revision="19a51f2 2023-02-22"
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.244513Z  INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.251556Z  INFO vector::internal_events::api: API server running. address=0.0.0.0:8686 playground=off
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.269963Z  INFO vector::topology::builder: Healthcheck passed.
cgroups-metrics-z64mh vector 2024-01-31T13:03:46.274133Z  INFO vector::topology::builder: Healthcheck passed.
cgroups-metrics-z64mh vector 2024-01-31T13:04:02.327559Z ERROR source{component_kind="source" component_id=cgroup_metrics component_type=host_metrics component_name=cgroup_metrics}: vector::internal_events::host_metrics: Failed to load cgroups children. error=No such file or directory (os error 2) error_type="reader_failed" stage="receiving" internal_log_rate_limit=true