vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector agent hangs using k8s_logs as a source #9489

Open isality opened 2 years ago

isality commented 2 years ago

Hi there, I have a problem with the Vector agent using k8s_logs (the kubernetes_logs source).

Vector Version

vector --version
vector 0.16.1 (x86_64-unknown-linux-musl 739e878 2021-08-26)

Vector Configuration File

api:
  address: 0.0.0.0:8686
  enabled: false
  playground: false
data_dir: /var/lib/vector
log_schema:
  host_key: host
  message_key: message
  source_type_key: source_type
  timestamp_key: '@timestamp'
sinks:
  elasticsearch:
    auth:
      password: ${APP_ES_PASSWORD:-secret}
      strategy: basic
      user: ${APP_ES_USER:-elastic}
    batch:
      timeout_secs: 60
    buffer:
      max_size: 209800000
      type: disk
      when_full: block
    compression: none
    encoding:
      timestamp_format: rfc3339
    endpoint: ${APP_ES_ENDPOINT:-http://127.0.0.1:9200}
    healthcheck:
      enabled: true
    index: ${APP_ES_INDEX_PREFIX:-app}-{{ kubernetes.pod_namespace }}-%Y.%m.%d
    inputs:
    - k8s_rm_fields
    mode: normal
    request:
      concurrency: adaptive
      retry_attempts: 2
      retry_max_duration_secs: 5
    tls:
      ca_file: /vector-data-dir/tls/ca.crt
      verify_hostname: false
    type: elasticsearch
  vector_logs_console:
    encoding:
      codec: text
      timestamp_format: rfc3339
    inputs:
    - vector_logs
    target: stdout
    type: console
  vector_logs_exporter:
    address: 0.0.0.0:9090
    default_namespace: service
    inputs:
    - vector_logs_metrics
    type: prometheus_exporter
sources:
  k8s_logs:
    auto_partial_merge: true
    fingerprint_lines: 1
    glob_minimum_cooldown_ms: 1000
    max_line_bytes: 327680
    type: kubernetes_logs
  vector_logs:
    type: internal_logs
  vector_logs_metrics:
    scrape_interval_secs: 60
    type: internal_metrics
timezone: Europe/Moscow
transforms:
  k8s_rm_fields:
    fields:
    - kubernetes.pod_labels
    inputs:
    - k8s_logs
    type: remove_fields

Debug Output

In the prod environment I can't run Vector in debug mode.
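
If partial debug output would help, one less noisy option (a sketch only, not verified against 0.16.1) is to scope the log level to just the kubernetes modules via Vector's log-filter environment variable. Newer releases read VECTOR_LOG, older ones read LOG, and both accept tracing-style filter directives. The first module target below is taken from the log lines further down; the second is an assumption and may differ by version:

env:
  - name: VECTOR_LOG   # older Vector releases read LOG instead
    value: "info,vector::internal_events::kubernetes=debug,vector::kubernetes=debug"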

Expected Behavior

The Vector agent works fine and does not hang.

Actual Behavior

The Vector agent hangs after 1-5 days. The problem is resolved by restarting the pod.
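
Until the root cause is found, one possible stopgap (a sketch only, and it assumes api.enabled is switched to true in the config above so the health endpoint on 0.0.0.0:8686 is actually served) is a liveness probe on the agent container, so Kubernetes recycles a wedged pod automatically. Note it would only catch a process that stops answering HTTP; a stalled kubernetes_logs source while the rest of Vector stays healthy would not trip it:

livenessProbe:
  httpGet:
    path: /health        # Vector API health endpoint (requires api.enabled: true)
    port: 8686
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 5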

Example Data

I get these messages in the pod logs:

Oct 04 08:55:58.971 ERROR source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Oct 04 08:55:58.972  WARN source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Watch stream failed.
Handling desync.
Oct 04 09:00:50.979 ERROR source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Oct 04 09:00:50.980  WARN source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Watch stream failed.
Handling desync.
Oct 04 09:05:42.987 ERROR source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Oct 04 09:05:42.987  WARN source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Watch stream failed.
Handling desync.
Oct 04 09:10:34.994 ERROR source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Oct 04 09:10:34.994  WARN source{component_kind="source" component_id=k8s_logs component_type=kubernetes_logs component_name=k8s_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
svenmueller commented 2 years ago

We experience a similar issue (using nightly-2021-11-18) in our biggest cluster (> 300 nodes).

jszwedko commented 1 year ago

I'm curious whether this is still seen in the latest versions of Vector, as we replaced much of the k8s integration code with the Rust Kubernetes client (kube-rs), and this issue seems related to fetching k8s metadata.

akutta commented 1 year ago

We are running 0.24.x (I can double-check which exactly) and have experienced this issue. We can try updating to the latest release in the near future and see if we notice it again.

As a data point, our k8s nodes are replaced every 7 days, and we have only noticed this issue once across several thousand nodes.

We don't have a good detector for this issue yet, as we only recently noticed it, so it could be more frequent than that.
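
One possible detector (a sketch only; the metric and label names are assumptions and differ across Vector versions, e.g. events_in_total with component_name on older releases vs component_received_events_total with component_id on newer ones) is a Prometheus alert on the internal_metrics already exported on :9090 in the config above, firing when the kubernetes_logs source stops ingesting events for an extended period:

groups:
  - name: vector-agent
    rules:
      - alert: VectorKubernetesLogsStalled
        # metric/label names are assumptions; adjust to what your Vector version exports
        expr: rate(vector_component_received_events_total{component_id="k8s_logs"}[15m]) == 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Vector kubernetes_logs source on {{ $labels.instance }} has ingested no events for 30 minutes"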