vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector randomly stops shipping certain k8s logs #12014

Open danthegoodman1 opened 2 years ago

danthegoodman1 commented 2 years ago


Problem

After a while, our Vector DaemonSet will randomly stop shipping logs for certain services (other pods keep shipping logs)

Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}

### Version

timberio/vector:0.20.0-distroless-libc

### Debug Output

```text
Working on trying to get relevant debugging output; it sends a lot of TRACE logs currently :P
```

### Example Data

```json
{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}
```

### Additional Context

The only logs from Vector:

2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
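
These mapping failures come from `parse_json!` aborting when a line isn't valid JSON (for example plain-text or multi-line output). Below is a minimal sketch of a more defensive version of the `setlevel` transform, assuming the same field names as the config above; it falls back instead of failing the mapping:

```yaml
transforms:
  setlevel:
    type: remap
    inputs: [kubernetes_logs]
    source: |-
      # Parse the message once; fall back to level "other" for lines that
      # aren't valid JSON instead of aborting the whole mapping.
      .temp, err = parse_json(.message)
      if err != null {
        .level = "other"
      } else {
        if exists(.temp.level) {
          .level = .temp.level
        } else {
          .level = "other"
        }
        if exists(.temp.uri) { .uri = .temp.uri }
        if exists(.temp.msg) { .msg = .temp.msg }
        if exists(.temp.lat) {
          .lat = .temp.lat
          .lon = .temp.lon
        }
      }
      del(.temp)
```

The only intended difference from the config above is that non-JSON lines get level "other" rather than tripping the mapping error.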

### References

Similar to an issue I've had in the past: https://github.com/vectordotdev/vector/discussions/8634

danthegoodman1 commented 2 years ago

@zamazan4ik fluentbit is a good alternative

zamazan4ik commented 2 years ago

I know a lot of log forwarders (fluentbit, rsyslog, whatever else). But I am especially interested in Vector :)

jszwedko commented 2 years ago

Thanks for all of the thoughts, everyone. I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.

There seems to be a mix of issues reported in the comments above. I think they can be categorized into:

If you don't think your issue is represented in the above set, please leave an additional comment!

danthegoodman1 commented 2 years ago

@jszwedko we have our glob cooldown set to 2 seconds and have still observed it. Ultimately we have to move to something that doesn't drop logs, because we depend on logs to know when errors occur.

I can't imagine that k8s is not a massive part of Vector's user base. We aren't logging very quickly either, maybe 10-20 lines/s.

What would we need to do to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can't.

CharlieC3 commented 2 years ago

I'm in the same boat as @danthegoodman1. We're currently using Vector in most of our K8s clusters and I love what it brings to the table, but the lack of attention and priority given to this specific issue is concerning. From what I've noticed, this bug seems to affect:

  1. Similarly named pods, like those created by StatefulSets, the most
  2. Pods running on a large node with more than a couple dozen neighboring pods, even when the Vector agent has more than enough resources

We're likely going to switch over to something else like fluent-bit for log collection until this issue is resolved.

CharlieC3 commented 1 year ago

I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:

  1. Update Vector to version 0.25.X. It doesn't seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the kubernetes_logs source to help with annotations (relevant issue: https://github.com/vectordotdev/vector/issues/13467)
  2. Add these settings to your kubernetes_logs source config:
    max_line_bytes: 16777216
    max_read_bytes: 16777216
    glob_minimum_cooldown_ms: 1000

The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using lower max_line_bytes and max_read_bytes values, as recommended above and elsewhere (though still much higher than the defaults); after many attempts with little to no change, I stopped increasing them. After revisiting it and trying something wildly large, to my surprise it worked!

I was still receiving annotation failures, though. After deep-diving into the commit history, I saw that a seemingly unannounced workaround for https://github.com/vectordotdev/vector/issues/13467 was in place in the latest version, and since upgrading I haven't seen any annotation issues.
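
For reference, here's a sketch of how these settings sit on the kubernetes_logs source (the option names are the ones mentioned above; the values are simply what worked for me, not general recommendations):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Much larger read/line buffers than the defaults, plus a 1s glob cooldown.
    max_line_bytes: 16777216        # 16 MiB
    max_read_bytes: 16777216        # 16 MiB
    glob_minimum_cooldown_ms: 1000
```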

zamazan4ik commented 1 year ago

Wow, that's good to know. I hope to see comments from the Vector dev team about possible fixes for these issues.

danthegoodman1 commented 1 year ago

> I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following: [...]

Great! For us, lowering the glob cooldown is what really did it.

dm3ch commented 1 year ago

Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.

In my case (a 145-node cluster where I haven't yet hit the described issue; I copied this config snippet from another cluster where the issue existed), dropping the glob_minimum_cooldown_ms: 1000 option allowed me to decrease total CPU usage (summed across all Vector pods) from 117 CPUs to 2 CPUs (average per pod: 0.4 CPU -> 0.01 CPU).

I am still verifying that dropping this parameter hasn't led to missing logs (such a big change in resource consumption seems suspicious to me).

CharlieC3 commented 1 year ago

> Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.

@dm3ch I noticed this in my setup too, but in my scenario I think the "increased" CPU usage is just a result of the service actually processing logs :) Dropping this setting results in very low CPU usage for me as well, but I was also receiving only about 10% of the logs I should have been. With glob_minimum_cooldown_ms: 1000 and higher CPU usage, I find that my resource usage more closely reflects the resource and capacity planning expectations laid out here, so I feel it's working as expected with this setting in my situation, though yours may be different from mine.

dm3ch commented 1 year ago

@CharlieC3 Which version of Vector are you using?

CharlieC3 commented 1 year ago

@dm3ch I haven't needed to upgrade since my last comment, so I'm still running Docker image timberio/vector:0.25.1-distroless-libc.

k24dizzle commented 1 year ago

sharing my results before and after adding:

max_line_bytes: 16777216
max_read_bytes: 16777216
glob_minimum_cooldown_ms: 1000

to my kubernetes_logs configuration

[Screenshot (2023-10-11): graph of ingested bytes before and after the change]

Instead of ingesting bytes in one-minute bursts, it now seems to funnel the logs at a steady pace.

I'll defer to people more knowledgeable to interpret why this happens.

sumeet-zuora commented 9 months ago

Still seeing these errors with the above config changes:

2024-01-30T07:39:33.600554Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.001557068Z stdout F     \"ID\" : \"2247389355\","), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(225))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:33.600617Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
2024-01-30T07:39:43.693441Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] has been suppressed 19999 times.
2024-01-30T07:39:43.693455Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.097262663Z stdout F \t... 7 more"), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(212))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:43.693518Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
    kubernetes_logs:
      type: kubernetes_logs
      timezone: local
      max_line_bytes: 16777216
      max_read_bytes: 16777216
      glob_minimum_cooldown_ms: 1000 

Any workaround would really help; we are using 0.35.0.
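
Not a confirmed fix, but for the annotation_failed errors one knob that may be worth checking is delay_deletion_ms, which controls how long the kubernetes_logs source keeps pod metadata cached after a pod is deleted. A sketch, assuming the option is available in the version in use:

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    timezone: local
    max_line_bytes: 16777216
    max_read_bytes: 16777216
    glob_minimum_cooldown_ms: 1000
    # Keep pod metadata cached longer than the default after a pod is
    # deleted, so late-read log lines can still be annotated.
    delay_deletion_ms: 300000
```

If the affected pods are deleted or recreated quickly (e.g. StatefulSet restarts), a longer delay gives late-read lines a better chance of still finding their metadata in the cache.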

danthegoodman1 commented 9 months ago

> I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.

That was 2 years ago... it doesn't feel remotely appreciated. I'm wondering if the Datadog acquisition is to blame.

danthegoodman1 commented 9 months ago

@sumeet-zuora it seems like you have some other error that's causing logs to get held up.

danthegoodman1 commented 5 months ago

This is still an issue

48N6E commented 4 months ago

We also encountered this random log-loss problem when using version 0.39.0; even though we specified /var/log/pods/xxxx/xxxx/0.log in include, Vector still did not read the logs.