vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Vector randomly stops shipping certain k8s logs #12014

Open · danthegoodman1 opened this issue 2 years ago

danthegoodman1 commented 2 years ago

### Problem

After running for a while, our Vector DaemonSet randomly stops shipping logs for certain services, while other pods' logs continue to ship.

### Configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}
```

### Version

timberio/vector:0.20.0-distroless-libc

### Debug Output

```text
Still working on capturing relevant debug output; it currently produces a lot of TRACE logs :P
```

### Example Data

{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}

### Additional Context

The only logs from Vector itself:

```text
2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
```
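
These warnings come from `parse_json!` aborting the whole remap program whenever `.message` is not valid JSON (for example, continuation lines of a multi-line stack trace). For reference, a minimal sketch of a more tolerant variant of the `setlevel` transform above, keeping the same field names but falling back to an empty object instead of aborting, could look like this:

```yaml
transforms:
  setlevel:
    type: remap
    inputs: [kubernetes_logs]
    source: |-
      # Fall back to an empty object when the line is not valid JSON,
      # instead of aborting the program the way parse_json! does.
      parsed = object(parse_json(.message) ?? {}) ?? {}
      .level = "other"
      if parsed.level != null { .level = parsed.level }
      if parsed.uri != null { .uri = parsed.uri }
      if parsed.msg != null { .msg = parsed.msg }
      if parsed.lat != null {
        .lat = parsed.lat
        .lon = parsed.lon
      }
```

This is only a sketch of one way to silence the mapping warnings; it does not address the missing-logs issue itself.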

### References

This is similar to an issue I've hit in the past: https://github.com/vectordotdev/vector/discussions/8634

danthegoodman1 commented 2 years ago

@zamazan4ik fluentbit is a good alternative

zamazan4ik commented 2 years ago

I know plenty of log forwarders (fluentbit, rsyslog, and others), but I am especially interested in Vector :)

jszwedko commented 2 years ago

Thanks for all of the thoughts everyone. I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.

The comments above seem to report a mix of issues, which I think fall into a few distinct categories. If you don't think your issue is represented in that set, please leave an additional comment!

danthegoodman1 commented 2 years ago

@jszwedko we have our glob cooldown set to 2 seconds and still observe this. Ultimately we have to move to something that doesn't drop logs, because we depend on logs to know when errors occur.

I can't imagine that Kubernetes isn't a massive part of Vector's user base. We aren't logging very quickly either, maybe 10-20 lines/s.

What would we need to do to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can't.

CharlieC3 commented 1 year ago

I'm in the same boat as @danthegoodman1. We're currently using Vector in most of our K8s clusters and I love what it brings to the table, but the lack of attention and priority given to this specific issue is concerning. What I've noticed is that this bug seems to affect:

  1. Similarly named pods, like those created by StatefulSets, the most
  2. Pods running on a large node with more than a couple dozen neighboring pods, despite the Vector agent having more than enough resources

We're likely going to switch over to something else like fluent-bit for log collection until this issue is resolved.

CharlieC3 commented 1 year ago

I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:

  1. Update Vector to version 0.25.X. It doesn't seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the kubernetes_logs source to help with annotations (relevant issue: https://github.com/vectordotdev/vector/issues/13467)
  2. Add these settings to your kubernetes_logs source config:
    max_line_bytes: 16777216
    max_read_bytes: 16777216
    glob_minimum_cooldown_ms: 1000

The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using lower max_line_bytes / max_read_bytes values, as recommended above and elsewhere (still much higher than the defaults), but after many attempts with little to no change I stopped increasing them. After revisiting it and trying something wildly large, to my surprise it worked!
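
For reference, a minimal sketch of a kubernetes_logs source with these settings applied (all other options left at their defaults) might look like:

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    max_line_bytes: 16777216        # 16 MiB
    max_read_bytes: 16777216        # 16 MiB
    glob_minimum_cooldown_ms: 1000  # allow rescanning for new/rotated log files every second
```

The cooldown governs how often Vector globs for new or rotated log files, which appears to be why lowering it helps with logs from short-lived or quickly rotating pods.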

I was still receiving annotation failures, though. After deep-diving into the commit history I saw that a seemingly unannounced workaround for https://github.com/vectordotdev/vector/issues/13467 was in place in the latest version, and since upgrading I haven't seen any annotation issues.

zamazan4ik commented 1 year ago

Wow, that's good to know. I hope to see comments from the Vector dev team about possible fixes for these issues.

danthegoodman1 commented 1 year ago

> I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following: [...]

Great! For us, dropping the glob cooldown time is what really did it.

dm3ch commented 1 year ago

Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.

In my case (a 145-node cluster where I haven't yet seen the described issue; I copied this config snippet from another cluster where the issue did exist), dropping the glob_minimum_cooldown_ms: 1000 option let me decrease total CPU usage (summed across all Vector pods) from 117 CPUs to 2 CPUs (average per pod: 0.4 CPU -> 0.01 CPU).

I am still verifying that dropping this parameter hasn't led to missing logs, because such a big change in resource consumption seems suspicious to me.
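
Assuming the trade-off described above (a 1000 ms cooldown picks up new files quickly but costs CPU on large nodes, while the source's much larger default picks them up slowly), a hypothetical middle-ground value could be sketched like this:

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Hypothetical compromise value: rescan for new/rotated files every few
    # seconds instead of every second, trading pickup latency for lower CPU.
    glob_minimum_cooldown_ms: 5000
```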

CharlieC3 commented 1 year ago

> Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.

@dm3ch I noticed this in my setup too, but in my scenario I think the "increased" CPU usage is just a result of the service actually processing logs :) Dropping this setting also results in very low CPU usage for me, but I was receiving only about 10% of the logs I should have been. With glob_minimum_cooldown_ms: 1000 and the higher CPU usage, my resource usage more closely reflects the resource and capacity planning expectations laid out here, so I feel it's working as expected with this setting in my situation. Though yours may be different from mine.

dm3ch commented 1 year ago

@CharlieC3 Which version of Vector are you using?

CharlieC3 commented 1 year ago

@dm3ch I haven't needed to upgrade since my last comment, so I'm still running Docker image timberio/vector:0.25.1-distroless-libc.

k24dizzle commented 1 year ago

Sharing my results before and after adding:

max_line_bytes: 16777216
max_read_bytes: 16777216
glob_minimum_cooldown_ms: 1000

to my kubernetes_logs configuration

[Screenshot from 2023-10-11: graph of ingested bytes before and after the change]

Instead of ingesting bytes in one-minute bursts, it now seems to funnel the logs through at a steady pace.

I'll defer to people more knowledgeable than me to interpret why this happens.

sumeet-zuora commented 9 months ago

Still seeing these errors even with the above config changes:

2024-01-30T07:39:33.600554Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.001557068Z stdout F     \"ID\" : \"2247389355\","), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(225))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:33.600617Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
2024-01-30T07:39:43.693441Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] has been suppressed 19999 times.
2024-01-30T07:39:43.693455Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.097262663Z stdout F \t... 7 more"), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(212))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:43.693518Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
Our kubernetes_logs source config:

```yaml
    kubernetes_logs:
      type: kubernetes_logs
      timezone: local
      max_line_bytes: 16777216
      max_read_bytes: 16777216
      glob_minimum_cooldown_ms: 1000
```

Any workaround would really help. We are using 0.35.0.

danthegoodman1 commented 9 months ago

> I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.

That was 2 years ago... it doesn't feel remotely appreciated. I'm wondering if the Datadog acquisition is to blame.

danthegoodman1 commented 9 months ago

@sumeet-zuora it seems like you have some other error that's causing logs to get stuck.

danthegoodman1 commented 5 months ago

This is still an issue

48N6E commented 4 months ago

We also encountered this random log-loss problem while using version 0.39.0; even when we explicitly listed /var/log/pods/xxxx/xxxx/0.log in include, the logs were still not read.