danthegoodman1 opened this issue 2 years ago
@zamazan4ik fluentbit is a good alternative
I know of a lot of log forwarders (fluentbit, rsyslog, and others), but I am especially interested in Vector :)
Thanks for all of the thoughts everyone. I can appreciate the severity of these issues and believe this area (the kubernetes_logs
source) is something we'll be digging into again soon.
There seems to be a mix of issues reported in the comments above. I think they can be categorized into:
Vector simply stops watching new pod logs after a short period of time. This seems to be random and does not affect all pods, but it usually takes around 2-10 minutes to start happening.
- From: https://github.com/vectordotdev/vector/issues/12014#issuecomment-1216723559
- I don't see anything in particular that jumps out from 0.19.3 -> 0.20.X that seems related to the kubernetes_logs source. One thing you could try to do there, @igor-nikiforov, is bisect down the nightly versions of Vector to identify the first one that manifests the issue. For v0.19.3 to v0.20.0, this would be from 2021-12-28 to 2022-02-10. If we had a narrower set of commits to look at, something might jump out.
- In the meantime, two settings worth tuning are glob_minimum_cooldown_ms, by dropping it down so that Vector scans for new files more frequently, and max_read_bytes, by increasing it so that Vector reads more from each file before continuing on to the next file (a sketch of what that tuning might look like is included below).

If you don't think your issue is represented in the above set, please leave an additional comment!
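A minimal sketch of that tuning in a kubernetes_logs source config; the specific values here are illustrative assumptions, not recommendations:

sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Scan for new log files more often than the default.
    glob_minimum_cooldown_ms: 5000
    # Read more from each file before moving on to the next one.
    max_read_bytes: 65536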
@jszwedko we have our glob cooldown set to 2 seconds and have still observed it. Ultimately we have to move to something that doesn't drop logs, because we depend on logs to know when errors occur.
I can't imagine that k8s is not a massive part of Vector's user base. We aren't logging very quickly either, maybe 10-20/s.
What would need to be done on our end to get more urgency behind improving the Kubernetes experience? I truly want to use Vector but can't.
I'm in the same boat as @danthegoodman1. We're currently using Vector in most of our K8s clusters and I love what it brings to the table, but the lack of attention and priority to this specific issue is concerning. What I've noticed is this bug seems to affect:
We're likely going to switch over to something else like fluent-bit for log collection until this issue is resolved.
I was able to find some settings that are now giving me a much more reliable deployment on Kubernetes after doing the following:
- Update Vector to version 0.25.X. It doesn't seem to be mentioned in the release notes, but release 0.25.0 contains a very impactful fix for the kubernetes_logs source to help with annotations (relevant issue: https://github.com/vectordotdev/vector/issues/13467).
- Add these settings to your kubernetes_logs source config:
max_line_bytes: 16777216
max_read_bytes: 16777216
glob_minimum_cooldown_ms: 1000
The combination of these two changes has led to zero dropped/missing logs over the past couple of weeks. Previously I was using lower max_line_bytes and max_read_bytes values (as I've seen recommended above and elsewhere, though still much higher than the defaults), and after so many attempts with little to no change I stopped increasing them. After revisiting it and trying something wildly large, to my surprise it worked!
I was still receiving annotation failures though. After deep diving into the commit history I saw a seemingly unannounced workaround was in place for https://github.com/vectordotdev/vector/issues/13467 in the latest version, and since upgrading haven't seen any annotation issues.
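If you deploy with the official Helm chart, the equivalent change might look roughly like the sketch below (assuming the chart's image and customConfig values keys; adapt to your own setup):

image:
  repository: timberio/vector
  tag: 0.25.1-distroless-libc
customConfig:
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      max_line_bytes: 16777216
      max_read_bytes: 16777216
      glob_minimum_cooldown_ms: 1000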
Wow, that's good to know. I hope to see comments from the Vector dev team about possible fixes for these issues.
Great! For us, dropping the glob cooldown really did it.
Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.
In my case (145 nodes in the cluster), where I haven't yet faced the described issue (I copied this config snippet from another cluster where such an issue existed), removing the glob_minimum_cooldown_ms: 1000 option allowed me to decrease total CPU usage (summed over all Vector pods) from 117 to 2 CPUs (avg per pod: 0.4 CPU -> 0.01 CPU).
I am still checking that removing this parameter hasn't led to missing logs (because such a big change in resource consumption seems wrong to me).
Just realised that glob_minimum_cooldown_ms: 1000 could dramatically increase CPU resource usage.
@dm3ch I noticed this on my setup too, but in my scenario I think the "increased" CPU usage is just a result of the service actually processing logs :) Removing this setting results in very low CPU usage for me as well, but I was receiving only about 10% of the logs I should have been.
With glob_minimum_cooldown_ms: 1000 and the higher CPU usage, I find that my resource usage more closely reflects the resource and capacity planning expectations laid out here, so I feel it's working as expected with this setting for my situation, though yours may be different from mine.
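As a rough illustration only, with placeholder numbers rather than figures taken from the capacity planning docs, the trade-off shows up as extra CPU headroom on the agent DaemonSet, e.g.:

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    # Allow the agent to burst while scanning and reading files frequently.
    cpu: "1"
    memory: 512Mi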
@CharlieC3 Which version of Vector are you using?
@dm3ch I haven't needed to upgrade since my last comment, so I'm still running the Docker image timberio/vector:0.25.1-distroless-libc.
sharing my results before and after adding:
max_line_bytes: 16777216
max_read_bytes: 16777216
glob_minimum_cooldown_ms: 1000
to my kubernetes_logs configuration.
Instead of ingesting bytes in 1-minute bursts, it now seems to funnel the logs at a steady pace. I'll defer to people more knowledgeable to interpret why this happens.
Still seeing these errors with the above config changes:
2024-01-30T07:39:33.600554Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.001557068Z stdout F \"ID\" : \"2247389355\","), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(225))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:33.600617Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
2024-01-30T07:39:43.693441Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] has been suppressed 19999 times.
2024-01-30T07:39:43.693455Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { inner: Inner { fields: Object({KeyString("file"): Bytes(b"/var/log/pods/revenue-reporting_pinot-standby-server-22_cc651b75-805d-4120-b5f2-98dd1cf89b11/server/0.log"), KeyString("message"): Bytes(b"2024-01-29T17:38:10.097262663Z stdout F \t... 7 more"), KeyString("source_type"): Bytes(b"kubernetes_logs")}), size_cache: AtomicCell { value: None }, json_encoded_size_cache: AtomicCell { value: Some(NonZeroJsonSize(JsonSize(212))) } }, metadata: EventMetadata { value: Object({}), secrets: {}, finalizers: EventFinalizers([]), source_id: None, source_type: None, upstream_id: None, schema_definition: Definition { event_kind: Kind { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), undefined: Some(()), array: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }), object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, metadata_kind: Kind { bytes: None, integer: None, float: None, boolean: None, timestamp: None, regex: None, null: None, undefined: None, array: None, object: Some(Collection { known: {}, unknown: Unknown(Infinite(Infinite { bytes: Some(()), integer: Some(()), float: Some(()), boolean: Some(()), timestamp: Some(()), regex: Some(()), null: Some(()), array: Some(()), object: Some(()) })) }) }, meaning: {}, log_namespaces: {Vector, Legacy} }, dropped_fields: {}, datadog_origin_metadata: None } }) error_code="annotation_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-01-30T07:39:43.693518Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes_logs: Internal log [Failed to annotate event with pod metadata.] is being suppressed to avoid flooding.
kubernetes_logs:
type: kubernetes_logs
timezone: local
max_line_bytes: 16777216
max_read_bytes: 16777216
glob_minimum_cooldown_ms: 1000
Any workaround would really help... we are using 0.35.0.
I can appreciate the severity of these issues and believe this area (the kubernetes_logs source) is something we'll be digging into again soon.
2 years ago... Doesn't feel remotely appreciated. I'm wondering if the Datadog acquisition is to blame.
@sumeet-zuora it seems like you have some other error that's causing logs to get held up.
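If the annotation failures line up with pod churn (pods being deleted or replaced while their logs are still being read), one setting that might be worth experimenting with is delay_deletion_ms, which controls how long pod metadata stays cached after a deletion; the value below is an arbitrary example, and you should confirm the option is available in your Vector version:

kubernetes_logs:
  type: kubernetes_logs
  timezone: local
  max_line_bytes: 16777216
  max_read_bytes: 16777216
  glob_minimum_cooldown_ms: 1000
  # Keep pod metadata around longer after a pod is deleted so late-read
  # lines can still be annotated (example value only).
  delay_deletion_ms: 180000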
This is still an issue
We also encountered this random log loss problem when using version 0.39.0; even when we specified /var/log/pods/xxxx/xxxx/0.log in include, Vector still did not read the logs.
Problem
After a while, our Vector DaemonSet will randomly stop shipping logs for a select service (some other pods will keep shipping logs).
Configuration
Example Data
Additional Context
Only logs from vector:
References
Similar to an issue I've had in the past: https://github.com/vectordotdev/vector/discussions/8634