Here is the configuration used for the Vector agents:
customConfig:
  data_dir: "/vector-data-dir"
  sources:
    kubernetes:
      type: kubernetes_logs
      max_line_bytes: 131072
      delay_deletion_ms: 3600000
    internal_logs:
      type: internal_logs
    internal_metrics:
      type: internal_metrics
  transforms:
    parse_internal_logs:
      type: remap
      inputs:
        - internal_logs
      source: |-
        .cloud_provider.service = "vector"
        .host = "${HOSTNAME}"
  sinks:
    vector-aggregator:
      type: vector
      inputs:
        - kubernetes
        - parse_internal_logs
      address: vector-aggregator-haproxy.logging.svc:9080
      healthcheck:
        enabled: true
    prometheus-exporter:
      type: prometheus_exporter
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090
Sorry for the delay - @svenmueller, is Vector logging "Failed to annotate event with pod metadata" (or namespace metadata)?
@spencergilbert Both. Here is a Vector agent's log (redacted):
2021-11-22T22:27:12.745573Z WARN source{component_kind="source" component_id=kubernetes component_type=kubernetes_logs component_name=kubernetes}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Map({"file": Bytes(b"/var/log/pods/NAMESPACE_NAME_POD_NAME-74b48d875b-jb9qz_60894207-4c53-459f-9170-448a7626216e/POD_NAME/0.log"), "message": Bytes(b"{\"log\":\"{\\\"queue\\\":\\\"QUEUE_NAME\\\"},\\\"kv\\\":{\\\"queue\\\":\\\"QUEUE_NAME\\\",\\\"topic\\\":\\\"TOPIC_NAME\\\",\\\"streamId\\\":\\\"f6c6bbdd-8652-48d3-afc5-b1504423eead\\\",\\\"commitNr\\\":1,\\\"nrOfRetries\\\":0},\\\"thread\\\":\\\"THREAD_NAME\\\",\\\"timestamp\\\":\\\"2021-11-22T21:10:52.997Z\\\",\\\"logger\\\":\\\"LOGGER_NAME\\\",\\\"severity\\\":\\\"INFO\\\",\\\"message\\\":\\\"[QUEUE_NAME, topic=TOPIC_NAME, streamId=f6c6bbdd-8652-48d3-afc5-b1504423eead, commitNr=1, nrOfRetries=0] Processed in 32 milliseconds\\\"}\\n\",\"stream\":\"stdout\",\"time\":\"2021-11-22T21:10:52.997543148Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2021-11-22T22:27:12.745452449Z)}), metadata: EventMetadata { datadog_api_key: None, finalizers: EventFinalizers([]) } })
And:
2021-11-29T16:12:31.854590Z WARN source{component_kind="source" component_id=kubernetes component_type=kubernetes_logs component_name=kubernetes}: vector::internal_events::kubernetes_logs: Failed to annotate event with namespace metadata. event=Log(LogEvent { fields: Map({"file": Bytes(b"/var/log/pods/NAMESPACE_NAME_POD_NAME-6fcd9944c-ddtpz_fa7a420d-c9f0-428f-b39e-d7a5da7f81ac/nginx/0.log"), "kubernetes": Map({"container_id": Bytes(b"docker://8c9ec8710724df6b946d8c0ab43f96f9281eea7ff26787aeb923c6764e1fa953"), "container_image": Bytes(b"nginx:1.16.1"), "container_name": Bytes(b"nginx"), "pod_annotations": Map({"configChecksum": Bytes(b"0a7c51fd3bae693348190ce6e6fabb0ab801ad7890409b8197d803150afe2272"), "prometheus.io/path": Bytes(b"/metrics"), "prometheus.io/port": Bytes(b"8081"), "prometheus.io/scrape": Bytes(b"true")}), "pod_ip": Bytes(b"10.89.61.87"), "pod_ips": Array([Bytes(b"10.89.61.87")]), "pod_labels": Map({"app": Bytes(b"POD_NAME"), "chart": Bytes(b"CHART_NAME-1.0.8"), "pod-template-hash": Bytes(b"6fcd9944c"), "release": Bytes(b"NAMESPACE_NAME-CHART_NAME"), "version": Bytes(b"c20b86c48ce3e11007e026d84490fef2486faf6f")}), "pod_name": Bytes(b"POD_NAME-6fcd9944c-ddtpz"), "pod_namespace": Bytes(b"NAMESPACE_NAME"), "pod_node_name": Bytes(b"gke-nodepool-efd05fde-sp8k"), "pod_owner": Bytes(b"ReplicaSet/POD_NAME-6fcd9944c"), "pod_uid": Bytes(b"fa7a420d-c9f0-428f-b39e-d7a5da7f81ac")}), "message": Bytes(b"{\"log\":\"{\\\"@timestamp\\\":\\\"2021-11-29T16:12:31+00:00\\\",\\\"message\\\":\\\"\\\",\\\"http_bytes\\\":4461,\\\"http_req_len\\\":1375,\\\"http_config\\\":\\\"-\\\",\\\"http_method\\\":\\\"GET\\\",\\\"http_port\\\":8000,\\\"http_referrer\\\":\\\"-\\\",\\\"http_remote\\\":\\\"10.89.49.3\\\",\\\"http_status\\\":200,\\\"http_time_sec\\\":0.009,\\\"http_msec\\\":1638202351.178,\\\"http_uri\\\":\\\"/redacted-uri\\\",\\\"http_user\\\":\\\"-\\\",\\\"http_useragent\\\":\\\"sdk/0.1.0 Go/go1.16.10 (linux; amd64)\\\",\\\"http_x_fwd_for\\\":\\\"35.198.x.x, 34.95.x.x,130.211.x.x\\\",\\\"http_x_fwd_proto\\\":\\\"http\\\",\\\"http_content_length\\\":\\\"0\\\"}\\n\",\"stream\":\"stdout\",\"time\":\"2021-11-29T16:12:31.179027748Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2021-11-29T16:12:31.854464303Z)}), metadata: EventMetadata { datadog_api_key: None, finalizers: EventFinalizers([]) } })
This issue happens on a busy cluster with high traffic.
Are you able to check if these logs are coming from a pod or namespace that's been deleted?
Regarding namespaces, we don't delete any namespaces since this is a live cluster. As for the pod's status, I traced the one that appears in the log line above and confirmed that it was removed more than an hour before the warning, based on its last log timestamp (2021-11-22T21:11:18.528853581Z) vs Vector's timestamp (2021-11-22T22:27:12.745573Z).
The Vector agent was configured with delay_deletion_ms: 3600000 as shown here: https://github.com/vectordotdev/vector/issues/10122#issuecomment-977815309 . But it looks like I should increase the deletion delay beyond 1h.
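For illustration, a minimal sketch of what a larger deletion delay could look like in the agent's Helm values, assuming the same chart structure as the configuration above (the 2-hour value is an example, not a value taken from this thread):

customConfig:
  sources:
    kubernetes:
      type: kubernetes_logs
      max_line_bytes: 131072
      # Keep metadata for deleted pods cached for 2h instead of 1h, so log
      # files that are read late can still be annotated (value is illustrative).
      delay_deletion_ms: 7200000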
That definitely seems like a good next step, though it's troubling to me that it's taking so long to process these log files.
@spencergilbert Do you think this config (https://github.com/vectordotdev/vector/commit/9a3881fad639bae889c87052fc5318586b17d9bb) would help, besides increasing the delay_deletion_ms value? My theory is that in this high-traffic cluster, container logs are rotated very fast and Vector prioritises newer logs over older ones, so a pod's metadata cache could be evicted long before its logs are ingested.
🤔 that's definitely possible
@namm2 Curious, were you able to solve this? I've left delay_deletion_ms at its default value and I'm seeing the same errors you are (Failed to annotate event with namespace metadata). I am already running 0.19.
@mr-karan on top of what Nam mentioned, we got rid of those errors by lowering glob_minimum_cooldown_ms (the default value is 60s; we went down to 5s).
tl;dr: if you have Kubernetes nodes running pods that generate a high volume of logs which get rotated very fast, then Vector won't be able to catch all events without tuning the configuration.
I will close this issue as it is not relevant anymore.
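As a hedged sketch of the tuning summarised in the tl;dr above (same source as in the configuration at the top of the thread; the 5s cooldown is the value mentioned here, the 1h deletion delay is the original setting):

customConfig:
  sources:
    kubernetes:
      type: kubernetes_logs
      # Scan for new and rotated log files every 5s instead of the 60s default,
      # so fast-rotating container logs are picked up before they are removed.
      glob_minimum_cooldown_ms: 5000
      # Keep metadata for deleted pods cached for 1h so events read late
      # can still be annotated.
      delay_deletion_ms: 3600000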
Thanks @anas-aso, I've reduced glob_minimum_cooldown_ms to 1s.
In addition to that, I'm running AWS EKS, where the default docker log rotation is quite small (10MB per file, keeping the last 10 files). For a high-traffic cluster this may not be enough, so I've increased the docker log rotation limits as well.
Both of these changes should help 👍
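For context, a hedged example of the Docker-level rotation change described above; on nodes using the json-file logging driver this is typically set in /etc/docker/daemon.json and requires a docker restart (the values below are illustrative, not the exact ones used here):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "20"
  }
}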
Hi @anas-aso, I wasn't able to make kubernetes_logs work in GKE; more details here: https://github.com/vectordotdev/vector/discussions/20865
Hi @anas-aso / @mr-karan, is there anything special that you did to make kubernetes_logs work on GKE? Can you please suggest...
@madhur-df nothing special. The chart values are similar to the configuration shown above, in addition to the use of glob_minimum_cooldown_ms.
Vector Version
Vector Configuration File
Debug Output
Example of failing log event returned with the mentioned error (redacted)
Expected Behavior
The log events get enriched with Kubernetes metadata (container_name, pod_name, etc)
Actual Behavior
The log event is not enriched (annotated?): none of the .kubernetes_ fields exist and Vector keeps logging the error above.
Example Data
See debug output above
Additional Context
I have tried the nightly build from yesterday (nightly-2021-11-18-debian), set delay_deletion_ms to a high value (1 hour!), and also excluded one namespace that was running 1400+ pods. This reduced the number of failures a bit, but they didn't vanish completely.
I have checked many of the failing events, and they all seem valid; I expect them to at least have passed this function, so this condition should never be true for them.
References