vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

[kubernetes_logs] Failed to annotate event with pod metadata. #10122

Closed: anas-aso closed this issue 2 years ago

anas-aso commented 3 years ago

Vector Version

0.17.3

Vector Configuration File

data_dir: "/vector-data-dir"

sources:
  kubernetes:
    type: kubernetes_logs

transforms:
  parse_logs:
    type: remap
    inputs:
      - kubernetes
    source: |-
      # Do some parsing (mostly parse_json(...))

sinks:
  elasticsearch:
    type: elasticsearch
    inputs:
      - parse_logs
    endpoint: http://<ES_SERVICE_NAME>.<NAMESPACE_NAME>.svc:<PORT>
    encoding:
      timestamp_format: rfc3339

Debug Output

Example of a failing log event returned with the mentioned error (redacted):

Log(LogEvent { fields: Map({"file": Bytes(b"/var/log/pods/my-namespace_my-pod-7956799d95-rf6h7_37a73b39-ac9c-42ad-a286-c084cd8dc775/my-container/0.log"), "message": Bytes(b"some json log line"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2021-11-19T14:45:00.014194282Z)}), metadata: EventMetadata { datadog_api_key: None, finalizers: EventFinalizers([]) } })

Expected Behavior

The log events get enriched with Kubernetes metadata (container_name, pod_name, etc.).

Actual Behavior

The log event is not enriched (annotated?): none of the .kubernetes.* fields exist, and Vector keeps logging the error above.

Example Data

See debug output above

Additional Context

I have tried the nightly build from yesterday (nightly-2021-11-18-debian), set delay_deletion_ms to a high value (1 hour!), and also excluded one namespace that was running 1400+ pods. This reduced the number of failures a bit, but they didn't vanish completely.
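
For reference, a minimal sketch of that kind of tuning, assuming the kubernetes_logs options available in that nightly; the namespace name is a placeholder, the values are illustrative, and the exclusion mechanism shown here is just one way to express it:

sources:
  kubernetes:
    type: kubernetes_logs
    # Keep metadata for deleted pods cached much longer than the default,
    # so late log lines can still be annotated (1 hour here).
    delay_deletion_ms: 3600000
    # Skip log files from the very noisy namespace entirely
    # (pod log paths are /var/log/pods/<namespace>_<pod>_<uid>/...).
    exclude_paths_glob_patterns:
      - /var/log/pods/<NOISY_NAMESPACE>_*/**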

I have checked many of the failing events; they all seem valid, and I would expect them to at least pass this function, so this condition should never be true for them.

References

svenmueller commented 2 years ago

Here is the configuration used for the Vector agents:

customConfig:
  data_dir: "/vector-data-dir"
  sources:
    kubernetes:
      type: kubernetes_logs
      max_line_bytes: 131072
      delay_deletion_ms: 3600000 
    internal_logs:
      type: internal_logs
    internal_metrics:
      type: internal_metrics

  transforms:
    parse_internal_logs:
      type: remap
      inputs:
        - internal_logs
      source: |-
        .cloud_provider.service = "vector"
        .host = "${HOSTNAME}"
  sinks:
    vector-aggregator:
      type: vector
      inputs:
        - kubernetes
        - parse_internal_logs
      address: vector-aggregator-haproxy.logging.svc:9080
      healthcheck:
        enabled: true
    prometheus-exporter:
      type: prometheus_exporter
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090

spencergilbert commented 2 years ago

Sorry for the delay - @svenmueller, is Vector logging Failed to annotate event with pod metadata (or namespace metadata)?

namm2 commented 2 years ago

@spencergilbert Both. Here is a Vector agent's log, redacted:

2021-11-22T22:27:12.745573Z WARN source{component_kind="source" component_id=kubernetes component_type=kubernetes_logs component_name=kubernetes}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Map({"file": Bytes(b"/var/log/pods/NAMESPACE_NAME_POD_NAME-74b48d875b-jb9qz_60894207-4c53-459f-9170-448a7626216e/POD_NAME/0.log"), "message": Bytes(b"{\"log\":\"{\\\"queue\\\":\\\"QUEUE_NAME\\\"},\\\"kv\\\":{\\\"queue\\\":\\\"QUEUE_NAME\\\",\\\"topic\\\":\\\"TOPIC_NAME\\\",\\\"streamId\\\":\\\"f6c6bbdd-8652-48d3-afc5-b1504423eead\\\",\\\"commitNr\\\":1,\\\"nrOfRetries\\\":0},\\\"thread\\\":\\\"THREAD_NAME\\\",\\\"timestamp\\\":\\\"2021-11-22T21:10:52.997Z\\\",\\\"logger\\\":\\\"LOGGER_NAME\\\",\\\"severity\\\":\\\"INFO\\\",\\\"message\\\":\\\"[QUEUE_NAME, topic=TOPIC_NAME, streamId=f6c6bbdd-8652-48d3-afc5-b1504423eead, commitNr=1, nrOfRetries=0] Processed in 32 milliseconds\\\"}\\n\",\"stream\":\"stdout\",\"time\":\"2021-11-22T21:10:52.997543148Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2021-11-22T22:27:12.745452449Z)}), metadata: EventMetadata { datadog_api_key: None, finalizers: EventFinalizers([]) } })

And: 2021-11-29T16:12:31.854590Z WARN source{component_kind="source" component_id=kubernetes component_type=kubernetes_logs component_name=kubernetes}: vector::internal_events::kubernetes_logs: Failed to annotate event with namespace metadata. event=Log(LogEvent { fields: Map({"file": Bytes(b"/var/log/pods/NAMESPACE_NAME_POD_NAME-6fcd9944c-ddtpz_fa7a420d-c9f0-428f-b39e-d7a5da7f81ac/nginx/0.log"), "kubernetes": Map({"container_id": Bytes(b"docker://8c9ec8710724df6b946d8c0ab43f96f9281eea7ff26787aeb923c6764e1fa953"), "container_image": Bytes(b"nginx:1.16.1"), "container_name": Bytes(b"nginx"), "pod_annotations": Map({"configChecksum": Bytes(b"0a7c51fd3bae693348190ce6e6fabb0ab801ad7890409b8197d803150afe2272"), "prometheus.io/path": Bytes(b"/metrics"), "prometheus.io/port": Bytes(b"8081"), "prometheus.io/scrape": Bytes(b"true")}), "pod_ip": Bytes(b"10.89.61.87"), "pod_ips": Array([Bytes(b"10.89.61.87")]), "pod_labels": Map({"app": Bytes(b"POD_NAME"), "chart": Bytes(b"CHART_NAME-1.0.8"), "pod-template-hash": Bytes(b"6fcd9944c"), "release": Bytes(b"NAMESPACE_NAME-CHART_NAME"), "version": Bytes(b"c20b86c48ce3e11007e026d84490fef2486faf6f")}), "pod_name": Bytes(b"POD_NAME-6fcd9944c-ddtpz"), "pod_namespace": Bytes(b"NAMESPACE_NAME"), "pod_node_name": Bytes(b"gke-nodepool-efd05fde-sp8k"), "pod_owner": Bytes(b"ReplicaSet/POD_NAME-6fcd9944c"), "pod_uid": Bytes(b"fa7a420d-c9f0-428f-b39e-d7a5da7f81ac")}), "message": Bytes(b"{\"log\":\"{\\\"@timestamp\\\":\\\"2021-11-29T16:12:31+00:00\\\",\\\"message\\\":\\\"\\\",\\\"http_bytes\\\":4461,\\\"http_req_len\\\":1375,\\\"http_config\\\":\\\"-\\\",\\\"http_method\\\":\\\"GET\\\",\\\"http_port\\\":8000,\\\"http_referrer\\\":\\\"-\\\",\\\"http_remote\\\":\\\"10.89.49.3\\\",\\\"http_status\\\":200,\\\"http_time_sec\\\":0.009,\\\"http_msec\\\":1638202351.178,\\\"http_uri\\\":\\\"/redacted-uri\\\",\\\"http_user\\\":\\\"-\\\",\\\"http_useragent\\\":\\\"sdk/0.1.0 Go/go1.16.10 (linux; amd64)\\\",\\\"http_x_fwd_for\\\":\\\"35.198.x.x, 34.95.x.x,130.211.x.x\\\",\\\"http_x_fwd_proto\\\":\\\"http\\\",\\\"http_content_length\\\":\\\"0\\\"}\\n\",\"stream\":\"stdout\",\"time\":\"2021-11-29T16:12:31.179027748Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2021-11-29T16:12:31.854464303Z)}), metadata: EventMetadata { datadog_api_key: None, finalizers: EventFinalizers([]) } })

This issue happens on a busy cluster with high traffic.

spencergilbert commented 2 years ago

Are you able to check if these logs are coming from a pod or namespace that's been deleted?

namm2 commented 2 years ago

Are you able to check if these logs are coming from a pod or namespace that's been deleted?

Regarding namespaces, we don't delete any namespaces because it's a live cluster. As for the pod's status, I traced the one that appears in the log line above and confirmed it was removed more than an hour earlier, based on its last timestamp (2021-11-22T21:11:18.528853581Z) vs. Vector's timestamp (2021-11-22T22:27:12.745573Z).

The Vector agent was configured with delay_deletion_ms: 3600000 as shown here: https://github.com/vectordotdev/vector/issues/10122#issuecomment-977815309. But it looks like I should increase the deletion delay beyond just 1h.

spencergilbert commented 2 years ago

That definitely seems like a good next step, though it's troubling to me that it's taking so long to process these log files.

namm2 commented 2 years ago

@spencergilbert Do you think this config (https://github.com/vectordotdev/vector/commit/9a3881fad639bae889c87052fc5318586b17d9bb) would help, besides increasing the delay_deletion_ms value? My theory is that in this high-traffic cluster, container logs are rotated very quickly and Vector prioritises the newest logs over older ones, so the pod's metadata could have been evicted from the cache long before its logs were ingested.
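
Not something from this thread, but a hypothetical way to test that theory would be to tag events that arrive without pod metadata in a remap transform and count them downstream (the transform name and flag field here are made up):

transforms:
  tag_unannotated:
    type: remap
    inputs:
      - kubernetes
    source: |-
      # Flag events that were not annotated with pod metadata so they
      # can be counted or routed separately.
      if !exists(.kubernetes.pod_name) {
        .kubernetes_annotation_missing = true
      }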

spencergilbert commented 2 years ago

🤔 that's definitely possible

mr-karan commented 2 years ago

@namm2 Curious, were you able to solve this? I've left delay_deletion_ms at its default value and I'm seeing the same errors you describe (Failed to annotate event with namespace metadata.). I'm already running 0.19.

anas-aso commented 2 years ago

@mr-karan on top of what Nam mentioned, we got rid of those errors by lowering glob_minimum_cooldown_ms (the default value is 60s; we went down to 5s).

tl;dr: if you have Kubernetes nodes running pods that generate a high volume of logs whose files get rotated very quickly, then Vector won't be able to catch all events without tuning the configuration.
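
Putting the tuning from this thread into a sketch (values are the ones mentioned above; adjust for your own cluster):

sources:
  kubernetes:
    type: kubernetes_logs
    # Scan for new and rotated log files every 5s instead of the 60s
    # default, so fast-rotating files are picked up before they disappear.
    glob_minimum_cooldown_ms: 5000
    # Keep pod metadata cached for 1 hour after pod deletion.
    delay_deletion_ms: 3600000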

I will close this issue as it is not relevant anymore.

mr-karan commented 2 years ago

Thanks @anas-aso, I've reduced glob_minimum_cooldown_ms to 1s.

In addition to that, I'm running AWS EKS, where the default Docker log rotation settings are quite small: 10MB per file and only the last 10 files are kept. For a high-traffic cluster this may not be enough, so I've increased the Docker log rotation limits as well.

Both of these should help 👍
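
For reference, on nodes where Docker is the container runtime, those rotation limits live in the daemon configuration (typically /etc/docker/daemon.json); a sketch with illustrative values, not a recommendation:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "20"
  }
}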

madhur-df commented 3 months ago

Hi @anas-aso

I wasn't able to make kubernetes_logs work in GKE

more details here: https://github.com/vectordotdev/vector/discussions/20865

madhur-df commented 3 months ago

Hi @anas-aso / @mr-karan

Is there anything special that you did to make kubernetes_logs work on GKE?

Could you please advise?

anas-aso commented 3 months ago

@madhur-df Nothing special. The chart values are similar to this, in addition to the use of glob_minimum_cooldown_ms (see the sketch below).
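
Expressed as Helm chart values (mirroring the customConfig block earlier in this thread), that looks roughly like the following; the 5s value is the one mentioned above:

customConfig:
  data_dir: "/vector-data-dir"
  sources:
    kubernetes:
      type: kubernetes_logs
      # Pick up new/rotated log files more frequently than the 60s default.
      glob_minimum_cooldown_ms: 5000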