vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Vector randomly stops shipping certain k8s logs #12014

Open danthegoodman1 opened 2 years ago

danthegoodman1 commented 2 years ago

Problem

After a while, our Vector daemonset randomly stops shipping logs for a select service (other pods keep shipping logs).

Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}

Version

timberio/vector:0.20.0-distroless-libc

Debug Output

Working on trying to get relevant debug output; Vector currently emits a lot of TRACE logs :P

Example Data

{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}

Additional Context

Only logs from vector:

2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has
 been rate limited 10 times.
 2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function cal
 l error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
 2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is b
 eing rate limited.
 2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has
 been rate limited 8 times.
 2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function cal
 l error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
 2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is b
 eing rate limited.

References

Similar to an issue I've had in the past: https://github.com/vectordotdev/vector/discussions/8634

danthegoodman1 commented 2 years ago

I will add that restarting Vector seems to solve the issue, but that's not a sustainable solution.

danthegoodman1 commented 2 years ago

It's also hard to get debug logs because this is not something I can reproduce quickly; it's usually something we notice after 3-10 days, when we go to look at logs and there are none.

jszwedko commented 2 years ago

Hi @danthegoodman1 !

It seems like some of the incoming events are not JSON. I might suggest augmenting your remap program to log the events that fail to parse as JSON, like this:

.temp, err  = parse_json(.message)
if err != null {
  log("failed to parse .message as JSON: " + .message)
} else {
  // message successfully parsed as JSON
}
jszwedko commented 2 years ago

As an aside, you are currently parsing .message as JSON twice. I'd recommend changing:

if !exists(parse_json!(.message).level) {

to:

if !exists(.temp.level) {
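
Putting the two suggestions together, a sketch of what the start of the remap program could look like (illustrative only; it reuses the fields and functions already shown in this thread):

```
.temp, err = parse_json(.message)
if err != null {
  # not JSON; tag it so the event still ships with a level label
  .level = "other"
} else {
  if exists(.temp.level) {
    .level = .temp.level
  } else {
    .level = "other"
  }
  del(.temp)
}
```
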
danthegoodman1 commented 2 years ago

Thanks for the quick response @jszwedko. Very few of our logs are not JSON (some panics, some startup messages), but the overwhelming majority are JSON logs, and those are the ones not coming through.

Will add the suggestions in the meantime

jszwedko commented 2 years ago

@danthegoodman1 ah, I see. The log messages you shared above just indicate JSON parsing failures. Are there more error logs?

danthegoodman1 commented 2 years ago

No, that is what is so strange @jszwedko; it makes it really hard to debug 🙃

danthegoodman1 commented 2 years ago

Also getting this; trying to work out why:

 error[E630]: fallible argument
   ┌─ :3:7
   │
 3 │   log("failed to parse .message as JSON: " + .message)
   │       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   │       │
   │       this expression can fail
   │       handle the error before passing it in as an argument
   │
   = see documentation about error handling at https://errors.vrl.dev/#handling
   = see language documentation at https://vrl.dev

Will update with the solution when I find it

danthegoodman1 commented 2 years ago

Updated to

log("failed to parse .message as JSON: " + .message ?? "no mesage!")
danthegoodman1 commented 2 years ago

Here are some updated logs; it seems some of the non-JSON logs come from the pods we don't control in EKS :P

 2022-03-29T18:27:59.052628Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T
 18:27:49.577Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:27:49 [DEBUG] memberlist: Initiating push/pull sync with: 01FYC8DZXMP6ENGHGKM4PBK61Q 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl
 _position=53
 2022-03-29T18:27:59.052677Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
 2022-03-29T18:28:12.427827Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
 2022-03-29T18:28:12.427844Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:02 http:
 TLS handshake error from 10.0.167.22:11247: EOF" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:12.427861Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
 2022-03-29T18:28:17.165804Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
 2022-03-29T18:28:17.165830Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:17 http:
 TLS handshake error from 10.0.167.22:60721: EOF" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:19.164409Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T
 18:28:15.586Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:15 [DEBUG] memberlist: Stream connection from=10.0.167.205:34340\\n\"" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:19.164463Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
 2022-03-29T18:28:27.218994Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
 2022-03-29T18:28:27.219014Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:22 http:
 TLS handshake error from 10.0.167.22:27699: EOF" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:37.171126Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:27 http:
 TLS handshake error from 10.0.167.22:34561: EOF" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:37.171168Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
 2022-03-29T18:28:38.940324Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new
  file to watch. file=/var/log/pods/eth-scanner_eth-scanner-27476308-hbml9_05116106-4a27-49a2-ba6e-c811190e304a/eth-scanner/0.log
 2022-03-29T18:28:39.221000Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped w
 atching file. file=/var/log/pods/eth-scanner_eth-scanner-27476306-gkzmk_0a02fc81-f83a-4d3a-925b-37febe1dab62/eth-scanner/0.log
 2022-03-29T18:28:47.167084Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
 2022-03-29T18:28:47.167106Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:47 http:
 TLS handshake error from 10.0.167.22:48077: EOF" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:47.234267Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
 2022-03-29T18:28:52.426194Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
 2022-03-29T18:28:52.426218Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:52 http:
 TLS handshake error from 10.0.167.22:62639: EOF" internal_log_rate_secs=1 vrl_position=53
 2022-03-29T18:28:59.187233Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T
 18:28:49.579Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:49 [DEBUG] memberlist: Initiating push/pull sync with: xxx 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl
 _position=53
 2022-03-29T18:28:59.187428Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:
 7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
danthegoodman1 commented 2 years ago

This has just started happening again, with no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.

danthegoodman1 commented 2 years ago

@jszwedko one thing I am noticing is that we don't get the logs about Vector finding new files. This is easier to reproduce with things that come and go relatively quickly, like cron jobs (we leave the completed pod around so logs can be collected).

Hades32 commented 2 years ago

One thing to add: we checked the mounted /var/logs and the files with the missing logs are there; it seems like the event for the new files doesn't get triggered. We're running on AWS EKS and /proc/sys/fs/inotify/max_user_watches is 524288.

wtchangdm commented 2 years ago

@danthegoodman1 If new logs were not "found" in time, maybe try to reduce glob_minimum_cooldown_ms?

It's 60000 ms in kubernetes_logs source, but 1000 ms in file source.
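
For example, a sketch of what that could look like on the source (values are illustrative, not a recommendation):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # default is 60000 for kubernetes_logs; lowering it means quickly rotated
    # or short-lived pod log files are discovered sooner
    glob_minimum_cooldown_ms: 1000
```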

danthegoodman1 commented 2 years ago

@wtchangdm will try, but it's the case for long-running pods too, where it just stops collecting logs for those pods. Those files definitely stay around longer than 60s.

wtchangdm commented 2 years ago

@danthegoodman1 BTW, there is an issue I am also tracking: https://github.com/vectordotdev/vector/issues/8616. Not sure if it's related, though.

danthegoodman1 commented 2 years ago

Interesting, but we don't get any warnings or errors like that issue does.

jszwedko commented 2 years ago

@danthegoodman1 could you try with the latest release, 0.21.1? The kubernetes_logs source was rewritten there to use the community-supported kube-rs crate to interact with Kubernetes, which we expect may have fixed a few issues.

Though, in this case, it does seem like the issue is with file discovery rather than metadata enrichment.

danthegoodman1 commented 2 years ago

@jszwedko I will say that so far we haven't noticed any issues since we dropped the glob cooldown to 8000 ms, but admittedly we also have not been in our logs as much.

Will update now and see how it works

danthegoodman1 commented 2 years ago

@jszwedko This went very not well, was printing this extremely quickly:

 2022-04-25T18:40:35.870916Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbid
 den: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
 2022-04-25T18:40:35.871396Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is
 forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
 2022-04-25T18:40:35.872182Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbid
 den: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
 2022-04-25T18:40:35.873367Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is
 forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
 2022-04-25T18:40:35.874306Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbid
 den: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
danthegoodman1 commented 2 years ago

~~Ah ok I needed to update the cluster role tags too :)~~ still getting it...

spencergilbert commented 2 years ago

@jszwedko This went very not well, was printing this extremely quickly:


The ClusterRole needs to allow the list verb as well now; see here.
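
For reference, a minimal RBAC sketch along those lines; the resources are taken from the Forbidden errors above, and the role name (and any extra resources such as nodes) depends on your deployment:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vector
rules:
  - apiGroups: [""]
    # the watcher needs both list and watch on these resources
    resources: ["pods", "namespaces"]
    verbs: ["list", "watch"]
```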

danthegoodman1 commented 2 years ago

yep, just made that change :)

danthegoodman1 commented 2 years ago

After sending ~50 logs/s I can see the following errors appear, then logs stop shipping. It prints a ton of these error logs.

It is able to ship them for about 30 seconds before it stops sending logs.

2022-04-26T13:21:29.462788Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_log
 s}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/po
 ds/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreSc
 annerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Writing slot 131412285 for consumer solanascanner-1\\\",\\\"time\\\":\\\"2022-04-26T13:21:
 29.450Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.45062706Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(20
 22-04-26T13:21:29.462766592Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definiti
 on: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_f
 ailed" stage="processing"
 2022-04-26T13:21:29.497575Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_log
 s}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/po
 ds/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreSc
 annerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Wrote offset in 35.128858ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.485Z\\\"}\\n\",\"stre
 am\":\"stderr\",\"time\":\"2022-04-26T13:21:29.485822529Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.49755
 5692Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collec
 tion: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processin
 g"
 2022-04-26T13:21:29.505901Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_log
 s}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/po
 ds/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlo
 ck\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got slots in 167.388115ms got slots [131412285], len 1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\
 "}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502718488Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T
 13:21:29.505885264Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Defin
 ition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" st
 age="processing"
 2022-04-26T13:21:29.505972Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_log
 s}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/po
 ds/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlo
 ck\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got same top slot 131412285, sleeping for 250ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\"
 ,\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502731568Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:2
 9.505966965Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition {
  collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="pr
 ocessing"
 2022-04-26T13:21:29.576577Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_log
 s}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3e
 f46c17c9cd/solanascanner/0.log
danthegoodman1 commented 2 years ago

It does not affect watching other pods, and CPU never exceeds ~20 mCPU and memory never exceeds 33 MB for the Vector pods.

danthegoodman1 commented 2 years ago

Seems related to https://github.com/vectordotdev/vector/issues/10122#issuecomment-1026006657

spencergilbert commented 2 years ago

Seems related to #10122 (comment)

Possibly. The other explanation would be that solanascanner was deleted from the cluster before all of its log lines were processed. When that happens, delay_deletion_ms comes into play, as it determines how long we hold onto old metadata before dropping it ourselves. Increasing that value will increase memory usage (by however much it takes to hold onto the resource metadata), but allows more time to enrich logs from deleted pods.
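
For example, a sketch of that option on the source (the value is illustrative; the trade-off is memory for a longer enrichment window):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # keep metadata for deleted pods for an hour so their remaining
    # log lines can still be annotated (value in milliseconds)
    delay_deletion_ms: 3600000
```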

danthegoodman1 commented 2 years ago

It seems reducing the glob cooldown to 3s is working at this rate. I also reduced the number of logs printed to reduce the frequency of log rotation.

danthegoodman1 commented 2 years ago

Still getting the errors, and the pods are not being removed, logs are coming in much slower now as well

spencergilbert commented 2 years ago

Still getting the errors, and the pods are not being removed, logs are coming in much slower now as well

Can you share the setup you're using so I can reproduce?

danthegoodman1 commented 2 years ago

Dropped the glob cooldown to 2s and it seems to be holding steady so far. Here is the setup; otherwise it is the default k8s deployment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.21.1-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
        glob_minimum_cooldown_ms: 2000
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp, err = parse_json(.message)
          if err != null {
            .json_parse_err = err
          } else {
            if !exists(.temp.level) {
              .level = "other"
            } else {
              .level = .temp.level
            }
            if exists(.temp.uri) {
              .uri = .temp.uri
            }
            if exists(.temp.msg) {
              .msg = .temp.msg
            }
            if exists(.temp.lat) {
              .lat = .temp.lat
              .lon = .temp.lon
            }
            del(.temp)
          }
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}
spencergilbert commented 2 years ago

@danthegoodman1 are you running something like flog to generate load/logs or is this with your actual applications?

danthegoodman1 commented 2 years ago

actual app

treethought commented 2 years ago

I am currently experiencing the same issue. We had previously downgraded vector due to the Desync errors described in https://github.com/vectordotdev/vector/issues/8616.

After upgrading to 0.21.0 and then 0.21.1, it seems the Desync errors have completely gone away. However, after a period of time, logs for certain workloads silently stop getting picked up and we see a large increase in vector_component_errors_total for the kubernetes_logs source, with the majority of the errors being reader_failed.

Performing a rolling restart of the vector daemonset always seems to restore things to a better state (indicated by the gaps in the plot below); however, after a while (about once a day or so) the errors return, with no real indication aside from the internal metrics.

Screenshot from 2022-05-09 11-07-34

We have a pretty simple config for the source, and have incorporated the suggestions about glob_minimum_cooldown_ms and delay_deletion_ms described in various issues.

    kubernetes_logs:
      auto_partial_merge: true
      fingerprint_lines: 1
      # set to lower value than default 60s
      # to better handle high log volume / rotation
      # see https://github.com/vectordotdev/vector/issues/10122#issuecomment-977815309
      glob_minimum_cooldown_ms: 5000 # 5 seconds

      # increase delay deletion to allow vector
      # to continue processing logs for deleted/scaled down pods
      delay_deletion_ms: 3600000
      self_node_name: ${VECTOR_SELF_NODE_NAME}
      type: kubernetes_logs
jszwedko commented 2 years ago

@treethought would you be able to share the logs associated with those errors? They could provide additional clues.

aikoven commented 2 years ago

We're experiencing the same problem. The time when the logs stop shipping correlates with annotation_failed errors:

2022-05-30T04:31:09.092406Z ERROR source{component_kind="source" component_id=kubernetes_logs
component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs:
Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file":
Bytes(b"/var/log/pods/*redacted*/0.log"), "message": Bytes(b"2022-05-30T11:31:09.04624439+07:00 stderr F
*redacted*"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-05-30T04:31:09.092358889Z)}),
metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]),
schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } })
error_code="annotation_failed" error_type="reader_failed" stage="processing"
spencergilbert commented 2 years ago

We're experiencing the same problem. The time when the logs stop shipping correlates with annotation_failed errors:


What is your delay_deletion_ms currently configured to?

aikoven commented 2 years ago

What is your delay_deletion_ms currently configured to?

The default value.

spencergilbert commented 2 years ago

What is your delay_deletion_ms currently configured to?

The default value.

Has the pod in the error log been deleted/terminated from the cluster? If so you'll want to increase the delay_deletion_ms to account for processing of logs that may remain after the pod had been terminated.

aikoven commented 2 years ago

Has the pod in the error log been deleted/terminated from the cluster? If so you'll want to increase the delay_deletion_ms to account for processing of logs that may remain after the pod had been terminated.

It hasn't. The pod was alive for at least an hour after these errors, and no logs were received after that. Everything caught up after a Vector restart.

The same Vector agent continued to work fine with other pods.

Is there any sort of additional telemetry we could enable to help debug this? We're encountering this problem every day.

In the meantime we could probably introduce a liveness probe triggered by the error count metric.

spencergilbert commented 2 years ago

Is there any sort of additional telemetry we could enable to help debug this? We're encountering this problem every day.

Definitely seems like we need to; opened https://github.com/vectordotdev/vector/issues/12909

treethought commented 2 years ago

@jszwedko sorry for the delayed response. I don't have any logs associated with the errors at this time, but would like to add that my experience aligns with @aikoven's description. This typically occurs only for a handful of actively running pods, while others continue to have logs picked up fine.

imcheck commented 2 years ago

Hi, I faced the same issue with the kubernetes_logs source. My guess is that the Vector agent never releases the file descriptors when log files are deleted. My workaround was to restart the agents and aggregators.

imcheck commented 2 years ago

Does anyone know the answer to this problem? Or is there an option in the kubernetes_logs source like remove_after_secs in the file source? I think this is related to #8101.

Genry1 commented 2 years ago

Hi, I've encountered a similar issue (vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata.), and the root cause seemed to be the pod name, which was the same as that of a previous pod that had been recreated. It looks like Vector deletes the old pod's entry from the metadata cache when the timeout hits, but since it has the same name as the newly created pod, the new pod no longer has an entry in the metadata cache and annotation fails with that error. I think a retry of the pod metadata retrieval is needed when it's missing from the cache.

kitos9112 commented 2 years ago

Similar sentiment after reading @imcheck's and @Genry1's comments. In my case it happens with a bunch of StatefulSets every time there's a rolling upgrade, and of course the pod name is reused. It might be related to the lock on the file descriptor opened by the vector-agent container at the host level. I can't think of any workaround other than deleting the agent pod on the node whenever I spot this happening.

lpugoy commented 2 years ago

We also see this issue if a pod is recreated on the same node with the same name. It's pretty much a showstopper for us.

jszwedko commented 2 years ago

Thanks for the additional context @kitos9112 @Genry1 and @lpugoy ! I split this off to a separate issue since it seems to be different from the original report here. See: https://github.com/vectordotdev/vector/issues/13467

igor-nikiforov commented 2 years ago

@jszwedko We also faced this issue after we started rolling out Vector to production. We identified that the safest version to use with the kubernetes_logs source currently is 0.19.3. If we upgrade to 0.2x.x we immediately start facing two issues:

With the newer versions (0.20.x), the only working solution right now is to restart the Vector pods after one of these issues starts happening; this helps in both cases. Both issues make Vector completely unusable in production environments with kubernetes_logs, since there is a big risk of losing logs. I really hope the root cause of these issues will be identified and fixed in upcoming releases.

zamazan4ik commented 2 years ago

@jszwedko are there any updates regarding this issue? Not receiving logs from k8s is a critical issue for us.