vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Hitting "failed to decode record immediately after encoding it" frequently #20212

Open PaulFurtado opened 5 months ago

PaulFurtado commented 5 months ago

### Problem

On our Vector configs that handle Kubernetes audit log entries, we are frequently hitting this "failed to decode record immediately after encoding it" error, which comes from here: https://github.com/vectordotdev/vector/blob/v0.37.0/lib/vector-buffers/src/variants/disk_v2/writer.rs#L535-L543

### Configuration

These should be the relevant bits of the Vector config:

```yaml
sources:
  kube-audit-log-file:
    include:
    - "/var/log/k8s/audit/*.log"
    line_delimiter: "\n"
    max_line_bytes: 20971520
    max_read_bytes: 10485760
    oldest_first: true
    read_from: beginning
    type: file

transforms:
  kube-audit-logs-transform:
    drop_on_error: true
    inputs:
    - kube-audit-log-file
    source: |
      . = parse_json!(.message)
      .timestamp = parse_timestamp!(.stageTimestamp, format: "%+")
      del(.stageTimestamp)
    type: remap

sinks:
  kube-audit-logs-s3:
    batch:
      max_bytes: 104857600
      max_events: 1000000
      timeout_secs: 120
    bucket: redacted
    buffer:
      max_size: 272629760
      type: disk
    compression: gzip
    encoding:
      codec: json
    filename_append_uuid: false
    filename_time_format: "%FT%TZ"
    framing:
      method: newline_delimited
    healthcheck:
      enabled: false
    inputs:
    - kube-audit-logs-transform
    key_prefix: redacted/instance_id=redacted/y=%Y/m=%m/d=%d/h=%H/kube-audit-
    region: us-east-1
    type: aws_s3
```
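
For reference, the size-related settings above work out to the following (my own conversions, using 1 MiB = 1,048,576 bytes); the `max_data_file_size=134217728` that shows up in the debug output below is likewise 128 MiB:

```yaml
# max_line_bytes:   20971520  ->  20 MiB per source line
# max_read_bytes:   10485760  ->  10 MiB per read
# batch.max_bytes:  104857600 -> 100 MiB per S3 batch
# buffer.max_size:  272629760 -> 260 MiB disk buffer
```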

My suspicion is that `. = parse_json!(.message)` is causing us to pull a field out of the audit log into the top level of the Vector log event that Vector can't handle, but I'm really not sure.
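
If that suspicion is right, a variant like the following might help narrow it down (an untested sketch; the `audit` field name is arbitrary). It keeps the parsed payload nested under a dedicated field instead of replacing the event root, which would at least tell us whether replacing the root is what triggers the error:

```yaml
transforms:
  kube-audit-logs-transform:
    drop_on_error: true
    inputs:
    - kube-audit-log-file
    source: |
      # Keep the parsed audit payload nested instead of replacing the root.
      .audit = parse_json!(.message)
      .timestamp = parse_timestamp!(.audit.stageTimestamp, format: "%+")
      del(.audit.stageTimestamp)
      # The raw line is no longer needed once it has been parsed.
      del(.message)
    type: remap
```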


### Version

0.37.0

### Debug Output

```text
The particular log entry that was causing this problem yesterday has since aged out of the input logs on this host, so I don't have a clean log with `RUST_BACKTRACE=full` turned on, but here is what I think was the important part of the logs from my terminal yesterday:

2024-03-29T03:14:58.612656Z DEBUG vector::internal_events::file::source: Files checkpointed. count=2 duration_ms=0
2024-03-29T03:14:58.793276Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}:file_server: file_source::file_server: event_throughput=32.224k/sec bytes_throughput=66.396m/sec ratios={"discovery": 0.14674045, "other": 9.2514914e-5, "reading": 0.04861154, "sending": 0.80454713, "sleeping": 8.445541e-6}
2024-03-29T03:14:59.614624Z DEBUG vector::internal_events::file::source: Files checkpointed. count=2 duration_ms=0
2024-03-29T03:14:59.954524Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}:file_server: file_source::file_server: event_throughput=31.825k/sec bytes_throughput=62.889m/sec ratios={"discovery": 0.14409374, "other": 8.7578e-5, "reading": 0.04391994, "sending": 0.8118912, "sleeping": 7.5555213e-6}
2024-03-29T03:15:00.616820Z DEBUG vector::internal_events::file::source: Files checkpointed. count=3 duration_ms=0
2024-03-29T03:15:00.952366Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}: vector::utilization: utilization=0.1351320176639663
2024-03-29T03:15:00.952378Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-filter-deprecated-apis component_type=filter}: vector::utilization: utilization=0.3501196497085538
2024-03-29T03:15:00.952396Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-stale-token component_type=filter}: vector::utilization: utilization=0.24884998544339113
2024-03-29T03:15:01.002288Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}:write_record:try_write_record_inner: vector_buffers::variants::disk_v2::writer: Archived record is too large to fit in remaining free space of current data file. current_data_file_size=134078256 max_data_file_size=134217728 archive_on_disk_len=1931168
2024-03-29T03:15:01.020429Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}:write_record:try_write_record_inner: vector_buffers::variants::disk_v2::writer: Current data file reached maximum size. Rolling to the next data file. current_data_file_size=134078256 max_data_file_size=134217728 last_attempted_write_size=1931168
2024-03-29T03:15:01.023677Z DEBUG sink{component_kind="sink" component_id=grafana component_type=prometheus_remote_write}: vector::utilization: utilization=0.0003672201079320879
2024-03-29T03:15:01.029705Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}:write_record:try_write_record_inner:ensure_ready_for_write: vector_buffers::variants::disk_v2::writer: Opened data file for writing. data_file_path="/var/vector/kube-audit-data/buffer/v2/kube-audit-logs-s3/buffer-data-2.dat" existing_file_size=0
2024-03-29T03:15:01.029873Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}:write_record:try_write_record_inner:ensure_ready_for_write: vector_buffers::variants::disk_v2::writer: Writer now on new data file. new_writer_file_id=2
2024-03-29T03:15:01.032136Z DEBUG sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector::utilization: utilization=0.07421148626875576
2024-03-29T03:15:01.032202Z DEBUG sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector_buffers::variants::disk_v2::reader: Reached the end of current data file. reader_file_id=1 writer_file_id=2
2024-03-29T03:15:01.032219Z DEBUG sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector_buffers::variants::disk_v2::reader: Marking data file for deletion. data_file_path="/var/vector/kube-audit-data/buffer/v2/kube-audit-logs-s3/buffer-data-1.dat" first_record_id=57001 last_record_id=128000 record_count=71 event_count=71000 bytes_read=134078256
2024-03-29T03:15:01.032253Z DEBUG sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector_buffers::variants::disk_v2::reader: Rolling to next data file.
2024-03-29T03:15:01.032304Z DEBUG sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector_buffers::variants::disk_v2::reader: Opened data file for reading. data_file_path="/var/vector/kube-audit-data/buffer/v2/kube-audit-logs-s3/buffer-data-2.dat"
2024-03-29T03:15:01.164051Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}:file_server: file_source::file_server: event_throughput=31.593k/sec bytes_throughput=60.798m/sec ratios={"discovery": 0.14129955, "other": 8.534103e-5, "reading": 0.040749304, "sending": 0.8178586, "sleeping": 7.3234564e-6}
2024-03-29T03:15:01.618553Z DEBUG vector::internal_events::file::source: Files checkpointed. count=3 duration_ms=0
2024-03-29T03:15:02.401606Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}:file_server: file_source::file_server: event_throughput=32.496k/sec bytes_throughput=59.391m/sec ratios={"discovery": 0.13891983, "other": 8.299668e-5, "reading": 0.039334483, "sending": 0.82165587, "sleeping": 6.8271243e-6}
2024-03-29T03:15:02.622472Z DEBUG vector::internal_events::file::source: Files checkpointed. count=4 duration_ms=2
2024-03-29T03:15:02.756229Z ERROR sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector_buffers::internal_events: Error encountered during buffer read. error=failed to decoded record: InvalidProtobufPayload error_code="decode_failed" error_type="reader_failed" stage="processing" internal_log_rate_limit=true
2024-03-29T03:15:03.134841Z ERROR sink{component_kind="sink" component_id=kube-audit-logs-s3 component_type=aws_s3}: vector_buffers::internal_events: Internal log [Error encountered during buffer read.] is being suppressed to avoid flooding.
2024-03-29T03:15:03.569558Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}:file_server: file_source::file_server: event_throughput=33.237k/sec bytes_throughput=58.397m/sec ratios={"discovery": 0.13845907, "other": 8.2831175e-5, "reading": 0.038347527, "sending": 0.82310367, "sleeping": 6.9105263e-6}
2024-03-29T03:15:03.623265Z DEBUG vector::internal_events::file::source: Files checkpointed. count=4 duration_ms=0
2024-03-29T03:15:03.853501Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}:write_record:try_write_record_inner: vector_buffers::variants::disk_v2::writer: Archived record is too large to fit in remaining free space of current data file. current_data_file_size=131745744 max_data_file_size=134217728 archive_on_disk_len=2638184
2024-03-29T03:15:03.857754Z ERROR transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}: vector_buffers::topology::channel::sender: Disk buffer writer has encountered an unrecoverable error.
2024-03-29T03:15:03.871450Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2024-03-29T03:15:03.871471Z ERROR transform{component_kind="transform" component_id=kube-audit-logs-transform component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: writer entered inconsistent state: failed to decode record immediately after encoding it.
2024-03-29T03:15:03.871505Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}: vector::topology::builder: Source pump finished with an error.
2024-03-29T03:15:03.871515Z  INFO vector: Vector has stopped.
2024-03-29T03:15:03.871533Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}: vector::topology::builder: Source pump supervisor task finished with an error.
2024-03-29T03:15:03.871554Z DEBUG source{component_kind="source" component_id=kube-audit-log-file component_type=file}: vector::topology::builder: Source finished with an error.
2024-03-29T03:15:03.871565Z ERROR source{component_kind="source" component_id=kube-audit-log-file component_type=file}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-03-29T03:15:03.871594Z DEBUG transform{component_kind="transform" component_id=kube-audit-logs-filter-deprecated-apis component_type=filter}: vector::topology::builder: Synchronous transform finished normally.
2024-03-29T03:15:03.872060Z ERROR vector::internal_events::common: Failed to forward event(s), downstream is closed. error_code="stream_closed" error_type="writer_failed" stage="sending" internal_log_rate_limit=true
```

### Example Data

Kube audit log entries are quite large since they contain entire request bodies. On servers where this problem occurs, we are typically generating 100MB of audit logs per minute with per-minute file rotation, so it's really impractical to track down which entries are causing the problem.

Is there a way I can make Vector print the offending entry to its log when it encounters one, so I can easily give you an example that triggers the problem?
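
In the meantime, something like the following might let me catch a reproducer myself (a rough, untested sketch; the transform name and the 1,000,000-byte threshold are made up). The idea is to wire it between the existing transform and the S3 sink and log any event whose JSON encoding is unusually large:

```yaml
transforms:
  kube-audit-logs-debug:
    type: remap
    inputs:
    - kube-audit-logs-transform
    source: |
      # Log events whose encoded size is suspiciously large so a concrete
      # offending entry can be captured from Vector's own log output.
      if strlen(encode_json(.)) > 1000000 {
        log(encode_json(.), level: "warn", rate_limit_secs: 60)
      }
```

The S3 sink's `inputs` would have to point at this transform instead of `kube-audit-logs-transform` for events to keep flowing through it.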

Typically these log files contain line-delimited JSON, but I have pretty-printed one of the smaller entries as an example of the data they contain:
```json
{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "ec121b54-6cc5-4f54-ade1-267490d9a876",
  "stage": "ResponseComplete",
  "requestURI": "/apis/hubspot.com/v1/namespaces/kube-system/leader-elections/redacted",
  "verb": "get",
  "user": {
    "username": "system:serviceaccount:kube-system:redacted",
    "uid": "fe388e9a-88ff-48fc-8609-e0c04fa2d868",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:kube-system",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/pod-name": [
        "redacted-6dcfbddbff-t96rt"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "af047acb-e5a9-4d4f-aae7-deb62fa18000"
      ]
    }
  },
  "sourceIPs": [
    "172.25.17.157"
  ],
  "userAgent": "redacted",
  "objectRef": {
    "resource": "leader-elections",
    "namespace": "kube-system",
    "name": "redacted",
    "apiGroup": "redacted",
    "apiVersion": "v1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2024-03-29T20:01:26.231261Z",
  "stageTimestamp": "2024-03-29T20:01:26.232960Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by RoleBinding \"kube-rbac-system:serviceaccount:kube-system:redacted-for-namespace-kube-system-kube-rbac-system:serviceaccount:kube-system:redacted-for-namespace-kube-system-cluster-role-role-binding/kube-system\" of ClusterRole \"kube-rbac-system:serviceaccount:kube-system:redacted-for-namespace-kube-system-cluster-role\" to User \"system:serviceaccount:kube-system:redacted\""
  }
}
```

### Additional Context

No response

### References

No response

tobz commented 5 months ago

Hmmm, this is indeed puzzling.

The code does work most of the time, so perhaps this is in fact related to it being an oversized record... although in that case, it should fail way before this point. 😂

I'll see if we can reproduce this locally.

PaulFurtado commented 5 months ago

@tobz happy to help here too. Is there an easy place we can update the code to log the source record that caused the problem? It would be great if we could provide you with a reproducer.