vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Write checkpoints atomically to avoid "unable to recover checkpoint data from interrupted process" errors #20753

Open ShahroZafar opened 3 months ago

ShahroZafar commented 3 months ago

Problem

We have Vector running as a DaemonSet in AWS.

After an OOM kill (it gets OOMKilled at only 1000 events/sec with a 1024Mi memory limit, which is a question in itself), we are seeing this error:

2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0

What can actually lead to this error, and is there a way to avoid it?
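
For what it's worth, that exact message is what serde_json produces when it is asked to parse empty input, which is consistent with the checkpoint file having been left empty or truncated. A minimal sketch that reproduces the message (this assumes the serde_json crate and is only an illustration, not Vector's checkpointer code):

fn main() {
    // Stand-in for a checkpoint file left empty by an interrupted process.
    let truncated_checkpoint = "";
    let result: Result<serde_json::Value, serde_json::Error> =
        serde_json::from_str(truncated_checkpoint);
    // Prints: EOF while parsing a value at line 1 column 0
    println!("{}", result.unwrap_err());
}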

Configuration

customConfig:
  data_dir: /vector-data-dir
  expire_metrics_secs: 300
  acknowledgements:
    enabled: true
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  sources:
    internal_metrics:
      type: internal_metrics
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 2000
      ingestion_timestamp_field: "ingest_timestamp"
  transforms:
    dedot_keys:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        . = map_keys(., recursive: true) -> |key| { replace(key, ".", "_") }
  sinks:
    kafka:
      type: kafka
      inputs:
        - dedot_keys
      bootstrap_servers: kafka:9092
      topic: vector
      encoding:
        codec: json
      compression: zstd
      batch:
        timeout_secs: 1
        max_bytes: 1000000
        max_events: 5000
      librdkafka_options:
        client.id: "vector"
        request.required.acks: "1"
      message_timeout_ms: 0
      buffer:
        type: memory
        when_full: block
        max_events: 500
    prometheus_exporter:
      type: prometheus_exporter
      flush_period_secs: 60
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090
      buffer:
        type: memory
        when_full: block
        max_events: 500

Version

0.39.0

Debug Output

No response

Example Data

2024-06-28T12:12:42.034854Z INFO vector::app: Log level is enabled. level="info"
2024-06-28T12:12:42.036718Z INFO vector::config::watcher: Creating configuration file watcher.
2024-06-28T12:12:42.037583Z INFO vector::config::watcher: Watching configuration files.
2024-06-28T12:12:42.037625Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2024-06-28T12:12:42.039203Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="kubernetes_logs" sink="kafka"
2024-06-28T12:12:42.039221Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="internal_metrics" sink="prometheus_exporter"
2024-06-28T12:12:42.040969Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Obtained Kubernetes Node name to collect logs for (self). self_node_name="ip-10-120-121-80.ec2.internal"
2024-06-28T12:12:42.047783Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Including matching files. ret=["**/*"]
2024-06-28T12:12:42.047800Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Excluding matching files. ret=["**/*.gz", "**/*.tmp"]
2024-06-28T12:12:42.049402Z INFO vector::topology::running: Running healthchecks.
2024-06-28T12:12:42.049477Z INFO vector: Vector has started. debug="false" version="0.39.0" arch="x86_64" revision="73da9bb 2024-06-17 16:00:23.791735272"
2024-06-28T12:12:42.049485Z INFO vector::topology::builder: Healthcheck passed.
2024-06-28T12:12:42.049757Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0
2024-06-28T12:12:42.050221Z WARN librdkafka: librdkafka: CONFWARN [thrd:app]: Configuration property request.required.acks is a producer property and will be ignored by this consumer instance
2024-06-28T12:12:42.050532Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Loaded checkpoint data.

Additional Context

(three screenshots attached in the original issue)

References

No response

jszwedko commented 3 months ago

Hi @ShahroZafar ,

That error is likely caused by a partial write by the Vector process due to the OOM kill. I think Vector could avoid this issue by not updating the checkpoint file in-place, but instead writing a new file each time it checkpoints and then moving that file into place (which is an atomic operation). I'll leave this bug open to track that as an enhancement. Unfortunately, I'm not sure there is much you can do to avoid it at the moment. Curiously, I think this is the first report we've had mentioning that error, so it seems to be uncommon 🤔
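
The proposal above is the standard write-to-temp-then-rename pattern. A minimal sketch of that pattern in Rust, using only the standard library (the helper name and file paths here are hypothetical, not Vector's actual checkpointer code):

use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Persist checkpoint bytes atomically: write to a temporary file in the same
// directory, flush it to disk, then rename it over the target. Because the
// rename is atomic on POSIX filesystems, a reader only ever sees the old or
// the new complete file, never a partially written one.
fn write_checkpoints_atomically(target: &Path, data: &[u8]) -> io::Result<()> {
    // Keep the temp file on the same filesystem as the target; a cross-device
    // rename would fall back to a copy and lose atomicity.
    let tmp = target.with_extension("tmp");

    let mut file = File::create(&tmp)?;
    file.write_all(data)?;
    // Ensure the bytes reach disk before the rename makes them visible.
    file.sync_all()?;

    fs::rename(&tmp, target)?;
    Ok(())
}

fn main() -> io::Result<()> {
    // Example usage with a made-up checkpoint payload.
    write_checkpoints_atomically(
        Path::new("checkpoints.json"),
        br#"{"version":1,"checkpoints":[]}"#,
    )
}

With this pattern, a crash mid-write (for example from an OOM kill) leaves the previous checkpoint file intact instead of a truncated one, so the "unable to recover checkpoint data" error cannot come from a half-written file.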

It could also be caused by other things like the disk being full, but that seems much less likely here.

Thanks for this report!

ShahroZafar commented 3 months ago

Hi @jszwedko,

I saw the source shown as file in the labels. However, the configured source is kubernetes_logs and the buffer type is memory.

Thanks