Open ShahroZafar opened 3 months ago
Hi @ShahroZafar ,
That error is likely caused by a partial write by the Vector process due to the OOM kill. I think Vector could avoid this issue by not updating the checkpoint file in-place, but instead writing a new file each time it checkpoints and then moving that file into place (which would be an atomic operation). I'll leave this bug open to track that as an enhancement. Unfortunately I'm not sure there is much you can do to avoid it at the moment. Curiously, I think this is the first report we've had mentioning that error, so it seems to be uncommon 🤔
It could also be caused by other things like the disk being full, but that seems much less likely here.
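The write-then-rename approach described above can be sketched as follows. This is a minimal illustration, not Vector's actual checkpointer code; the function name and paths are hypothetical. The key idea is that `rename` on the same filesystem is atomic on POSIX systems, so a crash (such as an OOM kill) mid-write leaves either the old checkpoint file or the complete new one, never a truncated one that later fails to parse with "EOF while parsing a value".

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Hypothetical sketch: write checkpoint data to a temporary file first,
// flush it to disk, then atomically rename it over the old checkpoint.
// An interrupted process can no longer leave a half-written file behind.
fn write_checkpoint_atomically(path: &Path, data: &str) -> std::io::Result<()> {
    // Sibling temp file so the rename stays on the same filesystem.
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(data.as_bytes())?;
    f.sync_all()?; // ensure the bytes hit disk before the rename
    fs::rename(&tmp, path)?; // atomic replacement on POSIX filesystems
    Ok(())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("checkpoints.json");
    write_checkpoint_atomically(&path, r#"{"position": 12345}"#)?;
    println!("{}", fs::read_to_string(&path)?);
    Ok(())
}
```

In contrast, updating the file in place means a kill between `truncate` and the final `write` leaves an empty or partial file, which matches the `EOF while parsing a value at line 1 column 0` error reported here.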
Thanks for this report!
Hi @jszwedko,
I saw the source labelled as `file` in the labels. However, the source is `kubernetes_logs` and the buffer type is `memory`.
Thanks
Problem
We have Vector running as a DaemonSet in AWS.
After an OOM kill (Vector is OOM-killed at only 1000 events/sec with a limit of 1024Mi, which is a question in itself), we are seeing the error below.
What can actually lead to this error, and is there a way to avoid it?
Configuration
Version
0.39.0
Debug Output
No response
Example Data
2024-06-28T12:12:42.034854Z INFO vector::app: Log level is enabled. level="info"
2024-06-28T12:12:42.036718Z INFO vector::config::watcher: Creating configuration file watcher.
2024-06-28T12:12:42.037583Z INFO vector::config::watcher: Watching configuration files.
2024-06-28T12:12:42.037625Z INFO vector::app: Loading configs. paths=["/etc/vector"]
2024-06-28T12:12:42.039203Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="kubernetes_logs" sink="kafka"
2024-06-28T12:12:42.039221Z WARN vector::config: Source has acknowledgements enabled by a sink, but acknowledgements are not supported by this source. Silent data loss could occur. source="internal_metrics" sink="prometheus_exporter"
2024-06-28T12:12:42.040969Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Obtained Kubernetes Node name to collect logs for (self). self_node_name="ip-10-120-121-80.ec2.internal"
2024-06-28T12:12:42.047783Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Including matching files. ret=["/*"]
2024-06-28T12:12:42.047800Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Excluding matching files. ret=["*/.gz", "/*.tmp"]
2024-06-28T12:12:42.049402Z INFO vector::topology::running: Running healthchecks.
2024-06-28T12:12:42.049477Z INFO vector: Vector has started. debug="false" version="0.39.0" arch="x86_64" revision="73da9bb 2024-06-17 16:00:23.791735272"
2024-06-28T12:12:42.049485Z INFO vector::topology::builder: Healthcheck passed.
2024-06-28T12:12:42.049757Z INFO vector::sinks::prometheus::exporter: Building HTTP server. address=0.0.0.0:9090
2024-06-28T12:12:42.050066Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Unable to recover checkpoint data from interrupted process. error=EOF while parsing a value at line 1 column 0
2024-06-28T12:12:42.050221Z WARN librdkafka: librdkafka: CONFWARN [thrd:app]: Configuration property request.required.acks is a producer property and will be ignored by this consumer instance
2024-06-28T12:12:42.050532Z INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs}:file_server: file_source::checkpointer: Loaded checkpoint data.
Additional Context
References
No response