honganan opened this issue 1 year ago
Hello, thanks for the report.
It's unclear to me if there is one issue and it's being expressed/cascaded throughout the pipeline or if there are multiple issues.
The fingerprinting error from the file source definitely seems like an issue: the error doesn't indicate a permissions problem, but rather that a file that had been written can no longer be read because it is no longer there. Were you able to validate the file's presence on disk? Could something be removing that file? I need to check how that source behaves if we fail to read the checkpoint file.
The transform warning could be relevant: apparently non-UTF-8 characters were found and replaced. I'm wondering if that is what led to the issue in the sink...
The Kafka sink error seems to be coming from this line: topic: 'log.{{.type}}', indicating that the .type field was not found on the event.
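If that's the case, one way to make the template render regardless is a remap transform that guarantees the field exists before events reach the Kafka sink. This is just a sketch, not necessarily the right fix for your pipeline; the transform/sink names and broker address are placeholders:

transforms:
  ensure_type:                        # hypothetical transform name
    type: remap
    inputs:
      - source_stdout_plain_log       # the file source feeding the log pipeline
    source: |
      # Fall back to a default so the templated topic always renders
      if !exists(.type) {
        .type = "unknown"
      }

sinks:
  kafka_logs:                         # hypothetical sink name
    type: kafka
    inputs:
      - ensure_type
    bootstrap_servers: "kafka:9092"   # placeholder
    topic: "log.{{.type}}"
    encoding:
      codec: json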
Thanks for replying!
The internal_metrics source does not have a .type field, but it was matched to the log sink pipeline. I have revised that and will observe for a period of time.
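Roughly, the idea of the revision is to keep internal_metrics out of the Kafka log pipeline, something like the sketch below (the sink name and address are placeholders, not my exact config):

sources:
  vector_metrics:
    type: internal_metrics

sinks:
  vector_prometheus:                  # placeholder sink name
    type: prometheus_exporter
    inputs:
      - vector_metrics                # internal metrics no longer feed the Kafka log sink
    address: "0.0.0.0:9598"           # placeholder scrape address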
In addition, it is also strange that all the metrics disappeared. A new discovery: I checked the logs from the two most recent vector crashes and found that the error message "Failed reading file for fingerprinting" was output continuously until the problem occurred, and then it stopped.
ERROR vector::internal_events::file::source: Failed reading file for fingerprinting. file=/var/log/pods/.../node-exporter/0.log error=No such file or directory (os error 2) error_code="reading_fingerprint" error_type="reader_failed" stage="receiving" internal_log_rate_limit=true
Hi @honganan, apologies for the delayed response. I just read your last comment; can you confirm that the source file is present and the vector process has read permissions?
@pront Thank you for the ongoing follow-up. I checked the file: it's a symlink and the real file does not exist. Could this cause a crash?
I've been trying to reproduce this with various iterations of symlinks to files and dirs, and with config settings for remove_after_secs and ignore_older_secs (those seem relevant if the issue only shows up after some number of days, and your config sets them to 2 days and 1 day respectively), and so far have not had any luck.
Anytime the path of the link is broken, whether it points to a file or a dir, the source logs an INFO message that the file is no longer being watched.
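For context, my reproduction attempts have used a file source along these lines, pointed at a directory of symlinks whose targets get deleted. The paths and source name are made up; the two timeouts mirror your original 1 day / 2 days:

sources:
  repro_stdout_logs:                  # made-up source name
    type: file
    include:
      - /tmp/vector-repro/*/info.log  # symlinked files whose targets get removed mid-test
    ignore_older_secs: 86400          # 1 day
    remove_after_secs: 172800         # 2 days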
One thing I find suspicious about the output from
ERROR vector::internal_events::file::source: Failed reading file for fingerprinting. file=/var/log/pods/.../node-exporter/0.log error=No such file or directory (os error 2) error_code="reading_fingerprint" error_type="reader_failed" stage="receiving" internal_log_rate_limit=true
is the file path: /var/log/pods/.../node-exporter/0.log
Note the ... in the middle there. The triple-dot notation strikes me as odd. I can't tell if that is the printing of the wildcard * or **, or something else going wrong.
Also, there seems to be a discrepancy between that log message and the provided config. The error message references
/var/log/pods/.../node-exporter/0.log
while the config has:
include:
  - /home/docker/logs/apps/*/info.log
The config is looking for logs named "info.log", but the error message is from a file named "0.log".
Can you confirm we have the correct configuration settings?
Sorry for not pasting the configuration corresponding to the ERROR. I have many sources configs; here is the one corresponding to that ERROR message:
sources:
  source_stdout_plain_log:
    type: file
    include:
      - /var/log/pods/monitoring_*/*/*.log
      - /var/log/pods/kube-system_*/*/*.log
      - /var/log/pods/ingress-nginx_*/*/*.log
      - /var/log/pods/base_*/*/*.log
    ignore_older_secs: 3600 # changed from 1d to 1h
    host_key: host
    file_key: filename
    max_line_bytes: 10240000
    max_read_bytes: 1024000
    remove_after_secs: 172800
    encoding:
      charset: utf-8
This error message is suspicious but not necessarily the root cause. I am trying to remove this source_stdout_plain_log config and will watch for a while.
Thanks for your effort! Considering another way of thinking: is it possible to find where it crashes, like a thread crashing? If we can find the weak point, we could make it more robust by adding exception handling or something like that. I am new to Rust and that's just my personal idea. Thanks for your exploratory work. Regards!
A note for the community
Problem
I am using the vector agent to collect logs from file sources and then sink them to Kafka. The vector agent runs as a DaemonSet in k8s. It stops working after running for several days. I restarted it when I found the problem, but the same problem occurred again after several days. The symptoms are that sinking to Kafka stops and CPU overhead drops noticeably; the metrics scraped by Prometheus also disappear, but the vector pod is still running and produces some of its own log output.
Here is the checkpoint files metrics graph, expr: rate(vector_checkpoints_total[1m]). The vector_kafka_requests_total and CPU graphs look basically the same. I checked the error log around the time of the problem but have not found anything useful.
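For reference, the expressions behind those graphs would be roughly the following; the second one is an assumption about how the Kafka requests graph was built, reusing the same rate() window:

rate(vector_checkpoints_total[1m])
rate(vector_kafka_requests_total[1m])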
Part of logs:
Configuration
Version
v0.31.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response