vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

vector agent stopped working quietly after running several days #18157

Open honganan opened 1 year ago

honganan commented 1 year ago


Problem

I am using the vector agent to collect logs from file sources and sink them to Kafka. The vector agent runs as a DaemonSet in k8s.

It stops working after running for several days. I restarted it when I found the problem, but the same problem occurred again several days later. The symptom is that sinking to Kafka stops and CPU usage drops noticeably; the metrics scraped by Prometheus also disappear. However, the vector pod is still running and still produces some of its own log output.

Here is the checkpoint metrics graph, expr: rate(vector_checkpoints_total[1m]) [graph image]

The vector_kafka_requests_total and CPU graphs look basically the same.

I checked the error logs around the time of the problem but did not find anything useful.

Part of the logs:

2023-08-01T09:15:16.793899Z ERROR vector::internal_events::file::source: Failed reading file for fingerprinting. file=/var/log/pods/.../node-exporter/0.log error=No such file or directory (os error 2) error_code="reading_fingerprint" error_type="reader_failed" stage="receiving" internal_log_rate_limit=true
2023-08-01T09:15:16.796791Z ERROR vector::internal_events::file::source: Internal log [Failed reading file for fingerprinting.] is being suppressed to avoid flooding.
2023-08-01T09:15:17.550608Z  WARN source{component_kind="source" component_id=source_java_base_sentinel_log component_type=file component_name=source_java_base_sentinel_log}: vector::internal_events::encoding_transcode: Internal log [Replaced malformed sequences with replacement character while decoding to utf8.] has been suppressed 1 times.
2023-08-01T09:15:17.550625Z  WARN source{component_kind="source" component_id=source_java_base_sentinel_log component_type=file component_name=source_java_base_sentinel_log}: vector::internal_events::encoding_transcode: Replaced malformed sequences with replacement character while decoding to utf8. from_encoding=UTF-8 internal_log_rate_limit=true
2023-08-01T09:15:21.354278Z ERROR sink{component_kind="sink" component_id=sinks_log_kafka component_type=kafka component_name=sinks_log_kafka}: vector::internal_events::template: Internal log [Failed to render template.] has been suppressed 1 times.
2023-08-01T09:15:21.354299Z ERROR sink{component_kind="sink" component_id=sinks_log_kafka component_type=kafka component_name=sinks_log_kafka}: vector::internal_events::template: Failed to render template. error=Missing fields on event: [".type"] error_type="template_failed" stage="processing" internal_log_rate_limit=true
2023-08-01T09:15:21.354335Z ERROR sink{component_kind="sink" component_id=sinks_log_kafka component_type=kafka component_name=sinks_log_kafka}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] has been suppressed 1 times.
2023-08-01T09:15:21.354339Z ERROR sink{component_kind="sink" component_id=sinks_log_kafka component_type=kafka component_name=sinks_log_kafka}: vector_common::internal_event::component_events_dropped: Events dropped intentional=false count=1 reason="Failed to render template." internal_log_rate_limit=true
2023-08-01T09:15:21.361284Z ERROR sink{component_kind="sink" component_id=sinks_log_kafka component_type=kafka component_name=sinks_log_kafka}: vector::internal_events::template: Internal log [Failed to render template.] is being suppressed to avoid flooding.
2023-08-01T09:15:21.361323Z ERROR sink{component_kind="sink" component_id=sinks_log_kafka component_type=kafka component_name=sinks_log_kafka}: vector_common::internal_event::component_events_dropped: Internal log [Events dropped] is being suppressed to avoid flooding.

Configuration

The key parts of the configuration are below:

sources:
  source_java_base_fast_log:
    type: file
    include:
    - /home/docker/logs/apps/*/info.log
    multiline:
      condition_pattern: '^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})(\.(\d{3}))?'
      mode: halt_before
      start_pattern: '^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})(\.(\d{3}))?'
      timeout_ms: 1000
    ignore_older_secs: 86400
    host_key: host
    file_key: filename
    fingerprint:
      strategy: device_and_inode
    max_line_bytes: 1024000
    max_read_bytes: 2048
    remove_after_secs: 172800
    encoding:
      charset: utf-8
transforms:
  trans_java_base_fast_log:
    type: remap
    inputs:
    - source_java_base_fast_log
    source: |-
      ...

api:
  address: 0.0.0.0:8686
  enabled: true
data_dir: /home/docker/vector/data
expire_metrics_secs: 600
log_schema:
  timestamp_key: '@timestamp'
sinks:
  sinks_log_kafka:
    batch:
      max_bytes: 100000
      max_events: 100
      timeout_secs: 0.1
    bootstrap_servers: "xxx"
    compression: snappy
    encoding:
      codec: json
    inputs:
    - trans*
    key_field: '{{.pod.name}}'
    librdkafka_options:
      acks: "1"
      queue.buffering.max.kbytes: "900"
      queue.buffering.max.messages: "100000"
      topic.metadata.refresh.interval.ms: "30000"
    healthcheck:
      enabled: false
    topic: 'log.{{.type}}'
    type: kafka
sources:
  source_internal_logs:
    type: internal_logs
timezone: local
transforms:
  trans_internal_log:
    inputs:
    - source_internal_logs
    source: |-
      .pod.name = get_env_var!("POD_NAME")
    type: remap

Version

v0.31.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

neuronull commented 1 year ago

Hello, thanks for the report.

It's unclear to me if there is one issue that is being expressed/cascaded throughout the pipeline, or if there are multiple issues.

The "Failed to render template." errors come from the kafka sink's topic template:

    topic: 'log.{{.type}}'

indicating that the .type field was not found on the event.

honganan commented 1 year ago

Thanks for replying!

  1. I checked the source file from the error; it exists but is a symlink. I don't know if that matters.
  2. The character transcode warning has been there from the beginning; as far as I understand it should be harmless.
  3. Our internal_metrics source did not have a .type field but was matched into the log sink pipeline (the sink's inputs wildcard is trans*). I revised it, roughly as in the sketch below, and will observe for a while. In addition, it is also strange that all the metrics disappeared.
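A minimal sketch of that revision, assuming the events in question come through the trans_internal_log remap (the exists() check and the "internal" default are placeholders, not the exact production values):

transforms:
  trans_internal_log:
    type: remap
    inputs:
    - source_internal_logs
    source: |-
      .pod.name = get_env_var!("POD_NAME")
      # placeholder default so the kafka topic template 'log.{{.type}}'
      # always has a .type field to render
      if !exists(.type) { .type = "internal" }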
honganan commented 1 year ago

A new discovery: I checked the logs of the two most recent vector crashes and found that the error message "Failed reading file for fingerprinting" was output continuously until the problem occurred, and then stopped.

ERROR vector::internal_events::file::source: Failed reading file for fingerprinting. file=/var/log/pods/.../node-exporter/0.log error=No such file or directory (os error 2) error_code="reading_fingerprint" error_type="reader_failed" stage="receiving" internal_log_rate_limit=true
pront commented 1 year ago

Hi @honganan, apologies for the delayed response. I just read your last comment; can you confirm that the source file is present and that the vector process has read permission on it?

honganan commented 1 year ago

Hi @honganan, apologies for the delayed response. I just read your last comment; can you confirm that the source file is present and that the vector process has read permission on it?

@pront Thank you for the ongoing follow-up. I checked the file: it is a symlink and the target file no longer exists. Could this cause a crash?

neuronull commented 1 year ago

I've been trying to reproduce this with various iterations of symlinks to files and dirs, and with the remove_after_secs and ignore_older_secs settings (those seem relevant if the issue only shows up after some number of days, since your config sets them to 2 days and 1 day respectively), and so far I have not had any luck.
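For reference, a minimal sketch of the kind of repro config I've been iterating on (the paths, component names, and scaled-down timings are mine, just to exercise the same code paths faster):

data_dir: /tmp/vector-repro-data
sources:
  repro_file:
    type: file
    include:
    - /tmp/vector-repro/*.log      # directory of symlinks to the real log files
    ignore_older_secs: 60          # scaled down from 86400
    remove_after_secs: 120         # scaled down from 172800
    fingerprint:
      strategy: device_and_inode
sinks:
  repro_console:
    type: console
    inputs:
    - repro_file
    encoding:
      codec: json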

Any time the link path breaks, whether it points to a file or a directory, the source simply logs an INFO message that the file has stopped being watched.

One thing I find suspicious about the output from

ERROR vector::internal_events::file::source: Failed reading file for fingerprinting. file=/var/log/pods/.../node-exporter/0.log error=No such file or directory (os error 2) error_code="reading_fingerprint" error_type="reader_failed" stage="receiving" internal_log_rate_limit=true

, is the file path:

/var/log/pods/.../node-exporter/0.log

Note the ... in the middle there. The triple-dot notation strikes me as odd; I can't tell if that is the printing of a wildcard (* or **) or something else going wrong.

Also there seems to be a discrepancy between that log message and the provided config.

/var/log/pods/.../node-exporter/0.log

    include:
    - /home/docker/logs/apps/*/info.log

The config is looking for logs named "info.log", but the error message is from a file named "0.log".

Can you confirm we have the correct configuration settings?

honganan commented 1 year ago

Can you confirm we have the correct configuration settings?

Sorry for not pasting the configuration corresponding to that ERROR. I have many source configs; here is the one that matches the ERROR message:

sources:
  source_stdout_plain_log:
    type: file
    include:
    - /var/log/pods/monitoring_*/*/*.log
    - /var/log/pods/kube-system_*/*/*.log
    - /var/log/pods/ingress-nginx_*/*/*.log
    - /var/log/pods/base_*/*/*.log
    ignore_older_secs: 3600    # changed from 1d to 1h
    host_key: host
    file_key: filename
    max_line_bytes: 10240000
    max_read_bytes: 1024000
    remove_after_secs: 172800
    encoding:
      charset: utf-8

This error message is suspicious but not necessarily the root cause. I am going to remove this source_stdout_plain_log config and watch for a while.

Thanks for your effort! To think about it another way: is it possible to find where it crashes, e.g. a thread panicking? If we can find the weak point, we could make it more robust by catching the error or something similar. I am new to Rust and this is just a personal idea. Thanks for your exploratory work.
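In the meantime, I can turn up diagnostics before the next occurrence. A minimal sketch, assuming extra environment variables can be set on the vector container in our DaemonSet spec (VECTOR_LOG controls vector's log level, and RUST_BACKTRACE makes a panicking thread print a backtrace):

# excerpt of the DaemonSet pod spec for the vector container
spec:
  template:
    spec:
      containers:
      - name: vector
        env:
        - name: VECTOR_LOG        # raise vector's log level from the default "info"
          value: "debug"
        - name: RUST_BACKTRACE    # print a backtrace if a thread panics
          value: "full"

Regards!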