vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Negative Buffer Usage being reported in `Datadog` and `Prometheus` and S3 sink hangs forever #17666

Open · smitthakkar96 opened this issue 1 year ago

smitthakkar96 commented 1 year ago

Problem

We had an incident today similar to this one. Our S3 sink buffer was reporting negative usage values (see the screenshots below).

Screenshot 2023-06-12 at 11:42:57

Screenshot 2023-06-12 at 13:55:01

I also noticed that outgoing events dropped to 0 during this time, and our source time lag increased.

Screenshot 2023-06-12 at 14:00:32

Screenshot 2023-06-12 at 13:59:27

We have seen this in the past when a sink with a blocking buffer fills up: even when the upstream client stops or slows down sending events because of backpressure, the reported buffer usage never goes down and the sink appears to hang forever.
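For context, the behaviour difference comes down to the sink's `buffer.when_full` setting; a minimal sketch below (the sink name is a placeholder, not our real config):

```toml
# Minimal sketch: "example_s3_archive" is a placeholder sink name.
# With "block", a full disk buffer propagates backpressure upstream all the
# way to the source; with "drop_newest", new events are discarded instead
# and the pipeline keeps moving.
[sinks.example_s3_archive.buffer]
type = "disk"
max_size = 8000000000   # bytes
when_full = "block"     # alternative: "drop_newest"
```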

Configuration

  aws_s3_archive.toml: |-
    type = "aws_s3"
    inputs = [ "remap_archive_logs_for_datadog" ]
    key_prefix = "dt=%Y%m%d/hour=%H/"
    compression = "gzip"
    content_encoding = "none"
    bucket = "vector-logs-archive-prod-asia"
    content_type = "application/x-gzip"
    filename_extension = "json.gz"
    filename_time_format = "archive_%H%M%S.%3f0.e"

    framing.method = "newline_delimited"
    encoding.codec = "json"

    batch.max_bytes = 256000000

    buffer.type = "disk"
    buffer.max_size = 8000000000

    buffer.when_full = "block"
  datadog_logs.toml: |-
    type = "datadog_logs"
    inputs = [ "internal_logs", "drop_nginx_ingress_logs" ]
    default_api_key = "${DATADOG_API_KEY}"
    site = "datadoghq.eu"

    buffer.type = "disk"
    buffer.max_size = 36000000000

    buffer.when_full = "block"
  internal_metrics_exporter.toml: |-
    type = "prometheus_exporter"
    inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
    distributions_as_summaries = true
    address = "0.0.0.0:9598"
  nginx_ingress_logs_s3_archive.toml: |-
    # S3 Sink to archive nginx ingress logs

    type = "aws_s3"
    inputs = [ "filter_nginx_logs_for_archival" ]
    key_prefix = "dt=%Y%m%d/hour=%H/"
    compression = "gzip"
    content_encoding = "none"
    bucket = "nginx-ingress-logs-archive-prod-asia"
    content_type = "application/x-gzip"
    filename_extension = "json.gz"
    filename_time_format = "archive_%H%M%S.%3f0.e"

    framing.method = "newline_delimited"
    encoding.codec = "json"

    batch.max_bytes = 256000000

    buffer.type = "disk"
    buffer.max_size = 4000000000

    buffer.when_full = "drop_newest"
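For completeness, the `internal_metrics_exporter` sink above reads from a remap transform whose definition isn't included here. Roughly, the upstream plumbing looks like the sketch below (written in single-file TOML form; the static tag is a made-up example). The buffer gauges in the screenshots, presumably `vector_buffer_events` / `vector_buffer_byte_size`, flow through this path.

```toml
# Sketch of the components feeding internal_metrics_exporter; the real
# remap_enrich_internal_metrics_with_static_tags is not shown in this issue.
[sources.internal_metrics]
type = "internal_metrics"

[transforms.remap_enrich_internal_metrics_with_static_tags]
type = "remap"
inputs = ["internal_metrics"]
source = '''
.tags.cluster = "prod-asia"  # placeholder static tag
'''
```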

Version

vector 0.28.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

smitthakkar96 commented 1 year ago

@jszwedko is it related to https://github.com/vectordotdev/vector/issues/15683 by any chance?

smitthakkar96 commented 10 months ago

Screenshot 2024-01-10 at 10:48:16

A similar issue popped up in v0.33.0: the buffer events metric dropped to a negative number, and at the same time we got an alert about the Datadog Agent receiving a small number of errors when communicating with Vector. I didn't see anything unusual in the logs around that time, and restarting Vector solved the problem.