vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Troubleshoot a huge loss of container logs in Vector Aggregator pipeline #21793

Open · namm2 opened this issue 2 days ago

namm2 commented 2 days ago

Problem

Hi, I'm running a centralized Vector Aggregator log-processing pipeline, with Vector Agents collecting Kubernetes logs. During a traffic peak a large number of container logs were created in a short period of time, and many of these logs were never ingested into the backend system.

On the Vector Aggregator's source metrics we saw the number of incoming events growing, and the vector_component_discarded_events_total metric growing along with it:

[Screenshot: Aggregator incoming events and vector_component_discarded_events_total both increasing]

The Vector Aggregator was deployed with the Helm chart, with no resource requests/limits set for it. During this time its CPU usage was around 8 CPUs (on a 16-CPU instance):

[Screenshot: Aggregator CPU usage at roughly 8 CPUs]
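
For reference, resource requests/limits could be set via the chart values; a minimal sketch, assuming the chart's standard resources key (the numbers below are purely illustrative, not what we actually run):

  # values.yaml for the Vector Aggregator Helm release (illustrative numbers only)
  resources:
    requests:
      cpu: 4
      memory: 8Gi
    limits:
      cpu: 12
      memory: 16Gi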

In the Vector Agent's internal logs there were a lot of errors like this:

Request failed: status: Internal, message: "protocol error: received message with invalid compression flag: 60 (valid flags are 0 and 1) while receiving response with status: 504 Gateway Timeout", details: [], metadata: MetadataMap { headers: {"content-length": "92", "cache-control": "no-cache", "content-type": "text/html"} }

My understanding is that the Vector Agents were hitting request timeouts while trying to send logs to the Vector Aggregator, i.e. getting back-pressure from the upstream Aggregator, and as a result the Vector Agent's sink buffer was full (max_events: 50000):

[Screenshot: Vector Agent sink buffer filling up to max_events: 50000]
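
For context, this is consistent with how Vector sink buffers behave: the effective buffer config on the agent sink, filling in Vector's defaults for everything we did not set explicitly, would look roughly like the sketch below, and when_full: block means a full buffer pushes back-pressure into the kubernetes_logs source rather than dropping events on the agent side:

  sinks:
    vector-aggregator:
      type: vector
      buffer:
        type: memory        # default buffer type
        max_events: 50000   # what we configured
        when_full: block    # default: back-pressure the source instead of dropping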

I guess this correlates with the full buffer in its sink: the Vector Agent's vector_source_lag_time_seconds_bucket metric values increased, with the worst case around 4000 seconds of lag:

[Screenshot: Agent vector_source_lag_time_seconds_bucket, peaking around 4000 seconds]

At first my guess was that the data loss was due to container log rotation (GKE's logrotate is fixed at 5 log files and a 10MB file size), so the Vector Agent couldn't read the old log files anymore once the Vector Aggregator became healthy again. But that doesn't seem to be the case here, because the Vector Aggregator's metrics show it received the log events and then discarded some of them. So was the data loss at the Vector Aggregator, since it dropped some of the logs?

Configuration

Vector agent configs:

  sources:
    kubernetes:
      type: kubernetes_logs
      max_line_bytes: 131072
      delay_deletion_ms: 3600000  # 1 hour
      glob_minimum_cooldown_ms: 2000  # 2 seconds
<snip>

  sinks:
    vector-aggregator:
      type: vector
      buffer:
        max_events: 50000
      batch:
        max_bytes: 5000000  # 5MB
        max_events: 5000
      inputs:
        - kubernetes
      address: vector-aggregator.logging.svc:9088
      compression: true
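
One possible mitigation on the agent side (not something we currently have enabled, just a hedged sketch) would be end-to-end acknowledgements on this sink, so the kubernetes_logs source should only checkpoint a file once the Aggregator has actually accepted the events, turning timeouts into retries rather than silent gaps:

  sinks:
    vector-aggregator:
      type: vector
      inputs:
        - kubernetes
      address: vector-aggregator.logging.svc:9088
      compression: true
      acknowledgements:
        enabled: true   # wait for the aggregator to acknowledge before checkpointing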

Vector aggregator configs:

  sources:
    kubernetes:
      type: vector
      address: 0.0.0.0:9088
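
I've omitted the Aggregator's downstream sinks above. As a purely hypothetical sketch (the sink name, type, and URI below are made up for illustration, not our real backend), a disk buffer on whatever sink forwards to the backend might let the Aggregator absorb peaks like this instead of cancelling sends from the vector source:

  sinks:
    backend:                          # hypothetical name; the real backend sink is not shown here
      type: http                      # illustrative sink type only
      uri: http://backend.example.com/ingest
      encoding:
        codec: json
      inputs:
        - kubernetes
      buffer:
        type: disk
        max_size: 2147483648          # ~2 GiB of on-disk buffering
        when_full: block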


### Version

0.39.0

### Debug Output

_No response_

### Example Data

_No response_

### Additional Context

_No response_

### References

_No response_
jszwedko commented 1 day ago

Do you see any associated error logs in the Aggregator that indicate why it was discarding events?

anas-aso commented 1 day ago

@jszwedko this is the error we see with this incident:

{
    "message": "Events dropped",
    "metadata":
    {
        "kind": "event",
        "level": "ERROR",
        "module_path": "vector_common::internal_event::component_events_dropped",
        "target": "vector_common::internal_event::component_events_dropped"
    },
    "pid": 1,
    "reason": "Source send cancelled.",
    "source_type": "internal_logs"
}