Open namm2 opened 2 days ago
Do you see any associated error logs in the Aggregator that indicate why it was discarding events?
@jszwedko this is the error we see with this incident:
{
  "message": "Events dropped",
  "metadata": {
    "kind": "event",
    "level": "ERROR",
    "module_path": "vector_common::internal_event::component_events_dropped",
    "target": "vector_common::internal_event::component_events_dropped"
  },
  "pid": 1,
  "reason": "Source send cancelled.",
  "source_type": "internal_logs"
}
Problem
Hi, I'm running a centralized Vector Aggregator log processing pipeline, with Vector Agents collecting Kubernetes logs. During a traffic peak, a lot of container logs were created in a short period of time, but many of them were not ingested into the backend system.
On the Vector Aggregator's source metrics, the number of incoming events was growing, and so was the vector_component_discarded_events_total metric.
The Vector Aggregator was deployed with the Helm chart, with no resource requests/limits set. During this time, its CPU usage was ~8 CPUs (on a 16-CPU instance).
In the Vector Agent's internal logs, there were a lot of errors like this:
My understanding is that the Vector Agents were getting request timeouts while trying to send logs to the Vector Aggregator, and that back-pressure from the upstream Aggregator caused the Vector Agent's sink buffer to fill up (max_events: 50000):
I guess this correlates with the full sink buffer: the Vector Agent's vector_source_lag_time_seconds_bucket metric values increased, and the worst case was a lag of 4000 seconds.
At first my guess was that the data loss was due to container log rotation (GKE logrotate is fixed at 5 log files of 10MB each), so the Vector Agent couldn't read the old log files by the time the Vector Aggregator was healthy again. But that doesn't seem to be the case here, because the Vector Aggregator metrics showed that it received log events and then discarded some of them. So was the data loss at the Vector Aggregator, since it dropped some of the logs?
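As a rough sanity check of the log-rotation hypothesis, the sketch below estimates how fast a single container would have to write before rotation removes data that a lagging reader has not reached yet. It uses only the numbers quoted above (5 files, 10MB each, 4000 seconds of lag). The function name, the steady write rate, and the decimal reading of 10MB are assumptions made for illustration; this is not how Vector or the kubelet computes anything.

```python
def max_sustainable_rate(files: int, file_size_bytes: int, lag_seconds: float) -> float:
    """Highest steady write rate (bytes/s) at which a reader that is
    lag_seconds behind can still find its data on disk.

    Assumes all `files` rotated files stay readable until the oldest one
    is deleted, and that the container writes at a constant rate.
    """
    retained_bytes = files * file_size_bytes
    return retained_bytes / lag_seconds


# Numbers from the report: 5 files x 10MB (taken as 10 * 10**6 bytes), 4000 s lag.
rate = max_sustainable_rate(files=5, file_size_bytes=10 * 1000 * 1000, lag_seconds=4000)
print(f"{rate:.0f} bytes/s")  # 12500 bytes/s
```

Under these assumptions, a container writing faster than roughly 12.5 KB/s during the worst lag could lose lines to rotation on the Agent side, independently of anything the Aggregator discards, so the two causes are not mutually exclusive.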
Configuration
Vector aggregator configs: