Open ee07b415 opened 11 months ago
Hi @ee07b415,
I suspect what might be happening here is that some of the requests sent from the agent to the aggregator are timing out, which causes the agent to retry. When this happens, the agent will count the events in that payload once, but the aggregator will count those events n times, where n is the number of times the request was sent.
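The counting mismatch described above can be illustrated with a toy simulation. This is only a sketch under assumed conditions: the function name `simulate`, its parameters, and the retry model (every timed-out request was still received by the aggregator and is resent in full) are hypothetical and are not Vector's actual retry logic.

```python
import random


def simulate(num_requests, events_per_request, timeout_prob, max_retries, seed=0):
    """Return (agent_count, aggregator_count) under a simplified retry model."""
    rng = random.Random(seed)
    agent_count = 0
    aggregator_count = 0
    for _ in range(num_requests):
        # The agent counts the payload once, no matter how often it is sent.
        agent_count += events_per_request
        attempts = 1
        # Assumption: a timed-out request still reached the aggregator,
        # and the agent resends the same payload, up to max_retries times.
        while attempts <= max_retries and rng.random() < timeout_prob:
            attempts += 1
        # The aggregator counts the payload once per attempt it received.
        aggregator_count += attempts * events_per_request
    return agent_count, aggregator_count


if __name__ == "__main__":
    agent, aggregator = simulate(
        num_requests=100_000, events_per_request=10, timeout_prob=0.01, max_retries=3
    )
    print(f"agent={agent} aggregator={aggregator} "
          f"excess={(aggregator - agent) / agent:.2%}")
```

In this model, a small timeout probability produces an aggregator total that is higher than the agent total by a similarly small fraction.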
To detect if this is the case, you can compare the following two queries. The vector source in the aggregator will only increment vector_component_sent_events_total when it finishes processing the request, whereas it increments vector_component_received_events_total when it begins processing the request.
sum(rate(vector_component_sent_events_total{component_id="vector_source"}[5m]))
sum(rate(vector_component_sent_events_total{component_id="vector_sink"}[5m]))
If the above queries are comparable, then that is likely the issue.
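For anyone who wants to compare the two values outside a dashboard, here is a minimal sketch that builds a Prometheus instant-query URL, parses the JSON response, and computes the relative difference. The base URL `http://localhost:9090`, the helper names, and the choice of an instant query are assumptions for illustration; the response shape follows the Prometheus HTTP API (`/api/v1/query`).

```python
import json
from urllib.parse import urlencode

SOURCE_QUERY = 'sum(rate(vector_component_sent_events_total{component_id="vector_source"}[5m]))'
SINK_QUERY = 'sum(rate(vector_component_sent_events_total{component_id="vector_sink"}[5m]))'


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def scalar_from_response(body):
    """Extract the single sample value from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    results = payload["data"]["result"]
    if len(results) != 1:
        raise ValueError(f"expected exactly one series, got {len(results)}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def relative_excess(aggregator_rate, agent_rate):
    """Fraction by which the aggregator-side rate exceeds the agent-side rate."""
    return (aggregator_rate - agent_rate) / agent_rate


if __name__ == "__main__":
    # Hypothetical Prometheus address; fetch these URLs with any HTTP client.
    print(instant_query_url("http://localhost:9090", SOURCE_QUERY))
    print(instant_query_url("http://localhost:9090", SINK_QUERY))
```

Fetching both URLs and passing the response bodies through `scalar_from_response` gives two numbers to feed into `relative_excess`; a result close to zero means the two rates are comparable.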
Also, if you upgrade to the latest version (v0.34.1), there are additional metrics that we can use to investigate:
Component Dropped Events: component_discarded_events_total
If the vector source in the aggregator is dropping events, then the above scenario is happening: requests are timing out in the agent and it's retrying those requests.
GRPC server metrics: vector_grpc_server_messages_received_total, vector_grpc_server_messages_sent_total, vector_grpc_server_handler_duration_seconds
These show the vector source's response times to the agent.
Please try out the above and let us know what you find.
Hi @dsmith3197, thanks for providing the information. I will try them out. We do have the component discarded events metric, so we know we didn't lose any events.
Hi @dsmith3197, sum(rate(vector_component_sent_events_total{component_id="vector_source"}[5m])) is the same as sum(rate(vector_component_received_events_total{component_id="vector_source"}[5m])), so the chart looks the same after changing the query.
I see, thank you for checking.
I would still advise you to upgrade to Vector v0.34.1. This includes the additional server metrics mentioned above and a bug fix in sources that might help identify the issue.
Specifically, the v0.34.0 release includes the following:
Sources now correctly emit a log and increment component_discarded_events_total when incoming requests are cancelled before the events are pushed to downstream components.
Other than that, is it possible that there are other clients sending to the vector_source in the aggregator?
Problem
We are using Vector as both the metrics agent and the aggregator. To check data quality, we built a dashboard that reports the rate of metric events received on the aggregator versus the rate of metric events sent from all Vector agents (we have more than 60 at this time). We thought the two numbers should be comparable, but the aggregator always seems to report a bigger number than the sum of the senders.
It doesn't matter how many agents we are using; the aggregator's total event count always seems to be around 1% bigger.
We also send the agents' internal metrics to the aggregator, so we can compare their values from the same location:
Query for received total:
sum(rate(vector_component_received_events_total{component_id="vector_source"}[5m]))
Query for sent total:
sum(rate(vector_component_sent_events_total{component_id="vector_sink"}[5m]))
Please find "vector_source" and "vector_sink" in the following configuration section.
I'm not sure if it is our mistake or whether some metrics are counted differently for sources and sinks in Vector.
Configuration
Version
vector 0.33.0