Vector aggregator received events count always bigger than sum of the vector agents' send event count

ee07b415 commented 11 months ago

Problem

We use Vector as both the metrics agent and the aggregator. To check data quality, we built a dashboard that compares the rate of metrics received on the aggregator with the rate of metrics sent from all Vector agents (more than 60 at this time). We expected the two numbers to be comparable, but the aggregator always reports a larger number than the sum of the senders.

Regardless of how many agents we use, the aggregator's total event count is always around 1% higher.

The agents' internal metrics are also sent to the aggregator, so we can compare both values from the same location.

Query for received total:
sum(rate(vector_component_received_events_total{component_id="vector_source"}[5m]))

Query for sent total:
sum(rate(vector_component_sent_events_total{component_id="vector_sink"}[5m]))

The "vector_source" and "vector_sink" components are defined in the configuration section below.

I'm not sure whether this is our mistake or whether some metrics are counted differently for sources and sinks in Vector.

Configuration

On vector agent
sources:
  internal_metric_source:
    type: internal_metrics
    scrape_interval_secs: 60
  ip-10-0-1-196:
    type: prometheus_scrape
    endpoints:
    - http://10.0.1.196:9999/metrics
    scrape_interval_secs: 60
sinks:
  vector_sink:
    type: vector
    inputs:
    - internal_metric_source
    - ip-10-0-1-196
    address: https://grpc.com
    tls:
      alpn_protocols:
      - h2
      enabled: true
    buffer:
      type: disk
      when_full: drop_newest
      max_size: 10737418240
    batch:
      max_bytes: 2500000
      timeout_secs: 15
    healthcheck: true
    compression: true

On aggregator
sources:
    vector_source:
      type: vector
      address: 0.0.0.0:9556

Version

vector 0.33.0

dsmith3197 commented 11 months ago

Hi @ee07b415,

I suspect what might be happening here is that some of the requests sent from the agent to the aggregator are timing out, which causes the agent to retry. When this happens, the agent will count the events in that payload once but the aggregator will count those events n times, where n is the number of times the request was sent.
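As a rough illustration of the counting asymmetry described above, here is a minimal Python sketch. It is a toy model, not Vector's implementation: the function name `simulate`, the payload size, and the number of retries are all hypothetical, and it assumes the aggregator counts a payload on every attempt while the agent counts it once.

```python
def simulate(batches, attempts_per_batch):
    """Return (agent_sent, aggregator_received) event counts.

    batches: event count of each request payload.
    attempts_per_batch: how many times each payload was sent
    (1 = acknowledged first time, 2 = one timeout then one retry, ...).
    Both inputs are hypothetical illustration data.
    """
    agent_sent = 0
    aggregator_received = 0
    for events, attempts in zip(batches, attempts_per_batch):
        # Assumption: the agent counts a payload once, regardless of retries.
        agent_sent += events
        # Assumption: the aggregator counts the payload on every attempt.
        aggregator_received += events * attempts
    return agent_sent, aggregator_received


# 100 payloads of 500 events each; one payload times out once and is retried.
batches = [500] * 100
attempts = [1] * 99 + [2]
sent, received = simulate(batches, attempts)
excess = (received - sent) / sent
print(sent, received, f"{excess:.1%}")
```

Under these assumptions, a single retried payload out of 100 is enough to make the receiving side about 1% higher than the sending side.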

To detect if this is the case, you can compare the following two queries. The vector source in the aggregator only increments vector_component_sent_events_total when it finishes processing a request, whereas it increments vector_component_received_events_total when it begins processing the request.

sum(rate(vector_component_sent_events_total{component_id="vector_source"}[5m]))
sum(rate(vector_component_sent_events_total{component_id="vector_sink"}[5m]))

If the above queries are comparable, then that is likely the issue.
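The comparison can be sketched as a small Python helper that takes the three rates (events per second) and reports which case applies. This is only a sketch: the function name `classify`, the `tolerance` threshold, and the result labels are hypothetical, and the interpretation of each case is an assumption based on the reasoning above rather than documented Vector behaviour.

```python
def classify(sink_sent, source_received, source_sent, tolerance=0.002):
    """Classify a gap between agent and aggregator event rates.

    sink_sent:       summed sent rate of the agents' vector_sink
    source_received: received rate of the aggregator's vector_source
    source_sent:     sent rate of the aggregator's vector_source
    tolerance:       hypothetical relative threshold for "comparable"
    """
    def close(a, b):
        # Relative comparison; the small floor avoids a zero threshold.
        return abs(a - b) <= tolerance * max(a, b, 1e-9)

    if close(source_received, sink_sent):
        return "no discrepancy"
    if close(source_sent, sink_sent):
        # The surplus is received but never forwarded downstream.
        return "consistent with timed-out and retried requests"
    if close(source_sent, source_received):
        # Everything received is also forwarded, so timeouts alone
        # do not explain the surplus.
        return "surplus is forwarded downstream; look for another cause"
    return "inconclusive"


print(classify(sink_sent=1000.0, source_received=1010.0, source_sent=1000.0))
print(classify(sink_sent=1000.0, source_received=1010.0, source_sent=1010.0))
```

The default tolerance of 0.2% is arbitrary; it only needs to be well below the roughly 1% gap being investigated.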

Also, if you upgrade to the latest version (v0.34.1), there are additional metrics that we can use to investigate:

Component Dropped Events:

component_discarded_events_total
- If the vector source in the aggregator is dropping events, then the above scenario is happening: requests are timing out in the agent and it is retrying them.

gRPC server metrics:

vector_grpc_server_messages_received_total
vector_grpc_server_messages_sent_total
vector_grpc_server_handler_duration_seconds
- This will give you a general sense of the vector source's response times to the agent.

Please try out the above and let us know what you find.

ee07b415 commented 11 months ago

Hi @dsmith3197, thanks for the information; I will try these out. We do have the component discarded events metric, so we know we didn't lose any events.

ee07b415 commented 11 months ago

Hi @dsmith3197, sum(rate(vector_component_sent_events_total{component_id="vector_source"}[5m])) is the same as sum(rate(vector_component_received_events_total{component_id="vector_source"}[5m])), so the chart looks the same after changing the query.

Screenshot 2023-12-13 at 12 32 54 PM

Screenshot 2023-12-13 at 12 30 09 PM

dsmith3197 commented 11 months ago

I see, thank you for checking.

I would still advise you to upgrade to Vector v0.34.1. This includes the additional server metrics mentioned above and a bug fix in sources that might help identify the issue.

Specifically, the v0.34.0 release includes the following:

Sources now correctly emit a log and increment component_discarded_events_total when incoming requests are cancelled before the events are pushed to downstream components.

Other than that, is it possible that there are other clients sending to the vector_source in the aggregator?
