vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Inconsistencies in `vector_component_*` Vector metrics #17329

Closed: skygrammas closed this issue 1 year ago

skygrammas commented 1 year ago


Problem

I'm trying to monitor the number of dropped log events for a humio sink. Looking at the `vector_component_sent_event_bytes_total` and `vector_component_received_event_bytes_total` metrics, I see that `vector_component_sent_event_bytes_total` reports a significantly higher byte count than `vector_component_received_event_bytes_total`. I've scoped both metrics down to the humio sink component, so I don't expect any difference between the bytes the component receives and the bytes it sends, since the component does not transform the log data. I could understand the bytes received being higher than the bytes sent and would interpret that as the sink dropping events; however, the bytes sent being higher than the bytes received, with no transformation happening inside the component, is puzzling (and concerning).

I would expect these two metric values to stay tightly bound for a sink component that is not dropping logs. I would not expect the number of bytes sent to be significantly higher than the number of bytes received for a component that does not transform the data it receives.

[Screenshot 2023-05-05 at 3:26:56 PM]

Configuration

customConfig:
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 0.0.0.0:8686
  log_schema:
    host_key: host
    message_key: message
    source_type_key: source_type
    timestamp_key: timestamp
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      self_node_name: ${NODE_NAME}
      glob_minimum_cooldown_ms: 2000
      max_line_bytes: 262144
      max_read_bytes: 10485760
    internal_logs:
      type: internal_logs
    vector_metrics:
      type: internal_metrics
  sinks:
    humio:
      type: humio_logs
      inputs:
        - kubernetes_logs
      token: "${HUMIO_INGEST_TOKEN}"
      endpoint: "${HUMIO_INGEST_ENDPOINT}"
      request:
        concurrency: adaptive
      compression: gzip
      encoding:
        codec: json
    prometheus_exporter:
      type: prometheus_exporter
      address: 0.0.0.0:9598
      default_namespace: vector
      inputs:
        - vector_metrics

Vector version: 0.26.0, Helm chart version: 0.18.0, Kubernetes version: 1.23

Version

Vector version 0.26.0

skygrammas commented 1 year ago

Perhaps the HTTP request metadata is inflating the number of bytes sent. Is Vector counting the entire packet size or just the data portion?

spencergilbert commented 1 year ago

👋 I haven't gotten around to investigating this yet, but `component_sent_event_bytes_total` should be incrementing by "the estimated JSON byte size of all events sent". The matching received metric should be doing the same.

`component_sent_bytes_total` uses just the `Content-Length` header for HTTP requests.
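
To make that distinction concrete, here is a rough sketch (not Vector's actual code; it assumes the `flate2` crate) contrasting the estimated JSON size of a batch with the compressed body length that a `Content-Length` based count would see when `compression: gzip` is enabled:

```rust
use std::io::Write;

use flate2::{write::GzEncoder, Compression};

fn main() {
    // A batch of events already encoded as JSON; the *_event_bytes_total metrics
    // are based on an estimate of this kind of size.
    let json_body: &[u8] =
        br#"[{"message":"hello","host":"node-1"},{"message":"world","host":"node-2"}]"#;

    // With `compression: gzip`, the body actually sent over HTTP is compressed,
    // and its length is what a Content-Length based metric would count.
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(json_body).unwrap();
    let wire_body = encoder.finish().unwrap();

    println!(
        "estimated JSON bytes = {}, Content-Length (gzip) = {}",
        json_body.len(),
        wire_body.len()
    );
}
```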

skygrammas commented 1 year ago

> 👋 I haven't gotten around to investigating this yet

Hey @spencergilbert 👋, no worries. I was discussing this concern with my team this morning and a teammate hypothesized that the request metadata may be inflating the metric, so I wanted to tack that onto this thread before I forgot.

> `component_sent_bytes_total` uses just the `Content-Length` header for HTTP requests.

So it seems the request metadata is not (or at least should not be) a factor here; good to know. Looking forward to hearing back whenever you or your team get a chance to look into this further.

neuronull commented 1 year ago

Actually, it looks like this sink was using the size in bytes:

https://github.com/vectordotdev/vector/blob/38c3f0be7b7d72ffa7d64976d8ce1d0ddb52f692/src/sinks/splunk_hec/common/service.rs#L117

until we just changed that to use the estimated JSON size in this commit:

https://github.com/vectordotdev/vector/commit/3b2a2be1b075344a92294c1248b09844f895ad72

which I believe will impact your measurements. I believe that change would not be released until v0.31.

skygrammas commented 1 year ago

> Actually, it looks like this sink was using the size in bytes:
>
> https://github.com/vectordotdev/vector/blob/38c3f0be7b7d72ffa7d64976d8ce1d0ddb52f692/src/sinks/splunk_hec/common/service.rs#L117
>
> until we just changed that to use the estimated JSON size in this commit:
>
> 3b2a2be
>
> which I believe will impact your measurements. I believe that change would not be released until v0.31.

These metrics are being exported via the `prometheus_exporter` sink in the configuration shared above. I don't think [or understand how] the bug found in the Splunk snippet you shared would impact the issue I'm reporting here.

neuronull commented 1 year ago

> These metrics are being exported via the `prometheus_exporter` sink in the configuration shared above. I don't think [or understand how] the bug found in the Splunk snippet you shared would impact the issue I'm reporting here.

The screenshot shows a filter on the `humio_logs` sink. That sink is a wrapper over the Splunk HEC logs sink.
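
Schematically, the relationship looks something like the sketch below (made-up types, not Vector's real ones): because `humio_logs` delegates to the Splunk HEC logs implementation, the byte accounting done in `src/sinks/splunk_hec/common/service.rs` is what a `humio_logs` component ends up reporting.

```rust
// Schematic only: not Vector's actual types.
struct SplunkHecLogsService;

impl SplunkHecLogsService {
    fn send(&self, body: &[u8]) -> usize {
        // imagine the sent-bytes / sent-event-bytes accounting happening here
        body.len()
    }
}

struct HumioLogsSink {
    inner: SplunkHecLogsService, // humio_logs wraps the Splunk HEC logs sink
}

fn main() {
    let sink = HumioLogsSink { inner: SplunkHecLogsService };
    let sent = sink.inner.send(b"{\"message\":\"hello\"}");
    println!("bytes handed to the shared Splunk HEC service: {sent}");
}
```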

skygrammas commented 1 year ago

> These metrics are being exported via the `prometheus_exporter` sink in the configuration shared above. I don't think [or understand how] the bug found in the Splunk snippet you shared would impact the issue I'm reporting here.
>
> The screenshot shows a filter on the `humio_logs` sink. That sink is a wrapper over the Splunk HEC logs sink.

Ah, okay. Thank you, understood. Would the change in byte measurement (from the raw byte size to the estimated JSON byte size) address the discrepancy between the amounts sent and received for the sink? The biggest concern is the delta between the bytes sent and received for a component, a humio sink, that is not doing any transformation on the data.

neuronull commented 1 year ago

> Ah, okay. Thank you, understood.

You're welcome!

> Would the change in byte measurement (from the raw byte size to the estimated JSON byte size) address the discrepancy between the amounts sent and received for the sink?

I do believe this would account for the discrepancy.

In the version of Vector from your capture, `vector_component_received_event_bytes_total` was calculated from the estimated JSON byte size, while `vector_component_sent_event_bytes_total` was calculated from the in-memory size in bytes.

Thus I believe this should be fixed in the v0.31.0 release.
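
For intuition about why an in-memory size can come out larger than the estimated JSON size, here is a rough sketch with made-up field names (assuming the `serde`/`serde_json` crates; Vector's actual event type is far more involved):

```rust
use serde::Serialize;

// Made-up event shape purely for illustration; Vector's LogEvent is more complex.
#[derive(Serialize)]
struct FakeEvent {
    message: String,
    host: String,
    timestamp: String,
}

fn main() {
    let event = FakeEvent {
        message: "hello from a pod".to_string(),
        host: "node-1".to_string(),
        timestamp: "2023-05-05T15:26:56Z".to_string(),
    };

    // Rough in-memory footprint: the struct layout plus the heap capacity of each
    // String. Accounting like this includes pointers, lengths, and capacities that
    // never hit the wire, so it tends to overshoot.
    let in_memory = std::mem::size_of::<FakeEvent>()
        + event.message.capacity()
        + event.host.capacity()
        + event.timestamp.capacity();

    // Estimated JSON byte size: the length of the JSON-encoded form, which is the
    // basis the received-side metric was already using.
    let json_len = serde_json::to_vec(&event).unwrap().len();

    println!("in-memory ~= {in_memory} bytes, JSON ~= {json_len} bytes");
}
```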

neuronull commented 1 year ago

> Thus I believe this should be fixed in the v0.31.0 release.

An alternative to waiting for that release would be to try out the nightly build.

trennepohl commented 1 year ago

I noticed this issue a while ago and asked around in Discord whether the metric was being calculated from the uncompressed or the compressed request body.

Anyway, I just tested the nightly build and this is the result.

Metrics in 0.31 look much more reliable.

The yellow line is the input from a vector-forwarder and the green line is the output of a vector-aggregator.

[screenshot_2023-07-05_at_14.37.40.png]

neuronull commented 1 year ago

Hey, that's great feedback to hear, @trennepohl, thanks! I'll go ahead and close this issue then.