vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Internal Metrics Hangs Forever #16561

Open smitthakkar96 opened 1 year ago

smitthakkar96 commented 1 year ago


Problem

Hello there,

We (Delivery Hero) are in the process of adopting Vector. We deployed Vector in our staging environment and started load testing it, and we noticed that internal_metrics stops updating when the pods get saturated. We use the datadog_metrics sink to send internal_metrics to Datadog. Running vector top shows that the internal_metrics pipeline is no longer flushing metrics, whereas the other sinks, such as the Datadog Logs and S3 sinks we use to ship Kubernetes logs, keep working fine. The impact is a lack of internal metrics until Vector is restarted. Attached is a screenshot of our monitoring dashboard: logs from Vector keep coming, but the panels that use internal metrics are blank.


This is possibly a duplicate of https://github.com/vectordotdev/vector/issues/13718, but I am not sure.

How did you generate the load?

We ran the load test on an EKS cluster of c6g.xlarge nodes with 3 Vector pods, each with CPU requests and limits of 2000m. To generate load, we deployed another 120-150 pods running Vector with a demo_logs source and a console (stdout) sink. The demo_logs output was tailed by the Datadog Agent and forwarded to Vector.
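
For reference, a minimal sketch of that load-generator configuration (the interval, format, and component names are illustrative assumptions, not the exact values we used):

[sources.demo_logs]
type = "demo_logs"
format = "json"
interval = 0.01          # illustrative per-pod event rate

[sinks.stdout]
type = "console"
inputs = [ "demo_logs" ]
encoding.codec = "json"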

Configuration

No response

Version

0.27.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 1 year ago

Hi @smitthakkar96 !

Thanks for this report! Could you provide the configuration file you are using? That would be helpful in reproducing.

smitthakkar96 commented 1 year ago

Here is the configuration; sorry, I forgot to include it earlier. Please let me know if you need anything else :)

vector.toml

data_dir = "/vector-data-dir"

[api]
enabled = true
address = "127.0.0.1:8686"

sources/datadog_agents.toml

type = "datadog_agent"
address = "[::]:8080"
multiple_outputs = false

sources/internal_logs.toml

type = "internal_logs"

transforms/remap_enrich_internal_metrics_with_static_tags.toml

type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.foo = "bar"
.tags.foo1 = "bar1"
"""

transforms/remap_logs_for_datadog.toml

type = "remap"

inputs = [ "datadog_agents" ]

source = """

.host = del(.hostname)

.ddtags = parse_key_value!(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")

.ddtags.vector_aggregator = get_hostname!()

.ddtags.env = "staging"

# Re-encode Datadog tags as a string for the `datadog_logs` sink

.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")

# Datadog Agents pass a "status" field that is stripped when ingested

del(.status)
"""

sinks/datadog_logs.toml

type = "datadog_logs"
inputs = [ "remap_logs_for_datadog", "internal_logs" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

sinks/datadog_metrics.toml

type = "datadog_metrics"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

I've also added a "How did you generate the load?" section to the original description.

smitthakkar96 commented 1 year ago

(screenshot attached: 2023-02-23 22:07)

Looking at the graphs again: during all these load tests there were stages where the buffers got full. Full buffers are expected to block incoming events, but after the buffer is drained the other sinks go back to behaving normally, while the datadog_metrics sink stops receiving metrics from the internal_metrics source. I will do another load test tomorrow to verify whether this happens exclusively when the buffer gets full.

smitthakkar96 commented 1 year ago

We did another load test today and observed that when the pods get saturated, internal_metrics keeps coming until the buffer is full. Once we stop the load test and the buffer has free space again, it should accept new events and internal_metrics should keep flowing.
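
For context, Vector's disk buffers apply backpressure to upstream components by default once they fill up; this is controlled by the sink's buffer.when_full option. A sketch of the relevant settings on the datadog_logs sink, with the default spelled out (we have not set when_full explicitly; "drop_newest" is mentioned only as the alternative behavior, not a recommendation):

buffer.type = "disk"
buffer.max_size = 268435488
buffer.when_full = "block"   # default: block and apply backpressure; "drop_newest" would discard new events instead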

smitthakkar96 commented 1 year ago

(screenshot attached: 2023-02-27 10:16)

I just noticed today that, after restarting Vector, our buffer byte size keeps growing.

jszwedko commented 1 year ago

Interesting, thanks for the details and configuration! I wouldn't expect backpressure in the pipeline that is sending from the datadog_agent source to the datadog_logs sink to affect your pipeline that is sending internal_metrics to datadog_metrics since they are completely separate pipelines. My one guess is that that pipeline is somehow being starved, but I'm not sure how that could happen.

The other behavior seems normal if Vector is not keeping up with the input: that the buffer will grow. Here it seems to be bumping up against the CPU limits you set. I'm curious also about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? I'm wondering if they could be constraining the throughput.

smitthakkar96 commented 1 year ago

@jszwedko thanks for your response. We have some more findings that I think might help bring some clarity.

(screenshot attached: 2023-03-01 16:30)

(screenshot attached: 2023-03-01 16:32)

We deployed Vector and didn't run any load tests this time; instead we just let Vector run for a few hours with regular staging traffic. We saw a few weird things, shown in the screenshots above.

I'm also curious about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? Could they be constraining the throughput?

Yes, these are EBS volumes attached to the Vector aggregator pods.

Would it be helpful if I reproduce this in a kind cluster and send over the manifests?

For the time being we have switched to the prometheus_exporter sink and added autodiscovery annotations to the pods so the Datadog Agent can scrape these metrics, but it would be good if we could send them to Datadog directly.
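
A sketch of that workaround sink, following the same file layout as the configuration above (the address shown is this sink's default listen address; the pod autodiscovery annotations are omitted here):

sinks/prometheus_exporter.toml

type = "prometheus_exporter"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
address = "0.0.0.0:9598"   # default listen address for this sink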

smitthakkar96 commented 1 year ago

Yesterday there was an outage at Datadog.

(screenshot attached: 2023-03-09 09:45)

We were curious how Vector behaved, so we opened the dashboard. As you can see, at one point when the buffer got full the throughput went to 0, and when Datadog recovered the throughput increased again. Checking the logs, we can see that logs are also being sent just fine, so everything works as expected. However, the buffer byte size never went down, and as you can see in the screenshot below, CPU and memory usage also never went down. I would like to know whether this is a telemetry issue or an issue with Vector itself.

(screenshot attached: 2023-03-09 09:48)

jszwedko commented 1 year ago

Interesting, thanks for sharing that @smitthakkar96 .

One guess I have is that, though throughput recovered, the sink still isn't sending events fast enough to drain the buffer. You could verify that by comparing the component_sent_events_total metric for the sink with buffer_received_events_total, to see whether the sent events are exceeding the incoming events.
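
For example (illustrative queries, assuming the internal metrics are scraped into Prometheus; the component_id value is a placeholder for the sink in question):

increase(vector_component_sent_events_total{component_id="datadog_logs"}[5m])

increase(vector_buffer_received_events_total{component_id="datadog_logs"}[5m])

If the first rate stays at or below the second, the buffer will never drain.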

z2665 commented 1 year ago

We have what appears to be the same problem in our Vector agent, with kubernetes_logs as the source and loki as the sink. The loki sink uses a 1 GiB disk buffer. The Vector agent takes up all the CPU of the pod limit, but the backend doesn't receive any logs. At the same time we can see that the event buffer fills up quickly and stops draining:

vector_buffer_byte_size{component_type="loki"}/vector_buffer_max_byte_size{component_type="loki"} 

Both component_sent_events_total and buffer_received_events_total drop to 0:

increase(vector_buffer_received_events_total{component_type="loki"}[1m])


increase(vector_component_sent_events_total{component_type="kubernetes_logs"}[1m])


This state persists until we restart the Vector agent. We have confirmed that Loki is working, that the network is working, and that the other Vector agents in the same cluster are working properly.
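
For reference, a minimal sketch of the kind of agent pipeline described here (the endpoint and label are placeholders; 1073741824 bytes is the 1 GiB buffer mentioned above):

[sources.kubernetes_logs]
type = "kubernetes_logs"

[sinks.loki]
type = "loki"
inputs = [ "kubernetes_logs" ]
endpoint = "http://loki.example.local:3100"   # placeholder endpoint
encoding.codec = "json"
labels.agent = "vector"                       # illustrative label
buffer.type = "disk"
buffer.max_size = 1073741824                  # 1 GiB disk buffer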

jszwedko commented 1 year ago

@z2665 do you see anything interesting in the log output? One thought would be that the buffer reader got stuck. Also, which version of Vector?

z2665 commented 1 year ago

@z2665 do you see anything interesting in the log output? One thought would be that the buffer reader got stuck. Also, which version of Vector?

The logs are normal; there's no output of any value. This issue has occurred several times in our production environment with versions 0.23 and 0.28, but the frequency is very low.

z2665 commented 1 year ago

Unfortunately, we observed the same problem after upgrading Vector to 0.31. The issue is not resolved.

dsmith3197 commented 1 year ago

Hi @smitthakkar96,

The issue originally reported here should be resolved in the new v0.34.0 release. There was a performance bottleneck in the Datadog Metrics sink that has now been fixed. Can you try it out and let us know if it resolves your issue?

https://github.com/vectordotdev/vector/pull/18759