smitthakkar96 opened this issue 1 year ago
Hi @smitthakkar96 !
Thanks for this report! Could you provide the configuration file you are using? That would be helpful in reproducing.
Here is the configuration, sorry I forgot to put it earlier. Please let me know if you need anything else :)
vector.toml
data_dir = "/vector-data-dir"
[api]
enabled = true
address = "127.0.0.1:8686"
sources/datadog_agents.toml
type = "datadog_agent"
address = "[::]:8080"
multiple_outputs = false
sources/internal_logs.toml
type = "internal_logs"
transforms/remap_enrich_internal_metrics_with_static_tags.toml
type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.foo = "bar"
.tags.foo1 = "bar1"
"""
transforms/remap_logs_for_datadog.toml
type = "remap"
inputs = [ "datadog_agents" ]
source = """
.host = del(.hostname)
.ddtags = parse_key_value!(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")
.ddtags.vector_aggregator = get_hostname!()
.ddtags.env = "staging"
# Re-encode Datadog tags as a string for the `datadog_logs` sink
.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")
# Datadog Agents pass a "status" field that is stripped when ingested
del(.status)
"""
sinks/datadog_logs.toml
type = "datadog_logs"
inputs = [ "remap_logs_for_datadog", "internal_logs" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488
sinks/datadog_metrics.toml
type = "datadog_metrics"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488
I've also added a "How did you generate the load?" section to the original description.
Just checking the graphs again: during all these load tests there were stages where the buffers got full. It's expected that a full buffer blocks incoming events, but after the buffer is flushed the other sinks start behaving normally again, while the datadog_metrics sink stops receiving metrics from the internal_metrics source. I will do another load test tomorrow to verify whether this happens exclusively when the buffer gets full.
We did another load test today and observed that when the pods get saturated, the internal_metrics keep coming until the buffer is full. Once we stop the load test and the buffer has free space again, it should accept new events and internal_metrics should keep flowing.
I also noticed today that, after restarting Vector, our buffer byte size keeps growing.
Interesting, thanks for the details and configuration! I wouldn't expect backpressure in the pipeline sending from the datadog_agent source to the datadog_logs sink to affect your pipeline sending internal_metrics to datadog_metrics, since they are completely separate pipelines. My one guess is that that pipeline is somehow being starved, but I'm not sure how that could happen.
The other behavior, the buffer growing, seems normal if Vector is not keeping up with the input. Here it seems to be bumping up against the CPU limits you set. I'm also curious about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? I'm wondering if they could be constraining the throughput.
@jszwedko thanks for your response. We have some more findings that I think might help bring some clarity.
We deployed Vector and didn't run any load tests this time; instead we just let Vector run for a few hours with regular staging traffic. We saw a few weird things:
Vector stopped sending internal_metrics to Datadog (is this expected?)
There is no update in vector top either; it just shows --/s.

I'm also curious about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? Could they be constraining the throughput?
Yes, these EBS volumes are attached to Vector Aggregator pods.
Would it be helpful if I reproduce this in a kind cluster and send over the manifests?
For the time being we switched to the Prometheus exporter sink and added auto-discovery annotations to the pod so the Datadog Agent can scrape these metrics, but it would be good if we could send these metrics directly.
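For reference, a minimal sketch of that workaround sink, following the same split-file layout as the config above (the file name and listen address are illustrative, not our exact config):

sinks/prometheus_exporter.toml
# Expose the enriched internal metrics on a scrape endpoint instead of pushing to Datadog
type = "prometheus_exporter"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
address = "0.0.0.0:9598"

The Datadog Agent then scrapes this endpoint via the auto-discovery annotations on the pod.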
Yesterday there was an outage on Datadog.
We were curious about how Vector behaved, so we opened the dashboard. As you can see, at one point when the buffer got full, the throughput went to 0, and then when Datadog recovered the throughput increased again. Upon checking the logs, we can see that logs are also being sent just fine, so everything works as expected. However, the buffer byte size never went down, and as you can see in the screenshot below, the CPU and memory usage also never went down. I would like to know whether this is a telemetry issue or an issue in Vector itself.
Interesting, thanks for sharing that @smitthakkar96.
One guess I have is that, even though throughput recovered, the sink still isn't sending events fast enough to drain the buffer. You could verify that by comparing the component_sent_events_total metric for the sink with buffer_received_events_total to see whether the sent events are exceeding the incoming events.
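For example, if these internal metrics are exported to Prometheus, a comparison along the following lines (the component_type label value for the datadog_logs sink is illustrative) would show whether the buffer is draining:
increase(vector_component_sent_events_total{component_type="datadog_logs"}[5m])
increase(vector_buffer_received_events_total{component_type="datadog_logs"}[5m])
If the first value does not consistently exceed the second while the buffer is non-empty, the buffer will not shrink.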
We have what appears to be the same problem in our Vector agent, with kubernetes_logs as the source and loki as the sink. The loki sink uses a 1 GiB disk buffer. The Vector agent takes up all the CPU allowed by the pod limit, but the backend doesn't receive any logs. At the same time we can see that the event buffer fills up quickly and stops falling:
vector_buffer_byte_size{component_type="loki"}/vector_buffer_max_byte_size{component_type="loki"}
Both component_sent_events_total and buffer_received_events_total drop to 0:
increase(vector_buffer_received_events_total{component_type="loki"}[1m])
increase(vector_component_sent_events_total{component_type="kubernetes_logs"}[1m])
This state persists until we restart the Vector agent. We have made sure that Loki is working, the network is working, and the other Vector agents in the same cluster are working properly.
@z2665 do you see anything interesting in the log output? One thought would be that the buffer reader got stuck. Also, which version of Vector?
The logs are normal; there is no output of any value. This issue has occurred several times in our production environment with versions 0.23 and 0.28, but the frequency is very low.
Unfortunately, we observed the same problem after upgrading Vector to 0.31. The issue is not resolved.
Hi @smitthakkar96,
The issue originally reported here should be resolved in the new v0.34.0 release. There was a performance bottleneck in the Datadog Metrics sink that has now been fixed. Can you try it out and let us know if it resolves your issue?
Problem
Hello there,
We (Delivery Hero) are in the process of adopting Vector. We deployed Vector in our staging environment and started load testing it. We noticed that internal_metrics stop updating when the pods get saturated. We use the datadog_metrics sink to send internal_metrics to Datadog. Running vector top shows that no more internal metrics are being flushed, whereas other sinks, like the Datadog Logs and S3 sinks used to ship k8s logs to Datadog and S3, keep working fine. The impact is a lack of internal_metrics until Vector is restarted. Attached is a screenshot of our monitoring dashboard: as you can see, logs from Vector keep coming, but the panels that use internal metrics are blank.

Possibly a duplicate of https://github.com/vectordotdev/vector/issues/13718, however I am not sure.
How did you generate the load?
We ran the load test on an EKS cluster running c6g.xlarge nodes, with 3 Vector pods having CPU requests and limits of 2000m. To generate load, we deployed another 120-150 pods running Vector with a demo_logs source and a stdout (console) sink. These demo logs were tailed by the Datadog Agent and later forwarded to Vector.
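A minimal sketch of such a load-generator config, using Vector's console sink to write to stdout (the format and interval shown are illustrative, not the exact values we used):

sources/demo_logs.toml
# Generate synthetic log events continuously
type = "demo_logs"
format = "json"
interval = 0.01

sinks/stdout.toml
# Write the demo logs to stdout so the Datadog Agent can tail them
type = "console"
inputs = [ "demo_logs" ]
target = "stdout"
encoding.codec = "json"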
Configuration
No response
Version
0.27.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response