Closed: @jonwinton closed this issue 1 year ago.
We're looking into this, will keep you posted, @jonwinton !
Thanks @barieom! I wanted to add this screenshot of steady-state CPU usage. The orange line is CPU requests, the red line is CPU limit. Looking at network traffic, there are no outliers in terms of higher network load on a single pod.
Let me know if I can supply anything else to help!
Thanks for filing this issue @jonwinton.
It looks like the Vector Helm chart was used to configure Vector.
Would you be able to provide the complete helm chart values used, and the chart version you are using?
This would expedite the investigation and reproduction effort.
Definitely @neuronull !
Chart version: 0.14.0. One comment on this: we're blocked from upgrading due to our current EKS version not having the autoscaling APIs defined in current versions of the chart. We're in the process of updating, but it'll be a while.
We use Argo to deploy the chart, but we have a fun pipeline that combines Helm args, so I need some time to get the full config together.
@neuronull here we go:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/name
                operator: In
                values:
                  - vector
          topologyKey: failure-domain.beta.kubernetes.io/zone
        weight: 100
autoscaling:
  enabled: true
  targetCPUUtilizationPercentage: 80
  targetMemoryUtilizationPercentage: 60
commonLabels:
  tags.datadoghq.com/service: vector
  tags.datadoghq.com/version: "0.24.0"
env:
  - name: DATADOG_API_KEY
    valueFrom:
      secretKeyRef:
        key: ...
        name: ...
  - name: DD_HOSTNAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  - name: VECTOR_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: VECTOR_VERSION
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['tags.datadoghq.com/version']
extraVolumeMounts:
  - mountPath: /mnt/secrets-store
    name: vector-secrets
    readOnly: true
extraVolumes:
  - csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: ...
    name: vector-secrets
podDisruptionBudget:
  enabled: true
  minAvailable: 10%
podLabels:
  tags.datadoghq.com/service: vector
podPriorityClassName: system-cluster-critical
role: Stateless-Aggregator
rollWorkload: true
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: <role id>
  create: true
updateStrategy:
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1
  type: RollingUpdate
haproxy:
  enabled: true
  image:
    tag: &version 2.6.6
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 1Gi
  autoscaling:
    enabled: true
    minReplicas: 9
    maxReplicas: 50
  podLabels:
    tags.datadoghq.com/service: vector-haproxy
    tags.datadoghq.com/version: *version
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - vector
            topologyKey: failure-domain.beta.kubernetes.io/zone
          weight: 100
  podAnnotations:
    ad.datadoghq.com/haproxy.checks: |
      {
        "haproxy": {
          "init_config": {},
          "instances": [{
            "use_openmetrics": true,
            "openmetrics_endpoint": "http://%%host%%:1024/metrics",
            "histogram_buckets_as_distributions": true,
            "collect_counters_with_distributions": true
          }]
        }
      }
  customConfig: |
    global
      log stdout format raw local0
      maxconn 4096
      stats socket /tmp/haproxy
      hard-stop-after {{ .Values.haproxy.terminationGracePeriodSeconds }}s

    defaults
      log global
      option dontlognull
      retries 10
      option redispatch
      option allbackups
      timeout client 10s
      timeout server 10s
      timeout connect 5s

    resolvers coredns
      nameserver dns1 kube-dns.kube-system.svc.cluster.local:53
      resolve_retries 3
      timeout resolve 2s
      timeout retry 1s
      accepted_payload_size 8192
      hold valid 10s
      hold obsolete 60s

    frontend stats
      mode http
      bind :::1024
      option httplog
      http-request use-service prometheus-exporter if { path /metrics }
      stats enable
      stats hide-version # Hide HAProxy version
      stats realm Haproxy\ Statistics # Title text for popup window
      stats uri /haproxy_stats # Stats URI
      stats refresh 10s

    frontend datadog-agent
      mode http
      bind :::6000
      option httplog
      option dontlog-normal
      default_backend datadog-agent

    backend datadog-agent
      mode http
      balance leastconn
      option tcp-check
      server-template srv 10 _datadog-agent._tcp.{{ include "vector.fullname" $ }}.{{ $.Release.Namespace }}.svc.cluster.local resolvers coredns check
```
> Chart version: 0.14.0. One comment on this: we're blocked from upgrading due to our current EKS version not having the autoscaling APIs defined in current versions of the chart. We're in the process of updating, but it'll be a while.
Hey @jonwinton IIRC there's nothing really stopping us from backporting the HPA options to older APIs. I think I stopped where I did because there was some amount of structural change that happened between API versions, and I figured if someone needed the support we'd get an issue opened.
We'd be happy to get an issue opened for that support if that would help you upgrade.
> @neuronull here we go:
Thanks @jonwinton !
Forgot to ask as well: what version of the Datadog Agent are you using? Was it also upgraded, or did its version stay the same? Can you share the DD Agent config as well?
Cheers~
Still investigating this but just had an observation.
In v0.24.0, #13406 contained a fix in the datadog_metrics sink. The sink was not properly utilizing the DD API key from the event metadata. Because your configuration uses the datadog_agent source and the datadog_metrics sink, and the datadog_agent source forwards the API key by default, v0.24.0 introduced a change in behavior for your config.
I'm wondering if the problem here is that the DD API key being sent to Vector from the Datadog Agent's configuration is different from the one being configured in the Helm chart via the secrets (or is otherwise overriding it)?
By default, the datadog_agent source is configured to store the API key if it is found in the headers of the Datadog Agent requests. To disable that, set store_api_key to false:
https://vector.dev/docs/reference/configuration/sources/datadog_agent/#store_api_key
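On your datadog_agent source that would look roughly like this (a sketch; keep the rest of your source settings as they are):

```yaml
sources:
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    # don't store/forward the API key received in the Agent's request headers
    store_api_key: false
```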
Would you mind giving that a try @jonwinton ? Thanks!
Sorry for the delay @neuronull and @spencergilbert ! Had some holiday time off and everything. Let me get you that info.
@neuronull I did test setting store_api_key and it didn't help.
> Forgot to ask as well: what version of the Datadog Agent are you using? Was it also upgraded, or did its version stay the same? Can you share the DD Agent config as well?
@neuronull the version of the Datadog Agent has been locked to 7.39.1 for the duration of this test.
> We'd be happy to get an issue opened for that support if that would help you upgrade.
@spencergilbert I think we're going to be stuck on the autoscaling/v1 API for the next 3-6 months, so if it's possible to support those versions that would be amazing.
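(For context, the only HPA shape our control plane currently accepts is roughly the sketch below; names and replica counts are placeholders, and as far as I know autoscaling/v1 only exposes a CPU utilization target.)

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: vector            # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment      # Stateless-Aggregator role renders a Deployment
    name: vector          # placeholder
  minReplicas: 2          # placeholder
  maxReplicas: 50         # placeholder
  targetCPUUtilizationPercentage: 80
```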
@jonwinton I see you mentioned you tried setting store_api_key. To clarify, did you try setting it to false? That is, store_api_key: false on the datadog_agent source. As @neuronull pointed out, there was a fix in 0.24.0 to start using, by default, the API key the Agent sent metrics with when forwarding to Datadog, to match the behavior of logs.
Also, is that the complete Helm configuration? I expected to see the Vector configuration in it, but I just see the HAProxy configuration.
@jszwedko yeah, tried setting it to false and it didn't change anything.
That is the full configuration other than the Vector configuration in the original issue body, which I'll re-paste below:
```yaml
data_dir: /vector-data-dir
api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false
sources:
  internal_logs:
    type: internal_logs
  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs
    store_api_key: false
sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json
  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"
```
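(For completeness, in our Helm values this block is nested under the chart's customConfig key, roughly like this sketch, with the same elisions as above:)

```yaml
customConfig:
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 127.0.0.1:8686
    playground: false
  sources:
    # ... as above
  sinks:
    # ... as above
```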
> In v0.24.0, #13406 contained a fix in the datadog_metrics sink.

Just correcting myself: that change I was pointing out was included in v0.23.0, so the whole line of inquiry which followed is invalid.
Hi @jonwinton,
I've been trying to reproduce this locally and am not having much luck so far. :disappointed: Will summarize what I've tried and some observations, and maybe something will jump out at you.
Additional questions for you:
- Do you have any additional log snippets you could provide? For example, is Vector logging any errors (not debug logs)? And perhaps the error logs from HAProxy?
- Since we're having trouble reproducing it, would you be open to trying some nightly builds between v0.23.3 and v0.24.0, and observing the first build which exhibits the issue? This could give us hints as to which commits might have caused the issue and additionally raise confidence in the issue.
- Have you attempted this upgrade from 0.23.3 to 0.24.0 multiple times or just once?
- Have you performed similar upgrades of Vector in this environment before?
General observations:
- The message "error shutting down IO" from the Vector logs seems to be either benign (which I suspect is likely), not a new issue with 0.24.0, or not specifically tied to the 504 error observed in the Agent logs. That error appears to originate in the hyper crate.
- So far I haven't been able to get a 504 error to show in the Agent logs.
My setup:
Permutations I've tried:
dogstatsd program
```python
from datadog import initialize, statsd
import time
import random
import sys


def log(msg):
    sys.stdout.write('{}: {}\n'.format(time.strftime("%H:%M:%S"), msg))
    sys.stdout.flush()


options = {
    'statsd_host': '172.17.0.3',
    'statsd_port': 8125
}

log("initializing")
initialize(**options)
log("initialized")

while (1):
    log("distribution")
    for n in range(0, 99):
        name = "foo_metric.distribution_" + str(n)
        value = random.randint(0, 999)
        statsd.distribution(name, value, tags=["environment:baz"])
    time.sleep(1)
```
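(To drive it I just install the datadog Python client with `pip install datadog` and run the script against the Agent's dogstatsd host port; the 172.17.0.3:8125 address above is simply my local Agent.)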
Agent helm values
```yaml
datadog:
  apiKey: <redacted>
  containerExclude: "name:vector"
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    enabled: false
clusterAgent:
  enabled: false
agents:
  useConfigMap: true
  image:
    tag: 7.39.1
  customAgentConfig:
    kubelet_tls_verify: false
    vector:
      logs:
        enabled: true
        url: "http://vector-haproxy.default:8282"
      metrics:
        enabled: true
        url: "http://vector-haproxy.default:8282"
    dogstatsd_non_local_traffic: true
  ports:
    - containerPort: 8125
      hostPort: 8125
      name: dogstatsdport
      protocol: UDP
```
Vector helm values
Effectively the same as your provided values, with the exception of tagging specific images and hard-coding the DD API key.
Hey! Thanks for all the details! I'll try and answer everything here and will come back with deeper answers for some things I need to retrieve/update log levels for.
> Do you have any additional log snippets you could provide? For example, is Vector logging any errors (not debug logs)? And perhaps the error logs from HAProxy?

Let me go get some of those a little later. Pre-planned DR game day going on, so I'm a little distracted.

> Since we're having trouble reproducing it, would you be open to trying some nightly builds between v0.23.3 and v0.24.0, and observing the first build which exhibits the issue? This could give us hints as to which commits might have caused the issue and additionally raise confidence in the issue.

I'm sad I didn't think about doing this already. I will definitely do this.

> Have you attempted this upgrade from 0.23.3 to 0.24.0 multiple times or just once?

Multiple times!

> Have you performed similar upgrades of Vector in this environment before?

Yeah, we jumped onto Vector in the 0.1x versions and slowly bumped up each version until 0.24.0.
One other piece of context that might be helpful: this issue first appeared in our largest staging environment, so I'm wondering if it's related to volume. Are y'all running load testing benchmarks in CI? Screenshot to show the load where we're first encountering the error.
> I will definitely do this.

Thanks a bunch!

> One other piece of context that might be helpful: this issue first appeared in our largest staging environment, so I'm wondering if it's related to volume. Are y'all running load testing benchmarks in CI? Screenshot to show the load where we're first encountering the error.
We do some degree of load testing within our soak regression tests: generally those soak tests will try to hit Vector with the allocated resources available. But we don't have explicit testing to evaluate usage under different loads.
It is good to know that it might be volume-related; I can take that into consideration for repro efforts going forward.
So sorry for the delay @neuronull, but I'm working through the nightlies now. Will get you a report this week so it's ready for the new year.
@neuronull ok here are the results:
The issue first shows up with the nightly-2022-08-17-distroless-libc tag.

Thanks for bisecting that down @jonwinton! Just to confirm, do I take the above to mean that nightly-2022-08-16-distroless-libc did not exhibit the issue?
@jszwedko correct! Here's a screenshot showing the version was nightly-2022-08-16-distroless-libc right before deploying the 2022-08-17 tag, and everything was working as expected.
I'm working with DataDog Support to give you access to a notebook with the entire breakdown, but screenshots for now.
Thanks @jonwinton ! That would seem to narrow down the suspicious commits to https://github.com/vectordotdev/vector/compare/ba1e1c4...01c7a36 then. Nothing immediately jumps out, but it should give us a much better place to start investigating from. The person investigating this is OOO this week, but we'll pick it back up in the new year. Appreciate all of the added context!
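(For anyone following along, running something like `git log --oneline ba1e1c4..01c7a36` in a local clone of the repo lists the commits in that range.)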
Sounds good! Yeah I was beginning to look at https://github.com/vectordotdev/vector/pull/13973 since we only see the errors come up for distribution data, but I need a Rust pro to help out. That said, if any tagged releases can be pushed out that have any changes to test in our env, I can easily deploy them.
See y'all next year!
Thanks for going through that effort and providing all that data @jonwinton , really helps!
I believe the hdrhistogram crate updated in #13973 can be ruled out as it only seems to be used in some example code.
After looking at that commit range, the one that stands out most suspiciously is #13960.
At first glance it looks like it's unrelated, because the code is only operating over Counter metric types, and here the affected type is distribution.
But then I noticed this in the helm values: "collect_counters_with_distributions": true
I'm currently working on finding out what that actually translates to in terms of the Agent's behavior.
But so far that commit is really the only one that jumps out to me.
We could also try a test of a private build containing the commits from 8-17 but minus this one.
@neuronull dang, nice digging!
A private image works, or if you can give me the build commands for generating the libc image, I can push an image into our private ECR repo. I tried digging into the build pipeline a bit, but my lack of Rust familiarity is holding me back a bit.
Thanks @jonwinton !
Here are the commands to get to the correct state of the repo:
test = ["vector-common/test"]
To build the docker image to your local cache:
Let us know how that goes or if you have any problems! Very curious to hear the result.
@jonwinton , sorry for the spam- Ignore that last comment's instructions (there are some incorrect bits). We're working on improving the procedure to follow. In the meantime, I'll create the image and push it to the vector repo for you.
Alright, this image should do the trick @jonwinton. It contains that nightly build minus ef22eb9.
timberio/vector:39c98e3-distroless-libc
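If it helps, pulling that in should just be an image override in your Helm values, roughly like this sketch:

```yaml
image:
  repository: timberio/vector
  tag: 39c98e3-distroless-libc
```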
Perfect! I'll test this now
@neuronull confirmed that we don't see the same issue with this version! Thank you for pushing that version out!
I guess next steps would be entirely on y'all's end?
Oh dang, looking at that PR more, we're interested in maintaining the fix for this bug: https://github.com/vectordotdev/vector/issues/13870
We're also dealing with interval issues and this would be helpful once we can safely upgrade.
Great to hear that the build had the expected outcome!
Yes, the next steps are for us to figure out what is wrong with that commit :)
We'll definitely want to maintain the fix functionality. Good to know you also require that. Will keep this thread posted on progress~
Hi @jonwinton , I believe this issue may have been solved by #14741 , which was included in v0.24.2.
Would you mind trying that build?
I think we just didn't realize that it might have been already fixed in a patch release, because on the surface it looked so unrelated to any fixes in v0.24.1 or v0.24.2.
Hmm I can retry later, but I went as high as 0.25.0 and still had the issue...I will confirm tomorrow
Interesting. Ok, thank you!
@neuronull confirmed it's still borked in the v0.26.0 release.
Thanks for the follow-up on that @jonwinton !
In looking at your screenshots, it looks like you have Vector running in multiple pods. I'm just curious if you could describe how many instances you have of the Datadog Agent running, and how many instances of Vector running?
I'm investigating another issue that may overlap with this one. Thanks!
Would you also be willing to share your HAProxy config?
@neuronull the HAProxy config is here, but pasting below without the Vector config:
```yaml
haproxy:
  enabled: true
  image:
    tag: &version 2.6.6
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 1Gi
  autoscaling:
    enabled: true
    minReplicas: 9
    maxReplicas: 50
  podLabels:
    tags.datadoghq.com/service: vector-haproxy
    tags.datadoghq.com/version: *version
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - vector
            topologyKey: failure-domain.beta.kubernetes.io/zone
          weight: 100
  podAnnotations:
    ad.datadoghq.com/haproxy.checks: |
      {
        "haproxy": {
          "init_config": {},
          "instances": [{
            "use_openmetrics": true,
            "openmetrics_endpoint": "http://%%host%%:1024/metrics",
            "histogram_buckets_as_distributions": true,
            "collect_counters_with_distributions": true
          }]
        }
      }
  customConfig: |
    global
      log stdout format raw local0
      maxconn 4096
      stats socket /tmp/haproxy
      hard-stop-after {{ .Values.haproxy.terminationGracePeriodSeconds }}s

    defaults
      log global
      option dontlognull
      retries 10
      option redispatch
      option allbackups
      timeout client 10s
      timeout server 10s
      timeout connect 5s

    resolvers coredns
      nameserver dns1 kube-dns.kube-system.svc.cluster.local:53
      resolve_retries 3
      timeout resolve 2s
      timeout retry 1s
      accepted_payload_size 8192
      hold valid 10s
      hold obsolete 60s

    frontend stats
      mode http
      bind :::1024
      option httplog
      http-request use-service prometheus-exporter if { path /metrics }
      stats enable
      stats hide-version # Hide HAProxy version
      stats realm Haproxy\ Statistics # Title text for popup window
      stats uri /haproxy_stats # Stats URI
      stats refresh 10s

    frontend datadog-agent
      mode http
      bind :::6000
      option httplog
      option dontlog-normal
      default_backend datadog-agent

    backend datadog-agent
      mode http
      balance leastconn
      option tcp-check
      server-template srv 10 _datadog-agent._tcp.{{ include "vector.fullname" $ }}.{{ $.Release.Namespace }}.svc.cluster.local resolvers coredns check
```
> I'm just curious if you could describe how many instances you have of the Datadog Agent running, and how many instances of Vector running?
So this comes up in our staging environment where we have the following:
When we deploy any SHA/version beyond v0.23.3, the HPA on Vector tries to run 100+ Vector pods and we still see requests consistently failing even when HPA scales out.
Aside from the symptoms outlined in the initial comment @jonwinton, are you by any chance seeing a significant reduction in any metrics when observed in the Datadog UI? Specifically, count and rate?
@neuronull it's hard to tell, because anytime we go beyond v0.23.3 we see such a huge increase in errors that we see reductions across the board. Because it takes a certain amount of traffic from the DD Agent --> Vector to notice this, I have to deploy into an env where our engineers are working, so I can never keep it on for too long.
> Specifically, count and rate?
So the fun thing about all this: we run an entirely Prometheus-based ecosystem, so we don't submit any rate metrics, only count/gauge/histogram/summary types.
But regarding your question: looking at the test time from yesterday with v0.26.0, these error charts show error counts at the HAProxy layer and show that the Vector backend begins having issues responding. The 2xx rate drops, so I'm assuming the requests are either retried and make it through, or they're dropped entirely; I don't have a great way of identifying which count metrics decrease, though.
This is a screenshot of one of our most popular count metrics. While we do see an initial drop in datapoints, it seems to get back to normal as the number of Vector pods scales out.
Does that help at all? Not a great answer, but I'm hoping some of that was useful
> Does that help at all? Not a great answer, but I'm hoping some of that was useful
While not the "smoking gun" I was looking for, it is all useful info. Thanks!
Hi @jonwinton, I have a build that contains a fix candidate. Because I've yet to be able to repro this locally, it's a bit of a shot in the dark, but it is essentially a partial revert of some of the changes that went into the commit we identified earlier, while still maintaining the desired behavior of aggregating metrics.
This is the image: timberio/vector:9c2662cfa_01-13-2023-distroless-libc
If you wouldn't mind testing that out, it'd be appreciated!
Note that this one is based on v0.25.2.
Of course! Let me check this now!
@neuronull unfortunately the issue is still around. I definitely did see missing count metrics while this version was deployed, but when I rolled back the test the values did fill in. The data seemed to be buffered in the DD Agents and then backfilled once I began the rollback.
Thanks so much for the quick turnaround. And good to know that you also observed a decrease in count metrics.
I just found another potential problem. Unfortunately as I haven't been able to repro locally, your efforts to try builds are currently our best method of narrowing down the root cause.
If you don't mind another, I'm prepping a build now...
Not at all! Bring them on!
timberio/vector:86635f66b_2023-01-12-distroless-libc
@neuronull sorry for the delay on that! Same results as last time, but I was able to grab a screenshot showing some missing counter metrics for the same charts that we've been referencing:
After rollback the data does fill in:
But in general, same results.
No worries! Thanks for providing that additional data about the missing count metrics. This is helpful. I'll regroup here and see about next steps. I'm hopeful since we have been able to identify the problematic commit. Will keep you posted~
To re-confirm @jonwinton, you're using dogstatsd as an ingestion method, is that correct?
A note for the community
Problem
We noticed the following issue when upgrading from v0.23.3 to v0.24.0:
(1) CPU became spiky/less consistent. The charts below show how CPU explodes, Datadog forwarder error rates climb, and HAProxy 5xx rates climb.
(2) 504 error codes coming from the Datadog Agents writing to Vector, for only the /api/beta/sketches endpoint. This is also apparent in errors surfacing from HAProxy (deployed via the Vector Helm chart). HAProxy is using a leastconn balance strategy.
(3) Error logs coming from Vector for shutting down connections
Configuration
Version
0.24.0-distroless-libc
Debug Output
Example Data
I'm not sure what the Datadog Agent is sending to this endpoint
Additional Context
We're running in AWS EKS 1.21 in
References
No response