vectordotdev / vector


Datadog Agent Source Regression in v0.24.x #15292

Closed jonwinton closed 10 months ago

jonwinton commented 1 year ago


Problem

We noticed the following issue when upgrading from v0.23.3 to v0.24.0:

(1) CPU usage became spiky/less consistent

The charts below show how CPU explodes while Datadog forwarder error rates and HAProxy 5xx rates climb.

Screenshot 2022-11-17 at 3 24 52 PM

(2) 504 errors returned to the Datadog Agents writing to Vector, but only for the /api/beta/sketches endpoint

2022-11-15 22:52:12 UTC | CORE | ERROR | (pkg/forwarder/worker.go:184 in process) | Error while processing transaction: error "504 Gateway Time-out" while sending transaction to "http://vector-haproxy.vector.svc.cluster.local:6000/api/beta/sketches", rescheduling it: "<html><body><h1>504 Gateway Time-out</h1>\nThe server didn't respond in time.\n</body></html>\n"

This is also apparent in errors surfacing from HAProxy (deployed via the Vector Helm chart). HAProxy is using a leastconn balance strategy.

(3) Error logs from Vector about shutting down connections

{"host":"vector-599576bd9b-w2bq7","message":"error shutting down IO: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::proto::h1::conn","target":"hyper::proto::h1::conn"},"pid":1,"source_ty
pe":"internal_logs","timestamp":"2022-11-16T20:11:46.533762489Z"}
{"host":"vector-599576bd9b-w2bq7","message":"connection error: error shutting down connection: Transport endpoint is not connected (os error 107)","metadata":{"kind":"event","level":"DEBUG","module_path":"hyper::server::server::new_svc","target":"hyper::se
rver::server::new_svc"},"pid":1,"source_type":"internal_logs","timestamp":"2022-11-16T20:11:46.533777847Z"}

Configuration

data_dir: /vector-data-dir

api:
  enabled: true
  address: 127.0.0.1:8686
  playground: false

sources:
  internal_logs:
    type: internal_logs

  # Datadog Agent telemetry
  datadog_agent:
    type: datadog_agent
    address: "0.0.0.0:6000"
    multiple_outputs: true # To automatically separate metrics and logs

sinks:
  console:
    type: console
    inputs:
      - internal_logs
    target: stdout
    encoding:
      codec: json

  # Datadog metrics output
  datadog_metrics:
    type: datadog_metrics
    inputs:
      - <inputs>...
    api_key: "${DATADOG_API_KEY}"
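
For reference, with multiple_outputs: true the datadog_agent source exposes separate named outputs, datadog_agent.metrics and datadog_agent.logs, and sinks reference them by those names. Below is an illustrative sketch of that wiring only; the redacted <inputs> placeholder above is left as-is and the sink name here is made up.

  # Illustrative only: how a sink wires up to the source's named outputs
  # when multiple_outputs is enabled.
  datadog_metrics_example:
    type: datadog_metrics
    inputs:
      - datadog_agent.metrics   # metrics-only output of the datadog_agent source
    api_key: "${DATADOG_API_KEY}"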

Version

0.24.0-distroless-libc

Debug Output

I can only reproduce this issue in critical environments where I can't capture this debug output :(

Example Data

I'm not sure what the Datadog Agent is sending to this endpoint

Additional Context

We're running in AWS EKS 1.21.

References

No response

jonwinton commented 1 year ago

@neuronull we use in-app Prometheus clients to generate metrics, which are then collected by the Datadog Agent's OpenMetrics integration (docs). The DD Agent then forwards them (though I'm not sure of the exact format) to Vector (docs).
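
For concreteness, here is a minimal sketch of the two Agent-side pieces being described, with placeholder values. Only the Vector URL is taken from the 504 log earlier in this issue; the check endpoint and namespace are made up, and the key layout assumes the Agent's standard OpenMetrics check and Vector forwarding options.

# conf.d/openmetrics.d/conf.yaml -- scrape the in-app Prometheus endpoint
instances:
  - openmetrics_endpoint: http://my-app.default.svc:9090/metrics   # placeholder
    namespace: my_app                                              # placeholder
    metrics:
      - ".*"

# datadog.yaml -- forward the Agent's metrics to Vector instead of the Datadog intake
vector:
  metrics:
    enabled: true
    url: "http://vector-haproxy.vector.svc.cluster.local:6000"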

neuronull commented 1 year ago

Hi @jonwinton,

We're now theorizing that, at the high volume of your environment, the aggregation algorithm may not be scaling well, and that this could be causing the issues you're observing. Would it be feasible for you to over-provision Vector, say at 2x or 2.25x what it was in the recent tests? In the last screenshot, if autoscaling settled at roughly 60 pods, try 120 or 130. This would help us understand whether that theory has legs; if it does, we can shift focus to optimizing the algorithm.
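
In Kubernetes terms, that over-provisioning can be forced by raising the HPA floor to roughly double the current pod count. A minimal sketch follows; the object names are assumptions about what the Helm chart created, the replica numbers mirror the 60 -> 120 example above, and autoscaling/v2beta2 is used since the cluster is on 1.21.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: vector            # assumption: whatever HPA name the chart generated
  namespace: vector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vector          # assumption: the chart's Deployment name
  minReplicas: 120        # ~2x the ~60 pods autoscaling settled at
  maxReplicas: 260
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80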

jonwinton commented 1 year ago

@neuronull yeah I can do that, but it might need to wait until tomorrow. Just to confirm the test: we want to run it at steady state for a set amount of time scaled up to see if the CPU usage and error rates return to normal?

And just a heads up, we still have two larger envs 😬 Also, would it be possible to backport the interval fix (referenced here) while we continue to work on this? The interval issue is causing a lot of problems for us and it would be great to have a fix for that 🀞

neuronull commented 1 year ago

yeah I can do that, but it might need to wait until tomorrow.

Thanks! No worries.

Just to confirm the test: we want to run it at steady state for a set amount of time scaled up to see if the CPU usage and error rates return to normal?

Yes, the key being to over-provision by roughly 2x. If it autoscales beyond that, that's okay, but the idea is exactly as you said: see whether the CPU usage and errors / metric hits return to normal with 2x or more Vector instances.

Also, would it be possible to backport the interval fix (https://github.com/vectordotdev/vector/issues/15292#issuecomment-1372477689) while we continue to work on this? The interval issue is causing a lot of issues for us and it would be great to have a fix for that 🀞

What release are you looking to have that backported into? v0.23? cc @jszwedko

jonwinton commented 1 year ago

What release are you looking to have that backported into? v0.23?

Yeah that would be perfect!

neuronull commented 1 year ago

We're taking this opportunity to create a process for this kind of request, where we can publish a build based off a branch without going through the complete patch release cycle. As it stands, it's a manual process, and it's delicate to get a system into the correct configuration to generate the build (creating the branch with the necessary changes is trivial).

So we are working on turning that into a GitHub Action, to utilize the CI infra we have. Then we could essentially automate the build process down to just a few button clicks once a git branch is pushed to the remote.

Currently shooting for next week to accomplish that πŸ™
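
As a rough, hypothetical sketch of what that could look like (the workflow name, secrets, and build command here are placeholders, not the actual workflow the team built): a manually triggered job that builds and pushes an image from whichever branch it is dispatched on.

# Hypothetical sketch only -- names, secrets, and the build command are placeholders.
name: one-off-build
on:
  workflow_dispatch:        # run manually from the Actions UI on a chosen branch

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3            # checks out the dispatched ref
      - name: Build image
        run: docker build -t timberio/vector:${{ github.ref_name }}-one-off .
      - name: Push image
        run: |
          echo "${{ secrets.DOCKERHUB_TOKEN }}" | \
            docker login -u "${{ secrets.DOCKERHUB_USER }}" --password-stdin
          docker push timberio/vector:${{ github.ref_name }}-one-off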

neuronull commented 1 year ago

An update @jonwinton - we will still be pursuing the automation of these one-off builds, but we won't be able to get to it in the short term, so in the meantime I'll go ahead and spin up a build that backports the interval fix onto the last patch release of v0.23. Hopefully I can get that out today; if not, early next week.

Are you still planning to try out over provisioning vector? Cheers~

neuronull commented 1 year ago

Ok, here is the build. It is from this branch:

https://github.com/vectordotdev/vector/compare/v0.23.3...neuronull/backport_dd_metrics_interval_fix

timberio/vector:2023-01-27_backport_dd_metrics_interval_fix-distroless-libc

jonwinton commented 1 year ago

πŸ™‡ Sorry for the delay @neuronull! I appreciate you backporting this πŸ™‡

I'm definitely still planning to 2x the env and run the test, just got caught up in a migration and went heads down on that. I'll be able to come back to this on Wednesday or Thursday of this week!

jonwinton commented 1 year ago

Ok! Working on this now!

jonwinton commented 1 year ago

@neuronull here are the findings:

Starting Image: timberio/vector:2023-01-27_backport_dd_metrics_interval_fix-distroless-libc
Starting Replicas: 19

Test Image: timberio/vector:86635f66b_2023-01-12-distroless-libc (it was the most recent change)
Final Replica Count: 700

Resource Requests:

resources:
  limits:
    cpu: "2"
    memory: 1Gi
  requests:
    cpu: "1"
    memory: 1Gi

Screenshot 2023-02-09 at 8 42 07 AM

Screenshot 2023-02-09 at 8 44 20 AM

Screenshot 2023-02-09 at 8 45 58 AM

From here we do see that the top CPU-consuming pods stop bursting above their CPU limit (2 vCPU) as we scale, which makes sense.

neuronull commented 1 year ago

Thanks a bunch @jonwinton ! This is essentially what we expected to see.

I'll dive into the performance of that algorithm.

neuronull commented 1 year ago

I'll dive into the performance of that algorithm.

FYI, this is queued for next week, as I have other priorities in flight.

neuronull commented 1 year ago

Hi @jonwinton , apologies for the lengthy silence!

I have a fix-candidate build available, which contains optimizations to that algorithm. The performance gain varies depending on the shape of the input data, but the improvement is up to roughly 66% in some cases.

Very curious to hear if this alleviates the issues you are seeing. Would you mind testing it out when you have time?

timberio/vector:2023-03-03_optimize_dd_metrics_aggregation-distroless-libc
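
If it helps, pulling in a one-off image like this via the Helm chart is usually just an image override in the values file. A sketch (key names assumed from the chart's conventional layout; double-check against your chart version):

image:
  repository: timberio/vector
  tag: 2023-03-03_optimize_dd_metrics_aggregation-distroless-libc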

jonwinton commented 1 year ago

@neuronull amazing! Thanks for this! I'm going on call for our team tomorrow and will definitely test it out then 😬

neuronull commented 1 year ago

@neuronull amazing! Thanks for this! I'm going on call for our team tomorrow and will definitely test it out then 😬

Hey @jonwinton ! Have you had a chance to try that out? Very eager to hear the results πŸ€“ .

jonwinton commented 1 year ago

Hey! Sorry for that delay, the DD outage last week threw off a lot of things 😬

So some results...

I bumped the HPA to 200 before the test so that we'd start with a high number of pods. Once the new image was deployed, CPU was double that of the old implementation. It took scaling to 400 pods to reach parity between the two versions in terms of resource utilization. At the tail end, I bumped to 600 pods and we can see it drop a little more.

Screenshot 2023-03-16 at 11 56 08 AM

But during this time, error rates from the Datadog Agent --> Vector stayed elevated 😭. I even bumped HAProxy from 9 to 40 pods.

Screenshot 2023-03-16 at 11 55 29 AM

It's important to keep in mind that in this env, we only run 20 Vector pods right now.

neuronull commented 1 year ago

Whoa... intriguing results! The shape of the input data definitely has an effect on the algorithm's performance, so one thing we might want to do, if possible, is get a sample of the data so that I can profile the function with that input.

But before that, a couple of things to check: is CPU throttling going on (that would be reported in a cpu.throttled metric in Datadog)? Additionally, do you have the env var VECTOR_THREADS set, or the --threads arg passed to vector?

neuronull commented 1 year ago

Another thing to try: you could set VECTOR_THREADS to match the value in your resources.limits.cpu.
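
For concreteness, that amounts to a pod spec fragment along these lines (a sketch only, reusing the limits posted earlier in this thread; VECTOR_THREADS is Vector's env-var equivalent of --threads):

# Fragment of the Vector container spec, not a complete manifest.
containers:
  - name: vector
    image: timberio/vector:2023-03-03_optimize_dd_metrics_aggregation-distroless-libc
    env:
      - name: VECTOR_THREADS
        value: "2"          # match resources.limits.cpu below
    resources:
      limits:
        cpu: "2"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi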

neuronull commented 1 year ago

Hi @jonwinton !

Aside from the above queries/settings to try in Vector, if those don't improve anything, we can try capturing some data from the Agent log to see what shape of data is coming into Vector. This should be easier than plugging in a packet capture between the Agent and Vector.

The steps are basically:

This might give me enough to work with, without having to go to the more involved step of a packet capture.

Thanks again for all your assistance and patience here!

4wdonny commented 1 year ago

We're running into what I believe is also this issue. Following along for updates.

neuronull commented 1 year ago

We're running into what I believe is also this issue. Following along for updates.

Hi @4wdonny , thanks for letting us know.

Are you using the same components (HAProxy) and a similarly large volume? Just wondering if there is anything in your setup that differs.

aashery-square commented 10 months ago

@neuronull πŸ‘‹ I've picked up this bug from Jon Winton at Cash. Since it looks like there's a fix out for this, would you be able to provide us with a test image containing the fix that we can demo?

jszwedko commented 10 months ago

@neuronull πŸ‘‹ I've picked up this bug from Jon Winton at Cash. Since it looks like there's a fix out for this, would you be able to provide us with a test image containing the fix that we can demo?

Hey! We'd definitely be interested to know if this fix resolves it for you too. Would you be able to try the latest nightly build? It will include this change.