micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

DataDog plugin triggers blocking due to infinite metric metadata update request failures #4056

Open platinummonkey opened 1 year ago

platinummonkey commented 1 year ago

This is fixed in https://github.com/micrometer-metrics/micrometer/pull/3329, but the original responders seem to have gone completely AWOL. I'm opening this issue deliberately to raise attention.

shakuzen commented 1 year ago

Sorry for the lack of response on the pull request. It's on our radar but we haven't had the bandwidth to review it lately.

As for the title of this issue, could you provide more details? Metadata is sent at the end of each publish for metrics that haven't had metadata successfully sent yet. Where is blocking happening?

platinummonkey commented 1 year ago

We block it because the Micrometer logic retries indefinitely, spamming our intake in a DoS-like fashion :)

shakuzen commented 1 year ago

Thanks for the clarification on the blocking. I still don't understand the "infinitely" part, though. Once an application has successfully sent metadata for a metric, it does not send it again for the life of that application; or at least that's the intended behavior. An attempt to send the same metadata should only happen if it did not get a successful response, if the application is restarted, or if it's a different instance.
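
To make the intended behavior described above concrete, here is a minimal sketch of the per-metric bookkeeping; it is not Micrometer's actual implementation, and the class and method names (`MetadataTracker`, `publishMetadata`, `postMetadata`) are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class MetadataTracker {

    // Hypothetical sketch, not Micrometer's actual code: remember which
    // metric names have had their metadata accepted, so that each publish
    // only posts metadata that has not been successfully sent yet.
    private final Set<String> metadataSent = ConcurrentHashMap.newKeySet();

    void publishMetadata(Iterable<String> metricNames) {
        for (String name : metricNames) {
            if (metadataSent.contains(name)) {
                continue; // already accepted once; never resend for this instance
            }
            if (postMetadata(name)) {
                metadataSent.add(name); // only a successful response marks it as sent
            }
            // On failure the name stays unmarked, so it is retried on the next publish.
        }
    }

    private boolean postMetadata(String name) {
        // Placeholder for the HTTP call to the metadata endpoint; returns true on 2xx.
        return true;
    }
}
```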

platinummonkey commented 1 year ago

Right, this can happen with a 429, 404, 403, 401, or 400; it appears the upstream response is not taken into account. Regardless, back to the other reason for the PR: the current method is an inefficient and (in the event of errors) lossy push format compared to DogStatsD, which is backed by the Agent's retry behavior and optimized intake paths.
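
Assuming those status codes are what keep the resend loop going, one possible fix is to distinguish transient failures from permanent ones before queueing metadata for another attempt. The sketch below is hypothetical and not Micrometer's current logic; `shouldRetryMetadata` is an illustrative name.

```java
public class MetadataRetryPolicy {

    // Hypothetical sketch: treat only rate limiting and server errors as
    // retryable. A 400/401/403/404 will fail identically on every publish,
    // so resending the same metadata each time only adds load on the intake.
    static boolean shouldRetryMetadata(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return false; // accepted: never resend
        }
        if (statusCode == 429 || statusCode >= 500) {
            return true;  // plausibly transient: retry on the next publish
        }
        return false;     // permanent client error: stop retrying this payload
    }

    public static void main(String[] args) {
        System.out.println(shouldRetryMetadata(429)); // true
        System.out.println(shouldRetryMetadata(403)); // false
    }
}
```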

shakuzen commented 1 year ago

If there's additional error handling that should be done, we'd appreciate details of it. While I understand your preferred solution is the PR, the fact is that not all users are going to instantly switch over to it in a new minor version. So if there is a bug we should fix, we'd be interested in helping users (and it sounds like doing so could alleviate some unnecessary overhead on the Datadog side). The existing direct-to-Datadog approach would not entirely go away any time soon even if the PR is merged, after all.

platinummonkey commented 1 year ago

I think this can suffice as a bug report for that. Feel free to change the issue description, but I'm not currently interested in solving that issue. The direct API approach is unoptimized for us, uses a very, very old intake path, and is lossy; we would actually prefer that people switch the default to the StatsD approach, which also comes with additional distribution support.

Is there a specific reason, one that can be clearly articulated, to hold onto the direct-to-API approach indefinitely (aside from the need to migrate)?

For the time remaining, I am shifting my attention back to #3329; I'll re-merge upstream and fix the new conflicts.
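
For reference, routing metrics through the local Datadog Agent's DogStatsD listener (the approach preferred above) uses micrometer-registry-statsd rather than the direct-to-API Datadog registry. The sketch below assumes a default Agent listening on localhost:8125; adjust for your deployment.

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

public class DogStatsdSetup {
    public static void main(String[] args) {
        // Send metrics to the local Datadog Agent's DogStatsD listener rather
        // than directly to the Datadog API.
        StatsdConfig config = new StatsdConfig() {
            @Override
            public String get(String key) {
                return null; // use defaults for anything not overridden below
            }

            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.DATADOG; // emit the DogStatsD line protocol
            }

            @Override
            public String host() {
                return "localhost"; // assumed Agent address
            }

            @Override
            public int port() {
                return 8125; // default DogStatsD UDP port
            }
        };

        StatsdMeterRegistry registry = new StatsdMeterRegistry(config, Clock.SYSTEM);
        registry.counter("example.requests").increment();
    }
}
```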

github-actions[bot] commented 9 months ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

platinummonkey commented 9 months ago

This is still a problem and is affecting customers.

github-actions[bot] commented 9 months ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

platinummonkey commented 9 months ago

Still the same problem; bumping again.