micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Support Datadog HTTP API v2 #5469

Open kyungjun-pe opened 2 months ago

kyungjun-pe commented 2 months ago

Due to various issues when transmitting metrics, I contacted Datadog and was told to upgrade the API version.

While investigating the issue, I found that the Datadog API endpoint is hard-coded:

code line

Are you planning to upgrade the API version? The recommendation is to migrate from /api/v1/series to /api/v2/series.

Related information is attached below. According to Datadog, the upgrade improves transmission efficiency and stability.


POST https://api.datadoghq.com/api/v1/series
The metrics end-point allows you to post time-series data that can be graphed on Datadog’s dashboards. The maximum payload size is 3.2 megabytes (3200000 bytes). Compressed payloads must have a decompressed size of less than 62 megabytes (62914560 bytes).

If you’re submitting metrics directly to the Datadog API without using DogStatsD, expect:

- 64 bits for the timestamp
- 64 bits for the value
- 40 bytes for the metric names
- 50 bytes for the timeseries
- The full payload is approximately 100 bytes. However, with the DogStatsD API, compression is applied, which reduces the payload size.

-----

POST https://api.datadoghq.com/api/v2/series
The metrics end-point allows you to post time-series data that can be graphed on Datadog’s dashboards. The maximum payload size is 500 kilobytes (512000 bytes). Compressed payloads must have a decompressed size of less than 5 megabytes (5242880 bytes).

If you’re submitting metrics directly to the Datadog API without using DogStatsD, expect:

- 64 bits for the timestamp
- 64 bits for the value
- 20 bytes for the metric names
- 50 bytes for the timeseries
- The full payload is approximately 100 bytes.
jonatan-ivanov commented 2 months ago

This might be the first time we have an issue about adding support for v2. I think it is possible to use v2, but I would clearly separate your issue from moving to v2, since lots of users have been using this registry through the v1 API without transmission issues for years. Also, we need to investigate what impact this has, what the benefits/drawbacks are, and the effort involved.

It might also be the case that using v2 will not solve your issue, but Datadog support wants to move users to the v2 API. Based on the data you provided, I'm not sure I see an improvement, quite the opposite: the v2 limits you quoted (500 KB payload, 5 MB decompressed) are much smaller than the v1 limits (3.2 MB payload, 62 MB decompressed).

I don't see how this results in better transmission efficiency and stability. Could you please tell us more about the issue? Are you hitting a rate limit, does Datadog say the payload is too big, or does it just silently drop your data? Are there any errors on the Micrometer side?

There is a batch size parameter in Micrometer you can play with: https://github.com/micrometer-metrics/micrometer/blob/9917c67363f4507291b407edea57aebd2508b950/micrometer-core/src/main/java/io/micrometer/core/instrument/push/PushRegistryConfig.java#L89-L91

If you are hitting a throughput rate limit, you can try increasing it (fewer but bigger requests); if you hit a payload size limit, try decreasing it (more but smaller requests). You can also try to modify DatadogMeterRegistry to use v2, but right now I'm quite skeptical that it would solve the issue. There is also the StatsD registry, which would need an extra component in your infra.
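
For reference, here is a minimal sketch of tuning the batch size when building the registry programmatically (the value 5_000 is purely illustrative; when going through Spring Boot, the same setting is exposed as the Datadog export batch-size property, with the exact property name depending on the Boot version):

import io.micrometer.core.instrument.Clock;
import io.micrometer.datadog.DatadogConfig;
import io.micrometer.datadog.DatadogMeterRegistry;

public class DatadogRegistryFactory {

    public static DatadogMeterRegistry create(String apiKey) {
        DatadogConfig config = new DatadogConfig() {
            @Override
            public String apiKey() {
                return apiKey;
            }

            @Override
            public int batchSize() {
                // Lower this if Datadog times out reading large payloads,
                // raise it if you are hitting a request-rate limit.
                return 5_000;
            }

            @Override
            public String get(String key) {
                return null; // accept defaults for everything else
            }
        };
        return new DatadogMeterRegistry(config, Clock.SYSTEM);
    }
}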

kyungjun-pe commented 2 months ago

We received the following error while using it:

[] [datadog-metrics-publisher] ERROR i.m.d.DatadogMeterRegistry - failed to send metrics to datadog: Unable to read payload i/o timeout

So I talked to a Datadog engineer about the related issues, and he recommended using v2.

We are currently sending metrics directly to Datadog via this library, without StatsD, in a Spring project.

Also, you can enter a uri in the Spring configuration options, but it seems to be processed as a url in the actual code, which is a bit confusing.

jonatan-ivanov commented 2 months ago

So since this is a timeout, you can:

  1. Increase the timeout (you can configure the timeout in HttpSender; see the sketch after this list)
  2. Decrease the batch size (smaller requests might take less time for Datadog to process, but they also mean a bit more overhead for them)
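
A minimal sketch of option 1, assuming you build the registry yourself and use the default HttpUrlConnectionSender (the timeout values are illustrative):

import java.time.Duration;

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.ipc.http.HttpUrlConnectionSender;
import io.micrometer.datadog.DatadogConfig;
import io.micrometer.datadog.DatadogMeterRegistry;

public class DatadogRegistryWithTimeouts {

    public static DatadogMeterRegistry create(DatadogConfig config) {
        // Replace the default HttpSender with one that allows more time for
        // Datadog to accept the payload before the request times out.
        HttpUrlConnectionSender sender =
                new HttpUrlConnectionSender(Duration.ofSeconds(5), Duration.ofSeconds(30));
        return DatadogMeterRegistry.builder(config)
                .clock(Clock.SYSTEM)
                .httpClient(sender)
                .build();
    }
}

In a Spring Boot application you would typically not build the registry yourself; the connect/read timeouts can usually be set through the Datadog export properties instead (again, the exact property names depend on the Boot version).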

So I talked to a Datadog engineer about the related issues, and he recommended using v2.

If the suggestions above don't help, I would talk to Datadog support instead of engineers. Again, there is no guarantee that moving to v2 will solve your issue, but you can try it out on your own if you really want to by modifying the Datadog registry.

Also, you can enter a uri in the Spring configuration options, but it seems to be processed as a url in the actual code, which is a bit confusing.

I'm not sure I understand this.

kyungjun-pe commented 2 months ago

This timeout error does not occur just once; it keeps occurring for several tens of minutes.

I don't think this problem can be solved by increasing the timeout period.

And when this error occurred, Datadog said that no metrics were received.

Also, you can enter a uri in the Spring configuration options, but it seems to be processed as a url in the actual code, which is a bit confusing.

I think it should be host, not uri, because internally only the host part is used and the path is hard-coded.

I think it would be a good idea to use v1 by default and allow users to choose the version.

shakuzen commented 2 months ago

I think it would be a good idea to use v1 by default and allow users to choose the version.

I think we would have to do it that way, since the metric name size is more restricted in the v2 API, according to what you shared. This means previously valid metric names would become invalid when switching to the v2 API and would need to be truncated. That's going to break dashboards and metric queries for users whose metrics end up truncated.

I've marked the issue as an enhancement request to support the v2 API. We'll have to check on any other changes, and as discussed above, we would need to retain support for v1 by default. A pull request would be welcome if someone wants to work on this.
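
Purely as an illustration of what "retain v1 by default and let users opt in" could look like (nothing like this exists in Micrometer today, and the property name apiVersion is hypothetical):

// Hypothetical sketch only: apiVersion() is not an existing Micrometer property.
// It shows one way v2 support could be made opt-in while keeping v1 the default.
public interface VersionedDatadogConfig extends io.micrometer.datadog.DatadogConfig {

    default String apiVersion() {
        String version = get(prefix() + ".apiVersion");
        return version == null ? "v1" : version; // default to v1 so existing users are unaffected
    }
}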

jonatan-ivanov commented 2 months ago

This timeout error does not occur just once; it keeps occurring for several tens of minutes.

This seems like a backend or network error to me. Still not sure moving to v2 will fix it.

I don't think this problem can be solved by increasing the timeout period.

Maybe but I'm curious what makes you think that. Did you try?

And when this error occurred, Datadog said that no metrics were received.

If this is a network issue or an issue with an API Gateway/Load Balancer, this makes sense.

I think it should be host, not uri, because internally only the host part is used and the path is hard-coded. I think it would be a good idea to use v1 by default and allow users to choose the version.

I see, I think uri is used instead of host because that way you can also specify the protocol (http vs. https):

If you need to publish metrics to an internal proxy en route to datadoghq, you can define the location of the proxy with this.
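
For example, a sketch of a config that overrides uri() to point at an internal proxy (the proxy address and API key are placeholders); the registry appends the API path itself:

import io.micrometer.datadog.DatadogConfig;

public class ProxyDatadogConfig implements DatadogConfig {

    @Override
    public String uri() {
        // Scheme + host of an internal proxy; the registry appends the hard-coded API path.
        return "https://datadog-proxy.internal.example.com";
    }

    @Override
    public String apiKey() {
        return "YOUR_API_KEY"; // placeholder
    }

    @Override
    public String get(String key) {
        return null; // accept defaults for everything else
    }
}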

kyungjun-pe commented 2 months ago

This timeout error does not occur just once; it keeps occurring for several tens of minutes.

However, while this problem was occurring, there were no problems with our network, and Datadog said there were no problems on their side.