tomkerkhove / promitor

Bringing Azure Monitor metrics where you need them.
https://promitor.io
MIT License
Provide support for new Azure Monitor Metrics Dataplane API #2284

Closed tomkerkhove closed 3 weeks ago

tomkerkhove commented 1 year ago

Proposal

Provide support for the new Azure Monitor Metrics Dataplane API (preview), which has higher limits and allows scraping resources in batches.

https://azure.microsoft.com/en-us/updates/public-preview-azure-monitor-metrics-dataplane-api-released/

This should be configurable so that end-users can opt into this new API alongside the existing ARM API.
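To illustrate what "scraping resources in batch" could mean for a scraper, here is a minimal Python sketch of the planning step only. It makes no Azure calls. The `Resource` type, the `plan_batches` helper, the grouping key (subscription, region, resource type) and the default batch size of 50 are all assumptions for illustration; check them against the Azure Monitor batch API documentation before relying on them. This is not Promitor's implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Resource(NamedTuple):
    # Hypothetical shape of a resource to scrape; not a Promitor type.
    resource_id: str
    subscription_id: str
    region: str
    resource_type: str


def plan_batches(resources, max_batch_size=50):
    """Group resources that could share one batch request, then chunk each group.

    Assumption: a batch request can only mix resources from the same
    subscription, region and resource type, up to max_batch_size per call.
    """
    groups = defaultdict(list)
    for resource in resources:
        key = (resource.subscription_id, resource.region, resource.resource_type)
        groups[key].append(resource.resource_id)

    batches = []
    for key, ids in groups.items():
        # Split each group into chunks no larger than max_batch_size.
        for start in range(0, len(ids), max_batch_size):
            batches.append((key, ids[start:start + max_batch_size]))
    return batches
```

A configuration switch could then decide whether the scraper sends one request per planned batch or falls back to one ARM request per resource.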

Component

Scraper

Contact Details

No response

tomkerkhove commented 9 months ago

This is now GA: https://techcommunity.microsoft.com/t5/azure-observability-blog/azure-monitor-announcing-general-availability-of-azure-monitor/ba-p/4041394#:~:text=So%2C%20what%20is%20Azure%20Monitor,a%20higher%20capacity%20querying%20experience.

tomkerkhove commented 9 months ago

To make it even more accessible to developers across different programming languages, the Azure Metrics Data Plane API now offers client libraries for Java, JavaScript, .NET, Python, and Go languages. This support enables developers to integrate and interact with the API seamlessly in their preferred language, streamlining the development process and making it easier to leverage the power of Azure Metrics.

Resources:

tomkerkhove commented 1 month ago

@hkfgo Can you please also update https://changelog.promitor.io/?

tomkerkhove commented 1 month ago

@hkfgo Can you please see if local tests are working with OpenTelemetry? I tried shipping a new version 3 times, but all those tests time out because the metrics are not showing up.

https://dev.azure.com/tomkerkhove/Promitor/_build/results?buildId=14883&view=results

hkfgo commented 1 month ago

> @hkfgo Can you please see if local tests are working with OpenTelemetry? I tried shipping a new version 3 times, but all those tests time out because the metrics are not showing up.
>
> https://dev.azure.com/tomkerkhove/Promitor/_build/results?buildId=14883&view=results

Sure, I can test it out in our cloud env. We have OTEL Collectors running there. In my experience it takes on average more than 3 tries for the OpenTelemetry CI to pass :/ In the most recent PR I merged, I had to retry 5 times.

hkfgo commented 1 month ago

I got some unfortunate news @tomkerkhove: the OTEL Collector sink might actually be broken. I observe one successful metric export, then nothing afterwards. That's probably why the CI tests were sometimes able to pass.

Not sure which version broke OTEL or why no one has filed an issue yet.

hkfgo commented 1 month ago

But in any case, I'm working hard to track down the issue.

tomkerkhove commented 4 weeks ago

Thanks! Do you know if it's related to the new API change or broken in general?

hkfgo commented 4 weeks ago

> Thanks! Do you know if it's related to the new API change or broken in general?

Unlikely, since there hasn't been a code change on the OTEL sink for a year, so code changes in the past year should have no effect there. I'm running through older versions to pin down the latest version with a working OTEL sink (and thus when it broke).

You are sure the OTEL sink worked smoothly at some point, right?

hkfgo commented 4 weeks ago

@tomkerkhove I tracked down the latest version with a working OTEL sink, which is 2.8.0. That also means 2.9.0 (https://github.com/tomkerkhove/promitor/releases/tag/Scraper-v2.9.0) was where things broke, over 1.5 years ago. Among the list of changes, I suspect one of these two PRs broke it: https://github.com/tomkerkhove/promitor/pull/2239 and https://github.com/tomkerkhove/promitor/pull/2235. There was also an upgrade to .NET 7. I'll need a bit more time to pin down the offending change.

Taking a step back: given this piece of information, I'd consider all the recent changes safe, since they aren't the ones breaking the OTEL sink. I'd also not consider the broken OTEL sink a blocker for releasing these recent changes.
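The version hunt described here can be sketched as a binary search over releases. This is a minimal illustration in Python, not the procedure that was actually used. `last_good_version` and the `is_good` callback are hypothetical names, and the sketch assumes the versions are ordered oldest to newest and that every version after the first broken one is also broken.

```python
def last_good_version(versions, is_good):
    """Return the newest version for which is_good(version) is True, or None.

    Assumes `versions` is ordered oldest to newest and that all good versions
    come before all bad ones (a single breaking point).
    """
    lo, hi = 0, len(versions) - 1
    last_good = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            # This version works, so the break happened later.
            last_good = versions[mid]
            lo = mid + 1
        else:
            # This version is broken, so the break happened here or earlier.
            hi = mid - 1
    return last_good
```

In practice `is_good` would deploy the given version and check that the OTEL sink keeps exporting. With the versions named in this thread, `last_good_version(["2.8.0", "2.9.0", "2.11.2"], check)` would return `"2.8.0"` if `check` only succeeds for 2.8.0.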

tomkerkhove commented 4 weeks ago

I am not sure I agree with that, given the tests up to v2.11.2 were passing: https://dev.azure.com/tomkerkhove/Promitor/_build/results?buildId=13845&view=results

tomkerkhove commented 4 weeks ago

> Taking a step back: given this piece of information, I'd consider all the recent changes safe, since they aren't the ones breaking the OTEL sink. I'd also not consider the broken OTEL sink a blocker for releasing these recent changes.

Not really, or we'd break other end-users, which is not what we should do.

hkfgo commented 4 weeks ago

[Image: Screenshot 2024-11-01 at 2 56 36 PM]

The reason previous CI runs were able to pass is that the first export is often successful, but subsequent ones are not. The screenshot above should help illustrate the point. That was from running 2.11.2. This failure mode is insidious IMO because it can "trick" the CI into passing while the sink is actually broken.

Version 2.8.0 was the latest one that gave a continuous stream of metrics.
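One way a test could guard against this "first export succeeds, then silence" failure mode is to require a continuous stream instead of a single data point. The following Python sketch is an assumption about how such a check might look, not Promitor's test code. `is_continuous`, its parameters and the tolerance value are hypothetical, and timestamps are plain seconds.

```python
def is_continuous(export_times, observed_until, expected_interval,
                  min_exports=3, tolerance=0.5):
    """Return True only if exports kept arriving for the whole observation window.

    export_times:      timestamps (seconds) at which an export was seen
    observed_until:    timestamp (seconds) at which observation stopped
    expected_interval: expected seconds between two exports
    tolerance:         allowed slack, as a fraction of expected_interval
    """
    if len(export_times) < min_exports:
        # A single successful export must not be enough to pass.
        return False

    times = sorted(export_times)
    max_gap = expected_interval * (1 + tolerance)

    # Gaps between consecutive exports, plus the gap from the last export
    # to the end of the observation window.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    gaps.append(observed_until - times[-1])
    return all(gap <= max_gap for gap in gaps)
```

With this check, a run that sees one export at the start and nothing afterwards fails, even though a "did any metric show up?" assertion would pass.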

tomkerkhove commented 3 weeks ago

Let's move this to a separate issue, but we'll need to fix that because:

  1. End-users should not be broken
  2. There is no way the tests will pass otherwise

If we cannot resolve it by EOW, then I'll skip the OTEL tests just to get the version out and re-enable them afterwards.

hkfgo commented 3 weeks ago

Ack, I'll open a new issue