tomkerkhove / promitor

Bringing Azure Monitor metrics where you need them.
https://promitor.io
MIT License

Broken and discontinued Azure PaaS metric series shown in Grafana UI #2090

Open tenutensing opened 2 years ago

tenutensing commented 2 years ago

Report

Hi there,

I've been trying to pull Azure PaaS metrics via the Promitor exporter. The metrics are successfully scraped and pushed to the Grafana UI, but unfortunately the metric series shown are not continuous and break intermittently.

Has anyone faced this issue before? I'm using a slightly older version of the charts. Screenshot attached for reference.

Kindly help.

promitor-agent-resource-discovery
Version: 0.6.0

promitor-agent-scraper
Version: 2.5.1

Regards, Tenu

promitor_discontinued_metric_series

Expected Behavior

Expecting the metric series to be continuous, without missing scrape points. Screenshot attached.

promitor_continuous_metric_series

This screenshot was taken during the proof-of-concept task; it was working fine then.

Actual Behavior

Metric series are broken intermittently.

promitor_discontinued_metric_series

Steps to Reproduce the Problem

Chart versions:

promitor-agent-resource-discovery
Version: 0.6.0

promitor-agent-scraper
Version: 2.5.1

Component

Scraper

Version

2.5.1

Configuration

Configuration:

# Add your scraping configuration here

Logs

example

Platform

Microsoft Azure

Contact Details

tenutensing@gmail.com

tomkerkhove commented 2 years ago

Would you mind trying our latest version and sharing the outcome, please?

Also, it's hard to tell without the Promitor configuration; can you share it, please?

tenutensing commented 2 years ago

Sure, I can try with the latest version and check. I just wanted to understand the root cause of this behavior first, i.e. whether or not it is caused by a configuration mistake on my end. This was previously working when we did the POC.

As requested, I'm attaching the configuration.

FYI: I created a parent chart that has dependencies on "promitor-agent-resource-discovery" and "promitor-agent-scraper", and installed both charts in a single step.

Parent chart: monitoring-promitor

Dependent charts: promitor-agent-scraper, promitor-agent-resource-discovery
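
For clarity, the parent chart wires these in as Helm dependencies, roughly like this (a minimal sketch assuming a Helm 3-style Chart.yaml; the repository URL and exact version pins are illustrative rather than copied from our files):

```yaml
# Chart.yaml of the parent chart (illustrative sketch)
apiVersion: v2
name: monitoring-promitor
version: 0.1.0
dependencies:
  - name: promitor-agent-resource-discovery
    version: 0.6.0
    repository: https://charts.promitor.io   # assumed Promitor Helm chart repository
  - name: promitor-agent-scraper
    version: 2.5.1
    repository: https://charts.promitor.io
```

Both charts then get installed together with a single install of the parent chart.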

configuration.zip
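
In outline, the metrics declaration in that zip has the following shape (a trimmed, illustrative sketch assuming the standard Promitor v1 declaration format; the names and values below are placeholders, not our actual ones):

```yaml
version: v1
azureMetadata:
  tenantId: <tenant-id>
  subscriptionId: <subscription-id>
  resourceGroupName: <default-resource-group>
metricDefaults:
  aggregation:
    interval: 00:05:00
  scraping:
    # Cron schedule used for every metric unless overridden per metric
    schedule: "0 * * ? * *"
metrics:
  - name: storageaccount_availability
    description: "Availability of the storage accounts"
    resourceType: StorageAccount
    azureMetricConfiguration:
      metricName: Availability
      aggregation:
        type: Average
    # Resources are resolved via the resource discovery agent
    resourceDiscoveryGroups:
      - name: storage-accounts
  # ...further metrics for the other PaaS resources follow the same pattern
```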

tenutensing commented 2 years ago

I tried to scrape a single metric from a single PaaS resource to analyze the behavior, and I don't see broken scrape points in that scenario.

However, when configuring the scraper to collect multiple metrics from multiple PaaS resources, the scrape points are discontinuous.

Comparison

Multiple

Pod logs are attached for reference.

all_metrics0.txt single_metrics0.txt

Sample traces in pod logs:

```
[11:16:42 FTL] Failed to scrape resource for metric 'storageaccount_availability'
System.Threading.Tasks.TaskCanceledException: The operation was canceled.
 ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
 ---> System.Net.Sockets.SocketException (125): Operation canceled
   --- End of inner exception stack trace ---
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Security.SslStream.g__InternalFillBufferAsync|2150[TReadAdapter](TReadAdapter adap, ValueTask`1 task, Int32 min, Int32 initial)
   at System.Net.Security.SslStream.ReadAsyncInternal[TReadAdapter](TReadAdapter adapter, Memory`1 buffer)
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.ReadNextResponseHeaderLineAsync(Boolean foldedHeadersAllowed)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at Microsoft.Rest.RetryDelegatingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Microsoft.Azure.Management.ResourceManager.Fluent.Core.ProviderRegistrationDelegatingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at Promitor.Agents.Core.RequestHandlers.ThrottlingRequestHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) in /src/Promitor.Agents.Core/ThrottlingRequestHandler.cs:line 48
   at Microsoft.Azure.Management.ResourceManager.Fluent.Core.UserAgentDelegatingHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Microsoft.Azure.Management.Monitor.Fluent.MetricDefinitionsOperations.ListWithHttpMessagesAsync(String resourceUri, String metricnamespace, Dictionary`2 customHeaders, CancellationToken cancellationToken)
   at Microsoft.Azure.Management.Monitor.Fluent.MetricDefinitionsOperationsExtensions.ListAsync(IMetricDefinitionsOperations operations, String resourceUri, String metricnamespace, CancellationToken cancellationToken)
   at Microsoft.Azure.Management.Monitor.Fluent.MetricDefinitionsImpl.ListByResourceAsync(String resourceId, CancellationToken cancellationToken)
   at Microsoft.Azure.Management.Monitor.Fluent.MetricDefinitionsImpl.Microsoft.Azure.Management.Monitor.Fluent.IMetricDefinitions.ListByResourceAsync(String resourceId, CancellationToken cancellationToken)
   at Promitor.Integrations.AzureMonitor.AzureMonitorClient.QueryMetricAsync(String metricName, String metricDimension, AggregationType aggregationType, TimeSpan aggregationInterval, String resourceId, String metricFilter, Nullable`1 metricLimit) in /src/Promitor.Integrations.AzureMonitor/AzureMonitorClient.cs:line 74
   at Promitor.Core.Scraping.AzureMonitorScraper`1.ScrapeResourceAsync(String subscriptionId, ScrapeDefinition`1 scrapeDefinition, TResourceDefinition resourceDefinition, AggregationType aggregationType, TimeSpan aggregationInterval) in /src/Promitor.Core.Scraping/AzureMonitorScraper.cs:line 72
   at Promitor.Core.Scraping.Scraper`1.ScrapeAsync(ScrapeDefinition`1 scrapeDefinition) in /src/Promitor.Core.Scraping/Scraper.cs:line 103
```

tomkerkhove commented 2 years ago

Thanks for testing!

tenutensing commented 2 years ago

I would appreciate it if you could help us understand the root cause. Our objective is to monitor multiple PaaS resources and metrics via the Promitor exporter. With these missing scrape points, we're unable to promote this feature to production.

tomkerkhove commented 2 years ago

I will take a look, but it looks like requests are timing out, so I'd first make sure that the network is fine.

Also, did you check the system metrics with regard to throttling?

tenutensing commented 2 years ago

Here are the throttling metrics for your reference:

```
# HELP promitor_ratelimit_arm Indication how many calls are still available before Azure Resource Manager (ARM) is going to throttle us.
# TYPE promitor_ratelimit_arm gauge
promitor_ratelimit_arm{tenant_id="xxx",subscription_id="xxx",app_id="xxx"} 11983 1660032355059

# HELP promitor_ratelimit_arm_throttled Indication concerning Azure Resource Manager are being throttled. (1 = yes, 0 = no).
# TYPE promitor_ratelimit_arm_throttled gauge
promitor_ratelimit_arm_throttled{tenant_id="xxx",subscription_id="xxx",app_id="xxx"} 0 1660032355059
```

tenutensing commented 2 years ago

[UPDATE] I've opened a case with Microsoft with respect to this issue, in order to figure out whether this behavior is due to any throttling happening on the Azure Monitor APIs that the Promitor exporter uses on the back end.

Below are the MS support team's comments:

+ We picked the metric 'memory percent' and plotted it via Metrics Explorer for the last hour; we could see a continuous time series:

(screenshot 1)

+ We checked a longer time frame as well (24 hours) and could also see a continuous time series:

(screenshot 2)

+ You also mentioned that you tried to scrape a single metric from a single PaaS resource to analyze the behavior and don't see broken scrape points in that scenario, whereas when configuring the scraper to collect multiple metrics from multiple PaaS resources, the scrape points are discontinuous and the issue is seen.

+ So we plotted multiple metrics (CPU percent and memory percent), and it looks fine as below:

(screenshot 3)

+ Also, since you mentioned that Promitor uses the REST API to fetch metrics, we tried hitting the Azure Monitor API via https://docs.microsoft.com/en-us/rest/api/monitor/metrics/list?tabs=HTTP&tryIt=true&source=docs#code-try-0, which gives the expected output.

+ We also confirmed from the back end that there are no broken lines in the graph:

(screenshot 4)

tomkerkhove commented 2 years ago

Hm, this is hard to figure out, given that there are Promitor end-users scraping hundreds of resources 🤔