moxious / triage

testing triage actions for issues

Azure Monitor: queries return "Client.Timeout exceeded while awaiting headers" #248

Open tonypowa opened 6 months ago

tonypowa commented 6 months ago

What happened?

We have a TV showing a dashboard that contains 10 panels, for a total of 15 queries.

Sometimes, queries from Azure Monitor return status code 504 with the error "Client.Timeout exceeded while awaiting headers".

During these occurrences, we are unable to use Grafana to monitor our applications and infrastructure for anywhere from 5 to 10 minutes.

The problem also affects every other dashboard that uses the Azure Monitor datasource.

We tried increasing the timeout setting to 60s, without success.

It is possible to reproduce the problem by clicking the dashboard's refresh button multiple times.

Grafana runs on an Azure App Service, using the container image from Docker Hub.

What did you expect to happen?

Data should be available without any downtime.

Did this work before?

It worked when we started using Grafana, at version 8.3.7. The problem has gradually become worse since then.

How do we reproduce it?

  1. Create a dashboard that contains multiple panels with the Azure Monitor datasource

  2. Repeatedly click the refresh button of the dashboard

  3. Panels load normally at first, then fail with error 504 "Client.Timeout exceeded while awaiting headers"
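The refresh-spamming step above can be approximated outside the browser. What follows is only a minimal sketch, not Grafana's own behavior: it fires bursts of concurrent HTTP GETs with a client-side timeout and counts how many fail. The function names `timed_get` and `burst` are hypothetical, and the target URL, any authentication, and the burst size are assumptions that are not part of this report.

```python
import concurrent.futures
import http.client
import time
import urllib.request


def timed_get(url, timeout_s):
    """Issue one GET and return (ok, elapsed_seconds).

    ok is False when the request times out, the connection fails,
    or the server answers with an HTTP error status such as 504.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            resp.read()
        ok = True
    except (OSError, http.client.HTTPException):
        # Timeouts, URLError and HTTPError are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start


def burst(url, requests_per_burst, timeout_s):
    """Fire one burst of concurrent GETs, roughly like one dashboard
    refresh that triggers several queries at once.

    Returns (number_of_failures, slowest_request_seconds).
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=requests_per_burst) as pool:
        results = list(pool.map(lambda _: timed_get(url, timeout_s),
                                range(requests_per_burst)))
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max(elapsed for _, elapsed in results)
    return failures, slowest
```

Calling `burst(url, 15, 60.0)` several times in a row would loosely mirror the 15 queries and 60s timeout mentioned above. A rising failure count across bursts would suggest the slowdown is reproducible under load rather than tied to a particular panel.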

Is the bug inside a dashboard panel?

No response

Environment (with versions)?

Grafana: 10.2.2

OS: Windows / Linux

Browser: Opera, Chrome

Grafana platform?

Docker

Datasource(s)?

Azure Monitor

tonypowa commented 6 months ago

This issue is a test copy of an issue in another repo. Original issue: https://github.com/grafana/grafana/issues/#80536

moxious commented 6 months ago

Summary: User tonypowa reports an issue with Azure Monitor queries timing out with a status code 504 on Grafana dashboards, causing a 5-10 minute downtime. Increasing timeout settings hasn't resolved the issue, which also affects other dashboards using the Azure Monitor datasource.

moxious commented 6 months ago

Hi @tonypowa, this issue appears to be specific to the Azure Monitor datasource in Grafana. It should therefore likely be routed to the Azure Datasources project team for investigation and resolution.

moxious commented 6 months ago

Elaboration:

Hi @tonypowa, thank you for providing details about the timeout issues you're experiencing with Azure Monitor queries in Grafana. To help you better, we need a bit more information. Could you please answer the following questions in as much detail as possible? This will help us understand the scope and scale of the issue, allowing for a more precise diagnosis and assistance.

  1. Frequency of Error: How often are you facing this issue? Is it happening intermittently or consistently?

  2. Query Details: Can you provide a sample of the query or queries being run when the 504 error occurs? Please make sure to redact any sensitive information.

  3. Network Latency: Have you checked the network latency between your Grafana instance and Azure Monitor during the time these errors occur?

  4. Grafana Logs: Are there any relevant logs from Grafana around the time the timeouts occur? These might help identify any potential bottlenecks or errors in the system.

  5. Azure Service Health: Have you looked at Azure Service Health or other monitoring to check if there are any known issues with Azure Monitor services themselves during the times you're seeing the 504 errors?

  6. Concurrent Usage: How many users are typically using the dashboards when these errors occur?

  7. Resource Availability: Could you provide information about the resources allocated to the AppService running Grafana? This might be related to the application’s ability to handle the load.

  8. Azure Monitor Metrics: Have you observed any patterns in Azure Monitor metrics, such as an increase in API call latency or errors, that correlate with the timeouts you're encountering?

  9. Grafana Updates: Did the behavior change noticeably with any particular Grafana version between 8.3.7 and the 10.2.2 you are running now?

  10. Screenshots or Videos: Is it possible to include a screenshot or video of what happens when the error occurs? Visual context always helps.

Once you've provided the information above, we can investigate this further. Thank you for your cooperation and for being part of the community!
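For question 3 (network latency), one rough way to collect numbers is to time repeated TCP handshakes from the machine running Grafana. This is only a sketch under assumptions: the helper names `tcp_connect_ms`, `summarize`, and `probe` are hypothetical, the host to probe is not stated in this issue, and a TCP connect time measures network round trips, not how long Azure Monitor takes to answer a query.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout_s=5.0):
    """Time one TCP handshake in milliseconds.

    Returns None if the connection fails or times out.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None


def summarize(samples_ms):
    """Summarize latency samples; None entries count as failures."""
    ok = sorted(s for s in samples_ms if s is not None)
    summary = {"sent": len(samples_ms), "failed": len(samples_ms) - len(ok)}
    if ok:
        summary["min_ms"] = ok[0]
        summary["median_ms"] = statistics.median(ok)
        summary["max_ms"] = ok[-1]
    return summary


def probe(host, port=443, count=10, pause_s=1.0):
    """Take `count` handshake samples, pausing between them, and summarize."""
    samples = []
    for _ in range(count):
        samples.append(tcp_connect_ms(host, port))
        time.sleep(pause_s)
    return summarize(samples)
```

A call such as `probe("management.azure.com")` (assuming the Azure Resource Manager endpoint is the relevant host for your queries) could be run once while the dashboards work and once during a timeout window, then both summaries attached to this issue.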