Open katebrenner opened 6 months ago
Thank you for reporting this here @katebrenner.
We are still experiencing the issue.
Hi @dcram ! I investigated this, and it seems like a lot of this is related to Athena behavior. The “HTTP 504 Gateway Timeout” comes from AWS's load balancing, and I found these docs from AMG about resolving it: https://repost.aws/knowledge-center/grafana-504-timeout-vpc. There is also information about how to tune Athena data and queries to improve the response time: https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html. My understanding of the docs is that when queries are initially run, Athena needs to assign resources, which is why they’re slow for the first query but improve afterwards. I can look into retrying on the Gateway Timeout, but that won’t fix the underlying issue of the queries initially taking a long time.
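The retry idea could be sketched roughly like this. This is illustrative only, not the plugin's actual code: `run_query` is a hypothetical callable, and the `TimeoutError` stands in for an HTTP 504 surfaced by the load balancer while Athena warms up.

```python
import time

def fetch_with_retry(run_query, max_attempts=3, base_delay=2.0):
    """Retry a query callable that may fail with a gateway timeout
    while Athena assigns resources; back off exponentially between
    attempts and re-raise on the final failure."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except TimeoutError:  # stand-in for an HTTP 504 from the LB
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

As noted, this only papers over the symptom; the first query is still slow until Athena has warmed up.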
We experience something similar, but I'm not sure the links mentioned above describe our issue. For us it depends on which authentication mechanism we use.
Our setup: Grafana deployed as part of the Prometheus stack on a Kubernetes cluster that is set up with kOps, with an IAM role on the workers carrying a policy to access Athena/S3 in another AWS account.
Below is the part of our Helm chart that configures the above:
```yaml
grafana:
  plugins:
    - grafana-athena-datasource
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Athena
          type: grafana-athena-datasource
          jsonData:
            authType: ec2_iam_role
            assumeRoleArn: arn:aws:iam::xxxxxxxxxxxxxx:role/yyyyyyyyyyyy
            defaultRegion: eu-west-1
            catalog: AwsDataCatalog
            database: 'ourdatabase'
            workgroup: 'primary'
            outputLocation: s3://aws-athena-query-results-xxxxxxxxxxxxxxx-eu-west-1/ourlocation
  "grafana.ini":
    aws:
      allowed_auth_providers: default,keys,credentials,ec2_iam_role
```
When creating a datasource manually in Grafana it always works with (access and secret) keys. But when manually creating a datasource using the workspace IAM role (`ec2_iam_role` provider) it is impossible to get it working. It looks like this timeout issue is worsened by the order in which values are entered.
So the IAM role only works when the datasource is created via automation, and even then with this timeout issue. No issues at all when using the keys provider.
It also does not matter whether we fill in the "Assume Role ARN" field.
And once the datasource is working we have never experienced issues with actual queries. Note that we are still trying a first Athena setup; not sure if/when cached datasource connections will expire.
We have also tried this with different versions of the Prometheus stack, and thus with Grafana 9, 10 and 11. Although the Grafana UI experience differs, the timeouts exist in all versions.
@aligthart it's very interesting that you get a 500 when using `ec2_iam_role`. Do you hit the same issues if you use `default` as your auth provider? I believe `default` should also pick up on credentials that are on an EC2 instance.
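For reference, the reason `default` should pick up the instance role is the SDK's provider chain, which tries sources in a fixed order. A simplified sketch of that order (not the SDK's actual code; the function and flags are illustrative):

```python
import os

def resolve_credentials(env=None, has_shared_file=False, on_ec2=False):
    """Simplified sketch of the AWS SDK 'default' provider chain:
    environment variables first, then the shared credentials file,
    then EC2 instance-profile credentials fetched from IMDS."""
    env = env if env is not None else os.environ
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment"
    if has_shared_file:
        return "shared_credentials_file"
    if on_ec2:
        return "ec2_instance_metadata"
    return None  # no credentials found anywhere in the chain
```

So on a worker node with only an instance profile, `default` and `ec2_iam_role` should end up using the same credentials.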
I started with the `default` provider, but then I ran into the same problems. For that reason I explicitly enabled the `ec2_iam_role` provider and started using the providers explicitly (`keys` and `ec2_iam_role`) to have better control over what was used.
Though it's not in my Helm config above, I also enabled debug mode for the logging, and I had a look at the GitHub code of the plugin itself.
I noticed some strange code in the plugin when using `ec2_iam_role`, where it first does an auth request to a hardcoded US region and only later does the real auth request. Not the expert here though...
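If I understood the observation above correctly, the pattern would look something like the following. This is purely a hypothetical sketch of the described behavior, not the plugin's actual code; the function name and endpoints are illustrative:

```python
def assume_role_two_step(sts_call, role_arn, region):
    """Illustrative only: first hit a fixed us-east-1 STS endpoint,
    then repeat the call against the caller's actual region, as the
    comment above describes the plugin doing."""
    endpoints = [
        "https://sts.us-east-1.amazonaws.com",  # hardcoded US region
        f"https://sts.{region}.amazonaws.com",  # real regional request
    ]
    creds = None
    for url in endpoints:
        creds = sts_call(url, role_arn)  # last (regional) result wins
    return creds
```

If the first, US-region call has to time out or be retried from an EU cluster, that could plausibly add latency to the initial connection, but that is speculation.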
Also, there are no access and secret keys on our Kubernetes worker nodes, so the `keys` auth provider would never work in our real setup; I only used it for testing/debug purposes. Ultimately we are only interested in a setup with a working IAM role while assuming an ARN pointing to another account.
@aligthart I tried spinning up an ec2 instance with grafana 9 and athena 2.17.1, enabling both default and ec2_iam_role, and both auth methods worked for me. So I'm not sure what to make of this.
Do you have more information about when you see these 500s? Do they happen for you when you load the datasource configuration page or when you save the datasource configuration details? Do they happen on the Explore page or in dashboards? Did you look into the vpc help page that @iwysiu linked?
Hi,
Sorry for not being very responsive....
The errors appear on the Grafana page where you manually create a datasource, when using the `ec2_iam_role` provider. I can configure the assume role and default region, but as soon as I try to enter any of the Athena details (datasource, database, workgroup) things go wrong: the UI hangs and shows an exception dialog after 1 minute. Gateway timeout.
This does not happen when using `keys`; then all works fine.
To work around this manual datasource creation, I automated the config (as in the Helm values above).
Then the datasource is properly created (with all the settings I want) and eventually (not sure why not instantly) it starts working.
Yes, I did look at the VPC help page, but I do not want to go there right now. I don't think that is the solution for our problem (I do not see how that page would explain different behavior based on credential provider).
Maybe one more thing to add on our setup. Our Grafana sits behind an nginx ingress controller which we expose via an AWS NLB.
So this is also in our grafana.ini
```yaml
grafana.ini: |
  [server]
  domain = our.domain.com
  root_url = https://our.domain.com/grafana
```
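Since the 504 surfaces through the nginx ingress in front of Grafana, one thing worth checking is whether the ingress upstream timeouts are shorter than the first slow Athena call. A sketch using standard ingress-nginx annotations (resource name and values are illustrative, not our actual config):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  annotations:
    # ingress-nginx upstream timeouts, in seconds; the default read
    # timeout of 60s would match the ~1 minute hang described above
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "180"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "180"
```

The NLB idle timeout in front of nginx would need to be at least as long for this to help.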
In developer tools I see the grafana datasource doing calls to this endpoint.
Hi @aligthart ! Based on the fact that our default session duration is 15m, and it sounds like you’re able to connect for a full day after it errors in the morning, I’m not sure the issue is in the datasource plugin. We should be expiring sessions every 15 minutes, and they’ll attempt to connect with the same settings every time, so I would expect it to fail every 15m instead of every morning if it was a datasource plugin problem.
Both our Prometheus and Athena datasource plugins use the same authentication code, so we can’t use that to determine anything, but we may be able to use the AWS CLI to test where the error is coming from. Can you try configuring the AWS CLI with your IAM role and running `aws athena list-data-catalogs`? If that gets a timeout, then we know the issue is coming from AWS.
If that doesn’t error, getting the Grafana logs may give us a better idea of what’s happening. If you configure grafana with log level debug and get the logs from the time of the timeouts that could help us reproduce/debug this.
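One way to wire the IAM role into the CLI from a worker node is an assume-role profile; a sketch (the profile name is made up, and the role ARN placeholder is copied from the Helm values above):

```ini
# ~/.aws/config -- assume the cross-account role using the instance profile
[profile athena-test]
region = eu-west-1
role_arn = arn:aws:iam::xxxxxxxxxxxxxx:role/yyyyyyyyyyyy
credential_source = Ec2InstanceMetadata
```

Then run `aws athena list-data-catalogs --profile athena-test` and see whether it hangs the same way.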
What happened: Users are reporting slow Athena dashboard loading on first load. "After 5-10 minutes of manual reloads and several 504 Gateway Timeout errors, we finally get all our dashboards working fine for the rest of the day." (https://github.com/grafana/grafana/issues/71946#issuecomment-1968494233 and https://github.com/grafana/athena-datasource/issues/99#issuecomment-1866744050)
What you expected to happen: Not this; dashboards should load on the first attempt without 504 Gateway Timeout errors.