grafana / athena-datasource


Athena query timeout - 504 #319

Open katebrenner opened 6 months ago

katebrenner commented 6 months ago

What happened: Users are reporting slow Athena dashboard loading on the first load. "After 5-10 minutes of manual reloads and several 504 Gateway Timeout errors, we finally get all our dashboards working fine for the rest of the day." (https://github.com/grafana/grafana/issues/71946#issuecomment-1968494233 and https://github.com/grafana/athena-datasource/issues/99#issuecomment-1866744050)

What you expected to happen: Not this........

dcram commented 6 months ago

Thank you for reporting this here @katebrenner.

We are still experiencing the issue.

iwysiu commented 4 months ago

Hi @dcram ! I investigated this, and it seems like a lot of this is related to Athena behavior. The “HTTP 504 Gateway Timeout” comes from AWS's load balancing, and I found these docs from AMG about resolving it: https://repost.aws/knowledge-center/grafana-504-timeout-vpc. There is also information about how to tune Athena data and queries to improve the response time: https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html. My understanding of the docs is that when the queries are initially run, Athena needs to assign resources, which is why they’re slow for the first query, but improve afterwards. I can look into retrying on the Gateway Timeout, but that won’t fix the underlying issue of the queries initially taking a long time.
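
On self-hosted Grafana (not Amazon Managed Grafana), Grafana's own data proxy timeout is another place a gateway timeout can come from: it defaults to 30 seconds, so a slow first Athena query can time out at Grafana before Athena ever responds. A minimal grafana.ini sketch of raising it, offered as something to rule out rather than a confirmed fix for this issue:

  [dataproxy]
  # Grafana-side timeout for datasource requests, in seconds (default 30).
  # Raising it only gives slow first Athena queries more time; it does not make them faster.
  timeout = 120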

aligthart commented 2 months ago

We experience something similar, but we are not sure if the links mentioned above describe our issue. For us it depends on which authentication mechanism we use.

Our setup: Grafana deployed as part of the Prometheus stack on a Kubernetes cluster that is set up with kOps, with an IAM role on the workers and a policy to access Athena/S3 in another AWS account.

Below is the part of our Helm chart that configures the above:

grafana:
  plugins:
    - grafana-athena-datasource
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Athena
          type: grafana-athena-datasource
          jsonData:
            authType: ec2_iam_role
            assumeRoleArn: arn:aws:iam::xxxxxxxxxxxxxx:role/yyyyyyyyyyyy
            defaultRegion: eu-west-1
            catalog: AwsDataCatalog
            database: 'ourdatabase'
            workgroup: 'primary'
            outputLocation: s3://aws-athena-query-results-xxxxxxxxxxxxxxx-eu-west-1/ourlocation
  "grafana.ini":
    aws:
      allowed_auth_providers: default,keys,credentials,ec2_iam_role

When creating a datasource manually in Grafana it always works with (access and secret) keys. But when manually creating a datasource using the workspace IAM role (ec2_iam_role provider), it is impossible to get it working. It looks like this timeout issue is worsened by the order in which values are entered.

So the iam_role provider only works when the datasource is created via automation, but still with this timeout issue. No issues at all when using the keys provider.

Also it does not matter if we fill in the "Assume Role ARN" field.

And once the datasource is working we have never experienced issues with actual queries. Note that we are still trying a first Athena setup, so we are not sure if/when cached datasource connections will expire.

We have also tried this with different versions of the Prometheus stack, and thus with Grafana 9, 10 and 11. Although the Grafana UI experience differs, the timeouts exist in all versions.

sarahzinger commented 2 months ago

@aligthart it's very interesting that you get a 500 when using ec2_iam_role. Do you hit the same issues if you use default as your auth provider? I believe default should also pick up on credentials that are on an ec2 instance.
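
For reference, a minimal provisioning sketch of what that suggestion would look like, reusing the Helm values layout from the earlier comment (the datasource name here is made up; the other values are just the ones already shown and may need adjusting):

        - name: Athena (default auth)
          type: grafana-athena-datasource
          jsonData:
            # fall back to the default AWS credential chain on the node
            authType: default
            assumeRoleArn: arn:aws:iam::xxxxxxxxxxxxxx:role/yyyyyyyyyyyy
            defaultRegion: eu-west-1
            catalog: AwsDataCatalog
            database: 'ourdatabase'
            workgroup: 'primary'
            outputLocation: s3://aws-athena-query-results-xxxxxxxxxxxxxxx-eu-west-1/ourlocation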

aligthart commented 2 months ago

I started with the default provider but ran into the same problems. For that reason I explicitly enabled the ec2_iam_role provider and started using the providers explicitly (keys and ec2_iam_role) to have better control over what was used.

Though it is not in my Helm config above, I also enabled debug mode for the logging. I also had a look at the GitHub code of the plugin itself.

I noticed some strange code in the plugin when using the ec2_iam_role provider, where it first does an auth request to a hardcoded US region and only later does the real auth request. Not the expert here though...

And there are no access and secret keys on our Kubernetes worker nodes, so the "keys" auth provider would never work for us. I only used it for testing/debugging purposes. Eventually we are only interested in a setup with a working IAM role while assuming an ARN pointing to another account.

sarahzinger commented 2 months ago

@aligthart I tried spinning up an ec2 instance with grafana 9 and athena 2.17.1, enabling both default and ec2_iam_role, and both auth methods worked for me. So I'm not sure what to make of this.

Do you have more information about when you see these 500s? Do they happen for you when you load the datasource configuration page or when you save the datasource configuration details? Do they happen on the Explore page or in dashboards? Did you look into the vpc help page that @iwysiu linked?

aligthart commented 2 months ago

Hi,

Sorry for not being very responsive....

The errors appear on the Grafana page where you manually create a datasource, when using the ec2_iam_role provider. I can configure the assume role and default region, but as soon as I try to enter any of the Athena details (datasource, database, workgroup) things go wrong: the UI hangs and shows an exception dialog after 1 minute. Gateway timeout.

This does not happen when using "keys". Then all works fine.

To work around this manual datasource creation problem, I automated the config in the grafana.ini.

Then the datasource is properly created (with all the settings I want) and eventually (not sure why not instantly) it starts working.

Yes, I did look at the VPC help page, but I do not want to go there right now. I don't think that is the solution for our problem (I do not see how that page would explain different behavior based on the credential provider).

Maybe one more thing to add about our setup: our Grafana sits behind an nginx ingress controller, which we expose via an AWS NLB.

So this is also in our grafana.ini

  grafana.ini: |
    [server]
    domain = our.domain.com
    root_url = https://our.domain.com/grafana

In the browser developer tools I see the Grafana datasource making calls to this endpoint.
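
Given that path (NLB in front of an nginx ingress controller in front of Grafana) and a timeout that appears after roughly a minute, the ingress's own proxy timeout (60 seconds by default in ingress-nginx) is another candidate for where the 504 originates. A hedged sketch of raising it, assuming the Grafana Ingress is defined through the chart's ingress values; the 120-second figure is illustrative:

  grafana:
    ingress:
      annotations:
        # how long nginx waits for Grafana to answer before returning a 504
        nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
        nginx.ingress.kubernetes.io/proxy-send-timeout: "120"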

iwysiu commented 2 months ago

Hi @aligthart ! Based on the fact that our default session duration is 15m, and it sounds like you’re able to connect for a full day after it errors in the morning, I’m not sure the issue is in the datasource plugin. We should be expiring sessions every 15 minutes, and they’ll attempt to connect with the same settings every time, so I would expect it to fail every 15m instead of every morning if it was a datasource plugin problem.

Both our prometheus and Athena datasource plugins use the same authentication code, so we can’t use that to determine anything, but we may be able to use the AWS CLI to test where the error is coming from. Can you try configuring the AWS CLI with your IAM role and running the command aws athena list-data-catalogs? If that gets a timeout, then we know the issue is coming from AWS.
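
One way to make that CLI test mirror the datasource's setup (instance role as source credentials, then assuming the cross-account role) is a profile in ~/.aws/config; the profile name below is made up, and the ARN is the masked one from earlier in the thread:

  [profile athena-test]
  # use the EC2 instance's IAM role as source credentials
  credential_source = Ec2InstanceMetadata
  # then assume the cross-account role the datasource is configured with
  role_arn = arn:aws:iam::xxxxxxxxxxxxxx:role/yyyyyyyyyyyy
  region = eu-west-1

Running aws athena list-data-catalogs --profile athena-test then exercises roughly the same credential path as the ec2_iam_role provider with an assume-role ARN.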

If that doesn’t error, getting the Grafana logs may give us a better idea of what’s happening. If you configure grafana with log level debug and get the logs from the time of the timeouts that could help us reproduce/debug this.
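
For reference, a minimal sketch of enabling that in the same Helm values layout used earlier in the thread (this only touches the [log] section of grafana.ini; everything else stays as already shown):

  grafana:
    "grafana.ini":
      log:
        # debug-level logs include the datasource's auth and query activity
        level: debug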