grafana / timestream-datasource

Amazon Timestream in Grafana
https://grafana.com/grafana/plugins/grafana-timestream-datasource
Apache License 2.0

Pagination problem for HA grafana deployment #58

Closed dmazhar-cogniance closed 3 years ago

dmazhar-cogniance commented 3 years ago

Hi all. We have a Grafana deployment with 2 instances and a PostgreSQL database for storage. The Grafana session is stored in the DB too. Trying to get a paginated response from AWS Timestream results in an error: ValidationException: Invalid pagination token.

Reducing the number of Grafana instances to 1 or enabling sticky sessions on the load balancer resolves the problem, so I think the cause is that pagination requests are hitting different Grafana instances. The instances use different source IPs to reach the AWS API, in case that matters. Please let me know if I should provide more details; I can reproduce the problem.

Grafana version: 7.3.7
Timestream datasource plugin version: 1.1.2
Setup: official Docker Hub Grafana containers running in AWS ECS, 2 instances behind an Application Load Balancer. PostgreSQL is used as the Grafana DB.

dmazhar-cogniance commented 3 years ago

It looks like the root cause is not the different source IPs, but different IAM credentials. Our deployment uses IAM Roles for Tasks: the two Grafana instances use the same IAM role, but each one gets its own temporary credentials from STS, and this is where things break :(
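To make the suspected failure mode concrete, here is a minimal toy model in Python. It is not the Timestream API: `ToyQueryService`, the session names, and the rule that a token is only valid for the session that received it are all assumptions, based only on the behaviour observed in this thread.

```python
import secrets


class InvalidPaginationToken(Exception):
    """Stands in for the ValidationException reported above."""


class ToyQueryService:
    """Hypothetical paginated service that ties each token to the caller's session."""

    def __init__(self, rows, page_size=2):
        self.rows = rows
        self.page_size = page_size
        self._tokens = {}  # token -> (session that received it, next offset)

    def query(self, session_id, next_token=None):
        offset = 0
        if next_token is not None:
            owner, offset = self._tokens.get(next_token, (None, 0))
            # Assumption: a token is rejected when presented by another session.
            if owner != session_id:
                raise InvalidPaginationToken("Invalid pagination token")
        page = self.rows[offset:offset + self.page_size]
        new_offset = offset + self.page_size
        token = None
        if new_offset < len(self.rows):
            token = secrets.token_hex(8)
            self._tokens[token] = (session_id, new_offset)
        return page, token


service = ToyQueryService(rows=[1, 2, 3, 4, 5])

# Instance A and instance B use the same role but hold different STS sessions.
first_page, token = service.query("sts-session-A")

# The follow-up request lands on the same instance: accepted.
second_page, _ = service.query("sts-session-A", token)

# The load balancer routes the follow-up to the other instance: rejected.
try:
    service.query("sts-session-B", token)
    cross_instance_ok = True
except InvalidPaginationToken:
    cross_instance_ok = False
```

In this model, the second page can only be fetched by the instance that fetched the first one, which matches what we see when the load balancer spreads requests across both instances.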

ryantxu commented 3 years ago

For HA deployments like this, you will need to make sure that everything in the cluster is configured the same. Let us know if there is something more concrete we should dig into.

dmazhar-cogniance commented 3 years ago

@ryantxu, the issue here is that this is the setup you would usually use on AWS. Example for the ECS:

I see two options to work around this:

  1. Use IAM access/secret keys instead of an IAM role, which is not good practice
  2. Use sticky sessions on the ALB, so all pagination requests from a user hit the same Grafana instance. We're using this as a workaround right now, but sticky sessions are smelly...

As I see it, the issue here is that an AWS pagination token ends up being used across different Grafana instances.
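One way to keep a token from ever crossing instances would be to follow it to the end inside a single backend request. Below is a rough Python sketch of that idea, not how the plugin is implemented. `fetch_all_rows` and `fake_query_page` are hypothetical names; the `QueryString`, `NextToken`, and `Rows` fields follow the Timestream Query API, and with boto3 the callable passed in could be `boto3.client("timestream-query").query`.

```python
def fetch_all_rows(query_page, query_string):
    """Follow NextToken until it runs out, all from the calling process."""
    rows = []
    next_token = None
    while True:
        kwargs = {"QueryString": query_string}
        if next_token:
            kwargs["NextToken"] = next_token
        response = query_page(**kwargs)
        rows.extend(response.get("Rows", []))
        next_token = response.get("NextToken")
        if not next_token:
            return rows


# Hypothetical stand-in for a real client call, so the sketch runs offline.
def fake_query_page(QueryString, NextToken=None):
    pages = {
        None: {"Rows": ["r1", "r2"], "NextToken": "t1"},
        "t1": {"Rows": ["r3"], "NextToken": "t2"},
        "t2": {"Rows": ["r4"]},
    }
    return pages[NextToken]


all_rows = fetch_all_rows(fake_query_page, "SELECT 1")
```

The trade-off is that one request now carries the whole result set, so this only makes sense for queries whose full output fits comfortably in memory and in a single response.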