grafana / timestream-datasource

Amazon Timestream in Grafana
https://grafana.com/grafana/plugins/grafana-timestream-datasource
Apache License 2.0

Getting 503 on Plugin setup #268

Closed lesh3000 closed 9 months ago

lesh3000 commented 9 months ago

Hi, I am following the instructions to set up the plugin on an AWS account. However, I get a "plugin is not available" error when I try to select the database. I added Timestream permissions to the Grafana Fargate task. Plugin version: 2.8.0

171aldash commented 9 months ago

I am seeing the same error in my Grafana instance.

Grafana version: OSS 9.3.0
Timestream plugin version: 2.7.1 & 2.8.0 verified as broken

The data source is provisioned like the following:


deleteDatasources:
  - name: Timestream Telegraf
    orgId: 1

datasources:
  - name: Timestream Telegraf
    type: grafana-timestream-datasource
    access: proxy
    orgId: 1
    jsonData:
      authType: default
      defaultRegion: us-east-1
      endpoint: https://query-cell2.timestream.us-east-1.amazonaws.com
      database: monitoring
      table: cpu
      measure: usage_user
    readOnly: false

Grafana runs on an EC2-backed RKE cluster and assumes the IAM role of the EC2 RKE worker node the pod resides on. The IAM role that is assumed has access to Timestream and sts:AssumeRole for the instance is allowed.

As can be seen above, I specify a custom VPC endpoint. I have verified that DNS resolution works from the container shell. Additionally, I have run commands from the container shell to query Timestream via the AWS CLI v2. That said, this may not reveal much, since Grafana uses the AWS SDK for Go rather than the AWS CLI.
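The DNS and reachability checks described above could also be scripted so they are repeatable from inside the container. The following is a minimal sketch using only the Python standard library; the function names `endpoint_host_port` and `check_endpoint` are hypothetical, and the check only confirms name resolution and a TCP connection within a timeout. It says nothing about how the AWS SDK for Go behaves against the endpoint.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host_port(endpoint: str) -> tuple[str, int]:
    """Split an endpoint URL into (host, port); default to 443 for https, 80 otherwise."""
    parts = urlsplit(endpoint)
    if not parts.hostname:
        raise ValueError(f"no host in endpoint: {endpoint!r}")
    default_port = 443 if parts.scheme == "https" else 80
    return parts.hostname, parts.port or default_port


def check_endpoint(endpoint: str, timeout: float = 5.0) -> dict:
    """Resolve the endpoint host and try a plain TCP connection to it.

    Returns a dict with the resolved addresses, whether the TCP connect
    succeeded, and the error text if resolution or connection failed.
    """
    host, port = endpoint_host_port(endpoint)
    result = {"host": host, "port": port, "addresses": [], "tcp_ok": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["addresses"] = sorted({info[4][0] for info in infos})
        with socket.create_connection((host, port), timeout=timeout):
            result["tcp_ok"] = True
    except OSError as exc:  # covers DNS failures, refusals, and timeouts
        result["error"] = str(exc)
    return result
```

Calling `check_endpoint("https://query-cell2.timestream.us-east-1.amazonaws.com")` from the Grafana container would show whether the endpoint resolves and accepts connections on port 443 before the plugin is involved at all.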

I have set the Grafana logs to debug, but not much has been revealed as to what is going wrong. I have observed that an AWS session is successfully created using the AWS credentials chain so there is nothing wrong with assuming the role. See logs below:

logger=secrets.kvstore level=debug msg="got secret value" orgId=1 type=datasource namespace="Timestream Telegraf"
logger=plugin.grafana-timestream-datasource level=debug msg="Authenticating towards AWS with default SDK method" region=us-east-1
logger=plugin.grafana-timestream-datasource level=debug msg="Successfully created AWS session"
logger=ngalert.scheduler level=debug msg="No changes detected. Skip updating"
logger=secrets level=debug msg="Removing expired data keys from cache..."
logger=secrets level=debug msg="Removing expired data keys from cache finished successfully"
logger=ngalert.state.manager level=debug msg="Recording state cache metrics" now=2023-11-16T13:26:13.650504741Z
logger=ngalert.sender.router level=debug msg="Attempting to sync admin configs" count=0
logger=ngalert.sender.router level=debug msg="Finish of admin configuration sync"
logger=ngalert.multiorg.alertmanager level=debug msg="synchronizing Alertmanagers for orgs"
logger=alertmanager org=1 level=debug msg="neither config nor template have changed, skipping configuration sync."
logger=ngalert.multiorg.alertmanager level=debug msg="done synchronizing Alertmanagers for orgs"
logger=provisioning.dashboard type=file name=default level=debug msg="Start walking disk" path=/etc/grafana/provisioning/dashboards
logger=ngalert.scheduler level=debug msg="No changes detected. Skip updating"
logger=ngalert.state.manager level=debug msg="Recording state cache metrics" now=2023-11-16T13:26:28.650386713Z
logger=ngalert.scheduler level=debug msg="No changes detected. Skip updating"
logger=ngalert.scheduler level=debug msg="No changes detected. Skip updating"
logger=ngalert.state.manager level=debug msg="Recording state cache metrics" now=2023-11-16T13:26:43.650140768Z
logger=provisioning.dashboard type=file name=default level=debug msg="Start walking disk" path=/etc/grafana/provisioning/dashboards
logger=ngalert.scheduler level=debug msg="No changes detected. Skip updating"
logger=ngalert.state.manager level=debug msg="Recording state cache metrics" now=2023-11-16T13:26:58.649975614Z
logger=ngalert.scheduler level=debug msg="No changes detected. Skip updating"
logger=context userId=2 orgId=1 uname="<<RETRACTED>>" level=error msg="Plugin health check failed" error="failed to check plugin health: health check failed" remote_addr=<<RETRACTED>> traceID=
logger=context userId=2 orgId=1 uname="<<RETRACTED>>" level=error msg="Request Completed" method=GET path=/api/datasources/53/health status=500 remote_addr=<<RETRACTED>> time_ms=60000 duration=1m0.000704608s size=53 referer=https://<<RETRACTED>>/datasources/edit/P14CCD1D3897504E9 handler=/api/datasources/:id/health

The error entries for the data source are at the bottom of the log output above.
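With debug logging enabled, the two error entries are easy to miss among the scheduler and alerting noise. As a rough sketch, the snippet below pulls out only the `level=error` entries. It assumes the log is available with one entry per line in `key=value` form, with double-quoted values and a space before each key. The helper names `parse_logfmt` and `error_entries` are hypothetical.

```python
import shlex


def parse_logfmt(line: str) -> dict:
    """Parse one key=value log line into a dict.

    shlex.split keeps double-quoted values together and strips the quotes;
    it raises ValueError on a line with an unbalanced quote.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore stray tokens that are not key=value pairs
            fields[key] = value
    return fields


def error_entries(lines):
    """Return the parsed entries whose level field is exactly 'error'."""
    entries = (parse_logfmt(line) for line in lines if line.strip())
    return [entry for entry in entries if entry.get("level") == "error"]
```

Applied to the output above, this would leave the "Plugin health check failed" and "Request Completed" entries, the latter showing `status=500` after `time_ms=60000`.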

lesh3000 commented 9 months ago

Hi @171aldash, Grafana probably needs to be restarted after the plugin installation. I host mine on Fargate and deploy with Docker. Once I added the plugin to the GF_INSTALL_PLUGINS environment variable, it worked as expected.

171aldash commented 9 months ago

Plugins are installed when I build the image locally and push it to our private Docker repository. I have never needed to restart Grafana after deploying it on the RKE cluster for any other plugin, so I am not sure that is the fix.

UPDATE: Additionally, when deploying Grafana it gets a fresh container so the restart occurs then.

kevinwcyu commented 9 months ago

Hi @171aldash

2.7.1 & 2.8.0 verified as broken

Can you let me know which version previously worked for you?

171aldash commented 9 months ago

Hi @kevinwcyu. This is for a new project so I haven't tried with any other versions. I will try out a few older releases this morning and report back here how they go.

171aldash commented 9 months ago

I couldn't get the plugin to work on any of the 6 versions I tried. I have chosen to migrate over to AWS Managed Grafana instead.

kevinwcyu commented 9 months ago

Closing as it is no longer an issue or a different path has been taken to work around it.