usegalaxy-eu / infrastructure-playbook

Ansible playbook for managing UseGalaxy.eu infrastructure.
MIT License

Update Grafana to version 11 and provision dashboards from usegalaxy-eu/grafana-dashboards #1235

Closed kysrpex closed 1 month ago

kysrpex commented 2 months ago

Update Grafana to version 11 and switch from deprecated role cloudalchemy.grafana to the official Grafana role from the grafana.grafana collection. Update usegalaxy_eu.grafana_matrix_forwarder. Instead of disabling firewalld on the Grafana host, open the nginx ports.

Before merging this PR, the Grafana database must be migrated from SQLite to PostgreSQL. This is rather simple using pgloader and following this guide from Jamie Ly. The Postgres database information can be found here. It can be accessed from stats, sn06 and maintenance (see usegalaxy-eu/infrastructure-playbook#1235).

Migration in a nutshell:

  1. Make sure the grafana database exists in Postgres and the grafana user has all privileges on it. You may run reset.sql as postgres to do this ( :warning: this drops the grafana database).
  2. Run schema.sql as grafana to create all tables Grafana v9.2.10 (c37dcaf0da) needs. Grafana v11 will run database migrations on top of it.
  3. Install pgloader. Luckily, there are packages available in the Postgres community repository. To enable the repo, you just need the right repository package. You may (and should) delete the repository package once you have completed this migration.
  4. Open grafana.load and adjust the path to /data/monitoring/grafana_data/grafana.db (if running the migration steps on stats.galaxyproject.eu) and the PostgreSQL connection string if needed.
  5. Run pgloader grafana.load. Warnings and even errors are expected and are not an issue.
  6. Merge this PR and run the grafana.yml playbook.
  7. Enable the stats-grafana Jenkins project.
  8. Some dashboards (the provisioned ones) will be in the wrong folder. Move them to the correct folder (see pictures below). All the remaining dashboards that do not belong to an existing folder should be moved to a new folder "General".

Screenshot from 2024-07-15 15-21-52 Screenshot from 2024-07-15 15-22-19 Screenshot from 2024-07-15 15-22-36
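
The command-line part of the migration (steps 1, 2, 5 and 6) can be sketched roughly as follows. This is only a dry-run illustration: the host name `postgres.example.org` is a placeholder, connecting with `-U postgres` is just one possible way to run reset.sql "as postgres", and the files are assumed to be in the working directory. The function only builds and prints the commands; it does not execute anything.

```python
import shlex


def migration_commands(pg_host: str, load_file: str = "grafana.load") -> list[list[str]]:
    """Return the commands for steps 1, 2, 5 and 6 in order, without running them."""
    return [
        # Step 1: recreate the grafana database (this DROPS the existing one).
        ["psql", "-h", pg_host, "-U", "postgres", "-f", "reset.sql"],
        # Step 2: create the Grafana v9.2.10 tables as the grafana user.
        ["psql", "-h", pg_host, "-U", "grafana", "-d", "grafana", "-f", "schema.sql"],
        # Step 5: copy the data from SQLite to PostgreSQL.
        ["pgloader", load_file],
        # Step 6: deploy Grafana 11 (only after this PR is merged).
        ["ansible-playbook", "grafana.yml"],
    ]


if __name__ == "__main__":
    # pg_host is a hypothetical placeholder, not the real database host.
    for argv in migration_commands("postgres.example.org"):
        print(shlex.join(argv))
```

Steps 3, 4, 7 and 8 (installing pgloader, editing grafana.load, enabling the Jenkins project and moving dashboards) remain manual and are not covered by this sketch.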

Closes usegalaxy-eu/issues#558.

bgruening commented 1 month ago

Cool beans! Thanks a lot!

I have a technical question, but please deploy. I assume old links to dashboards will probably not work anymore. Is there a way in Grafana to keep stable links? Or should we use bitly or, even better, gx.io to create stable links to dashboards?

kysrpex commented 1 month ago

> Cool beans! Thanks a lot!
>
> I have a technical question, but please deploy. I assume old links to dashboards will probably not work anymore. Is there a way in Grafana to keep stable links? Or should we use bitly or, even better, gx.io to create stable links to dashboards?

With this migration path all dashboards keep the same uids, and Grafana has not changed how it forms the urls, so links do not break (see screenshots below).

Screenshot from 2024-07-16 09-01-57 Screenshot from 2024-07-16 09-02-13
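
To illustrate why preserved uids keep links stable: Grafana dashboard links have the shape `<base>/d/<uid>/<slug>`, so a link built from the same uid resolves to the same dashboard after the migration. The helper below is a hypothetical illustration (not part of Grafana or of this playbook); the example values are taken from a dashboard link posted in this thread.

```python
from urllib.parse import urlencode


def dashboard_url(base: str, uid: str, slug: str, **params: str) -> str:
    """Build a Grafana dashboard link of the form <base>/d/<uid>/<slug>?<params>."""
    url = f"{base.rstrip('/')}/d/{uid}/{slug}"
    if params:
        # Query parameters keep the order in which they were passed.
        url += "?" + urlencode(params)
    return url


print(dashboard_url(
    "https://stats.galaxyproject.eu", "000000004", "galaxy",
    orgId="1", refresh="10s", viewPanel="7",
))
# prints https://stats.galaxyproject.eu/d/000000004/galaxy?orgId=1&refresh=10s&viewPanel=7
```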

bgruening commented 1 month ago

Oh nice!!!

bgruening commented 1 month ago

Can you please also sum that all up in a nice blog post and, in addition, in an operations update?

Thanks a lot!

mira-miracoli commented 1 month ago

oh I guess my github was stuck

kysrpex commented 1 month ago

@sanjaysrikakulam @mira-miracoli Important detail. You may find that some panels do not work. You can very easily fix them like this:

  1. Find a panel that does not work. Screenshot from 2024-07-23 15-59-20

  2. Click "Edit". Screenshot from 2024-07-23 15-59-34

  3. Click the pencil on the bottom right (raw query mode). The preview will start working instantly. Screenshot from 2024-07-23 16-00-01 Screenshot from 2024-07-23 16-00-20

  4. Save the dashboard (the screenshot is wrong, use "Save", not "Apply"). Screenshot from 2024-07-23 16-00-34

Don't ask me why this works. Actually, if you switch the panel back to visual editor mode, it breaks again.

bgruening commented 1 month ago

This one https://stats.galaxyproject.eu/d/000000004/galaxy?orgId=1&refresh=10s&viewPanel=7 does not seem to work. Maybe wrong host?

kysrpex commented 1 month ago

> This one https://stats.galaxyproject.eu/d/000000004/galaxy?orgId=1&refresh=10s&viewPanel=7 does not seem to work. Maybe wrong host?

The list of hosts is populated by the InfluxQL query SHOW TAG VALUES FROM "cluster.queue" WITH KEY = "host", but the only possible outcome of that query right now is "maintenance.galaxyproject.eu"; see the InfluxDB measurement query below.

> SELECT * FROM "cluster.queue" ORDER BY time desc LIMIT 10
name: cluster.queue
time                 count engine host                         schedd                state
----                 ----- ------ ----                         ------                -----
2024-07-24T12:38:00Z 0     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu suspended
2024-07-24T12:38:00Z 539   condor maintenance.galaxyproject.eu sn06.galaxyproject.eu running
2024-07-24T12:38:00Z 426   condor maintenance.galaxyproject.eu sn06.galaxyproject.eu idle
2024-07-24T12:38:00Z 2     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu held
2024-07-24T12:38:00Z 0     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu removed
2024-07-24T12:38:00Z 0     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu completed
2024-07-24T12:37:00Z 0     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu removed
2024-07-24T12:37:00Z 0     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu completed
2024-07-24T12:37:00Z 0     condor maintenance.galaxyproject.eu sn06.galaxyproject.eu suspended
2024-07-24T12:37:00Z 543   condor maintenance.galaxyproject.eu sn06.galaxyproject.eu running

Changing to SHOW TAG VALUES FROM "cluster.queue" WITH KEY = "schedd" allows choosing sn06.galaxyproject.eu too, and then the route timings work.

This more likely has to do with the setup of the maintenance node and the HTCondor migration than with the Grafana update.
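
As a rough illustration of the difference between the two tag keys, the snippet below imitates SHOW TAG VALUES in plain Python over the tags of the six 12:38 rows shown above. It is not an InfluxDB client call; `tag_values` is a hypothetical helper that simply collects the distinct values of one tag.

```python
# Tags of the six rows at 2024-07-24T12:38:00Z from the "cluster.queue" output above.
STATES = ["suspended", "running", "idle", "held", "removed", "completed"]
rows = [
    {
        "host": "maintenance.galaxyproject.eu",
        "schedd": "sn06.galaxyproject.eu",
        "state": state,
    }
    for state in STATES
]


def tag_values(rows: list[dict[str, str]], key: str) -> list[str]:
    """Return the distinct values of one tag, similar in spirit to SHOW TAG VALUES ... WITH KEY = key."""
    return sorted({row[key] for row in rows})


print(tag_values(rows, "host"))    # ['maintenance.galaxyproject.eu']
print(tag_values(rows, "schedd"))  # ['sn06.galaxyproject.eu']
```

With these rows, the "host" key can only ever offer the maintenance node, while the "schedd" key offers sn06.galaxyproject.eu.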

Note to all: if you see fixable problems like these on the dashboards go ahead and fix them.