medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT
GNU Affero General Public License v3.0

feat(#112): update default service versions #113

Closed · jkuester closed this 1 month ago

jkuester commented 2 months ago

Update the default versions for Prometheus and Grafana:

There were no major breaking changes to note for the Prometheus upgrade. For Grafana, I reviewed the release notes and upgrade guides. While many things did change, I did not find anything that required a manual migration when updating a Watchdog instance. All of our default configuration seems compatible with the new version of Grafana.

Add SQL_EXPORTER_VERSION envar

Following the example of the other docker images, I am pinning the version of burningalchemist/sql_exporter (to its latest current release) and I have added the SQL_EXPORTER_VERSION envar to the .env.example file as the place where users can configure a custom version of the sql exporter.
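For reference, a sketch of the pinning pattern described above, where the compose file falls back to a pinned tag when the environment variable is unset (the service name and tag here are illustrative, not copied from the actual compose file):

```yaml
services:
  sql-exporter:
    # Uses SQL_EXPORTER_VERSION from .env when set, else the pinned default
    image: burningalchemist/sql_exporter:${SQL_EXPORTER_VERSION:-latest}
```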

Bump node dev dependencies in package.json

I was able to lift the dependencies to their latest versions except for:

The new version of the conventional-commits libraries is no longer compatible with Node 18. So, I set our minimum Node engine config to match the required version of Node 20. I also updated our GitHub Action workflow configs to run with the new Node version.
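The minimum-engine change amounts to something like the following in package.json (the exact range we ship may differ; this is just the shape of the config):

```json
{
  "engines": {
    "node": ">=20"
  }
}
```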

The new major version of husky required some minor migration steps for our husky config, detailed here. I did validate that our git pre-commit hook still works for me locally.

Dependabot

I have added dependabot config (as Andra suggested in the issue). It is pretty straightforward, but @mrjones-plip I think you will still need to use your admin powers to actually enable the bot in the repo settings.... (Unless it has the necessary authorization at the org level... :thinking: )
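For anyone following along, the dependabot config is roughly of this form (a minimal sketch; the actual .github/dependabot.yml in the PR may use different ecosystems or intervals):

```yaml
version: 2
updates:
  - package-ecosystem: "npm"
    directory: "/"
    schedule:
      interval: "weekly"
```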

jkuester commented 2 months ago

> in our compose files we pin to a version.... However, that's actually the default value for when the environment variable GRAFANA_VERSION isn't set:

Correct. If you copy-paste the .env.example file and do not edit/comment-out the *_VERSION variables, you will end up getting the latest images. I agree this is a bit counter-intuitive (and can easily result in undesirable behavior, since we don't want folks accidentally running on latest). At the same time, I don't want to set the pinned version in both the compose file and the env file (since that means extra places to update and more complexity). My proposal is that we comment out the *_VERSION variables in the .env.example file. They are still in there so it is clear they can be customized, but folks will not "accidentally" be getting latest.
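Concretely, the .env.example file would look something like this (version values here are illustrative, not the ones actually committed):

```shell
# Uncomment to override the versions pinned in the compose files
#GRAFANA_VERSION=11.2.0
#PROMETHEUS_VERSION=v2.54.1
#SQL_EXPORTER_VERSION=latest
```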

> why are we introducing a test matrix of node 20 and 22

IMHO, it is a best practice to confirm your Node project builds on each of the supported LTS versions. The impetus behind making the change now was that semantic-release and conventional-commits dropped support for Node 18. So, I had to update the Node version used by the release.yml and conventional-commits.yml. Because of that, I also wanted to use the same Node version in the integration-tests.yml so that if there was a Node issue running npm ci, it would happen in integration-tests.yml before we ever get to release.yml.

I could have just set everything to build with either Node 20 or 22. The downside of using just 20 is that we will have to update the workflow config again sooner (when 20 is EOL) and developers might run into unknown issues if they try building the project with Node 22. The downside of using just 22 is the opposite. Developers using 20 might run into unknown issues since we only test with 22. So, I decided the best balance was to uplift the release/conventional-commits stuff to use 22 and then run our tests with both 20 and 22 to make sure everything gets covered....
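The matrix setup described above looks roughly like this in the workflow config (a sketch, assuming the standard actions/setup-node approach; the real integration-tests.yml may differ in step details):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Test against both supported LTS versions
        node-version: [20, 22]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
      - run: npm ci
      - run: npm test
```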

mrjones-plip commented 2 months ago

Awesome! Agreed on your plan to comment out the variables in the env.example file - good thinking!

Also - thanks for explaining the logic of the node test matrix - makes perfect sense.

Finally, I think we should dogfood this change. I propose:

  1. I'll do a test upgrade locally from the current versions to this branch and see how it goes (mainly w/ grafana)
  2. we do the same on production medic watchdog
  3. if all goes well, we merge this branch to main

I think it's a bit of overkill, but I also think there's no rush to merge and we can slow-roll the PR for a week or two as needed while we test and watch the burn-in on prod.

jkuester commented 2 months ago

I like your plan of dogfooding this change! :+1: Since it is a major version bump for Grafana, I think it probably deserves a closer look than normal.

But, also like you said, no rush on any of this!

mrjones-plip commented 2 months ago

OK on prod watchdog we have:

I'm going to test setting these up locally, throw some data in there and then cut over to the 112_upgrade_services branch and see what happens :crossed_fingers: To first bootstrap on the older versions I'll have to (ironically) pin them to these versions, otherwise, you know, they'll go to latest :laughing:

root@watchdog:~# hostname
watchdog.app.medicmobile.org

root@watchdog:~# docker image ls|grep prom/prometheus
prom/prometheus                         latest     e350b167c4fa   5 months ago    262MB
prom/prometheus                         <none>     1d3b7f56885b   5 months ago    262MB
prom/prometheus                         <none>     75972a31ad25   16 months ago   234MB

root@watchdog:~# docker image inspect prom/prometheus| jq ".[0].RepoDigests"
[
  "prom/prometheus@sha256:dec2018ae55885fed717f25c289b8c9cff0bf5fbb9e619fb49b6161ac493c016"
]
mrjones-plip commented 2 months ago

Local dev upgrade went super well! I set my .env file as below, stood up an instance, and let it gather data from gamma and moh mali for a good 20 min. Then I ran docker compose down, checked out this branch, commented out my 3 versions in .env, and did a docker compose pull.

After checking that the images in docker image ls looked good, I did a docker compose up -d. Everything more or less instantly upgraded!

I've verified that prod watchdog is backed up in EC2 snapshots, so I'll do the prod upgrade next Tue when I'm back from being out on Monday!

starting .env file

grep = .env
GRAFANA_ADMIN_USER=medic
GRAFANA_ADMIN_PASSWORD=password
GRAFANA_VERSION=10.4.1
GRAFANA_PORT=3000
GRAFANA_BIND=127.0.0.1
GRAFANA_DATA="./grafana/data"
GRAFANA_PLUGINS=grafana-discourse-datasource
JSON_EXPORTER_VERSION=latest
PROMETHEUS_VERSION=v2.51.1
PROMETHEUS_DATA="./prometheus/data"
PROMETHEUS_RETENTION_TIME=60d
SQL_EXPORTER_IP=127.0.0.1
SQL_EXPORTER_PORT=9399
PROMETHEUS_BIND=127.0.0.1
PROMETHEUS_PORT=9090
mrjones-plip commented 1 month ago

production is updated:

  1. ssh in to prod watchdog
  2. checkout this branch 112_upgrade_services
  3. stop the cronjob that checks for new versions every 5 min, so it doesn't clobber the version we're going to
  4. run down:
    cd ~/cht-monitoring                           
    docker compose \
        -f docker-compose.yml \
        -f exporters/postgres/compose.yml \
        -f ../caddy-compose.yml \                                                        
        -f ../docker-compose-cht3x.yml \
        -f data-ingest/extra-sql-compose.yml \
        -f node-exporter/compose.yml \                                                   
        down   
  5. run pull:
    docker compose \
        -f docker-compose.yml \
        -f exporters/postgres/compose.yml \
        -f ../caddy-compose.yml \
        -f ../docker-compose-cht3x.yml \
        -f data-ingest/extra-sql-compose.yml \
        -f node-exporter/compose.yml \
    pull
  6. run up:
    docker compose \
        -f docker-compose.yml \
        -f exporters/postgres/compose.yml \
        -f ../caddy-compose.yml \
        -f ../docker-compose-cht3x.yml \
        -f data-ingest/extra-sql-compose.yml \
        -f node-exporter/compose.yml \
    up -d
  7. verify versions are good - prometheus:
    curl 172.21.0.3:9090/api/v1/status/buildinfo|jq
    {
     "status": "success",
     "data": {
       "version": "2.54.1",
       "revision": "e6cfa720fbe6280153fab13090a483dbd40bece3",
       "branch": "HEAD",
       "buildUser": "root@812ffd741951",
       "buildDate": "20240827-10:56:41",
       "goVersion": "go1.22.6"
     }
    }

    and at login grafana shows: 11.2.0 (2a88694fd3)

  8. it's been over 10 min and data seems to keep flowing - I'm calling it good!

Over to @jkuester to finish up this PR

mrjones-plip commented 1 month ago

@jkuester - when you merge this to main, please un-comment the cronjob on watchdog so it starts pulling again automatically:

root@watchdog:~/cht-monitoring# crontab -l

# check for new cht watchdog version, upgrade if new version & announce in slack
#*/5  * * * * /root/continious-deployment.sh
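Re-enabling the entry just means stripping the leading `#`. A quick sketch with sed (operating on the cron line shown above; to actually apply it you'd feed `crontab -l` through this and back into `crontab -`):

```shell
# Strip the leading '#' from the commented-out cron entry
echo '#*/5  * * * * /root/continious-deployment.sh' | sed 's/^#//'
```

This prints the active form of the entry; editing via `crontab -e` works just as well if you prefer doing it by hand.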
jkuester commented 1 month ago

@mrjones-plip I need you to hit the Approve button (from the "Files changed" tab) before I can actually merge this! :sweat_smile:


mrjones-plip commented 1 month ago

Sorry! shoulda remembered that.

medic-ci commented 1 month ago

:tada: This PR is included in version 1.15.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: