medic / cht-watchdog

Configuration for deploying a monitoring/alerting stack for CHT
GNU Affero General Public License v3.0

Using custom queries is deprecated in postgres_exporter #81

Closed jkuester closed 6 months ago

jkuester commented 1 year ago

Intro

So, apparently the custom queries functionality in postgres_exporter has been deprecated. That functionality lets us collect metrics based on the data in the Postgres database, which is pretty much the only reason we are using postgres_exporter. It seems that the maintainers of postgres_exporter see the project's main purpose as providing metrics specific to the inner workings of the Postgres instance. They recommend using a different exporter for collecting metrics from the actual data in the Postgres database.

Test setup

Pinning these to the top of the ticket; @mrjones-plip to keep them up to date:

  1. check out these repos
  2. using the script/docker-helper-4.x directory in the CHT Core repo, start a docker helper instance of CHT Core. Note the URL and port.
  3. assuming docker helper gave you a URL and port of 192-168-68-17.local-ip.medicmobile.org:10464, start your couch2pg instance by changing into the cht-couch2pg directory and running the following (substitute your own host and port in COUCHDB_URL):
    COUCH2PG_SLEEP_MINS=0.1 \
      COUCHDB_URL=https://medic:password@172-17-0-1.local-ip.medicmobile.org:10464/medic \
      docker compose up -d
  4. optionally, connect with a postgres client to localhost:5432 (username cht_couch2pg, password cht_couch2pg_password, database cht) to ensure the connection is working
  5. cd into the watchdog repo directory and check out the 81-sql-exporter branch
  6. still in the watchdog repo, update your cht-instances.yml to have the URL from step 2. In this example it's 192-168-68-17.local-ip.medicmobile.org:10464
  7. copy exporters/postgres/sql_servers_example.yml to exporters/postgres/sql_servers.yml. Update the 172-17-0-1.local-ip.medicmobile.org:10464 value to match the CHT Core URL from step 2 above.
  8. still in the top level cht-watchdog directory, run the restart script:
    ./development/kill.start.ips.sh
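
Step 4 is optional, but a quick reachability check can catch a stopped container before moving on. The following is a minimal Python sketch, assuming Postgres is published on localhost:5432 as in step 4. `port_is_open` is a hypothetical helper, and it only confirms that the TCP port accepts connections, not that the credentials or the database are valid.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the couch2pg Postgres instance listens on localhost:5432.
    print("postgres reachable:", port_is_open("localhost", 5432))
```

A real login with a postgres client, as described in step 4, remains the stronger check.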

Test steps

  1. In the output of step 8 above, find the service named *-sql_exporter-* and go to its URL (http://172.30.0.4:9399/metrics in this case).

    Services:
    
    cht-watchdog-grafana-1          http://172.30.0.3:3000
    cht-watchdog-prometheus-1       http://172.30.0.5:9090/targets?search=
    cht-watchdog-json-exporter-1    http://172.30.0.2:7979/metrics
    cht-watchdog-sql_exporter-1     http://172.30.0.4:9399/metrics
  2. ensure the web page looks like this:
    # HELP couch2pg couch2pg backlog.
    # TYPE couch2pg gauge
    couch2pg{db="_users",job="db_targets",target="local-cht"} 1
    couch2pg{db="medic",job="db_targets",target="local-cht"} 186
    couch2pg{db="medic-logs",job="db_targets",target="local-cht"} 10
    couch2pg{db="medic-sentinel",job="db_targets",target="local-cht"} 79
    couch2pg{db="medic-users-meta",job="db_targets",target="local-cht"} 3
    # HELP scrape_duration_seconds How long it took to scrape the target in seconds
    # TYPE scrape_duration_seconds gauge
    scrape_duration_seconds{job="db_targets",target="local-cht"} 0.008943289
    # HELP up 1 if the target is reachable, or 0 if the scrape failed
    # TYPE up gauge
    up{job="db_targets",target="local-cht"} 1
  3. log into the dev watchdog instance at http://localhost:3000 (user medic, password password) and go to the main "admin overview" dashboard. Ensure the "Couch2pg Backlog" panel is working. It should show a backlog of 0.
  4. stop the couch2pg container by name (e.g. docker stop cht-couch2pg-cht-couch2pg-1).
  5. add a household to your CHT instance. After a few minutes you should see a backlog greater than zero.
  6. start the couch2pg container again (e.g. docker start cht-couch2pg-cht-couch2pg-1). You should see the backlog go back to 0.
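
The visual check of the /metrics page in step 2 can also be scripted. The following is a minimal Python sketch that parses the sample output shown in step 2 and confirms every target reports up = 1. The helper names (`parse_metrics`, `targets_up`, `values_by_db`) are hypothetical, and the parser is deliberately simplified: it ignores timestamps and escaped quotes in label values, so it is not a full implementation of the Prometheus text format.

```python
import re

# A sample line looks like: name{label="value",...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Sample lines copied from the /metrics output in step 2.
EXAMPLE = """
# HELP couch2pg couch2pg backlog.
# TYPE couch2pg gauge
couch2pg{db="_users",job="db_targets",target="local-cht"} 1
couch2pg{db="medic",job="db_targets",target="local-cht"} 186
couch2pg{db="medic-logs",job="db_targets",target="local-cht"} 10
couch2pg{db="medic-sentinel",job="db_targets",target="local-cht"} 79
couch2pg{db="medic-users-meta",job="db_targets",target="local-cht"} 3
# HELP up 1 if the target is reachable, or 0 if the scrape failed
# TYPE up gauge
up{job="db_targets",target="local-cht"} 1
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a simple "name{labels} value" line
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def targets_up(samples):
    """Map each target label to True when its `up` sample equals 1."""
    return {labels["target"]: value == 1.0
            for name, labels, value in samples if name == "up"}


def values_by_db(samples, metric="couch2pg"):
    """Map each db label to the value of the given metric."""
    return {labels["db"]: value
            for name, labels, value in samples if name == metric}


if __name__ == "__main__":
    parsed = parse_metrics(EXAMPLE)
    print(targets_up(parsed))
    print(values_by_db(parsed))
```

To use it against a live exporter, fetch the /metrics URL from step 1 and pass the response body to `parse_metrics` in place of `EXAMPLE`.
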
jkuester commented 1 year ago

When addressing this, it would be a good idea to consider how we will support getting metrics from a db populated by cht-sync vs couch2pg. Both should have first-class support and will probably use very similar exporters/configurations.

(Also worth calling out that switching exporters may result in a breaking change that needs a major version bump of Watchdog, unless the format of the DB connection config files happens to match.)

andrablaj commented 1 year ago

@lorerod FYI about the cht-sync vs couch2pg support

mrjones-plip commented 7 months ago

@jkuester - thanks again for the call earlier! Here's a status update that might be easier to digest.

cc @eljhkrr

Demo steps

  1. set up an SSH tunnel to the RDBMS to expose postgres on localhost. I often bind it to the shared docker interface at 172.17.0.1, as then both my local workstation and docker containers can access it: ssh -L 172.17.0.1:5432:localhost:5432 mrjones@rdbms.dev.medicmobile.org -p 34796
  2. use the 81-sql-exporter branch per my PR
  3. copy the sql_servers_example.yml file to sql_servers.yml and add two targets: cmmb-kenya-app and moh_mali_chw
  4. run the compose to start the server from the root of cht-watchdog directory: docker compose -f docker-compose.yml -f exporters/postgres/compose.yml up -d
  5. do a curl on the /metrics endpoint on the now running exporter. note that the result is: metric (1) * databases (5) * instances (2) = total metrics (10).
  6. the final result is the "HTML" below, which is ready for Prometheus to scrape (not yet implemented)

Demo HTML (er "HTML")

    # HELP couch2pg couch2pg backlog.
    # TYPE couch2pg gauge
    couch2pg{db="_users",job="db_targets",target="cmmb-kenya-app"} 1329
    couch2pg{db="_users",job="db_targets",target="moh_mali_chw"} 3414
    couch2pg{db="medic",job="db_targets",target="cmmb-kenya-app"} 1.44244e+06
    couch2pg{db="medic",job="db_targets",target="moh_mali_chw"} 3.308697e+06
    couch2pg{db="medic-logs",job="db_targets",target="cmmb-kenya-app"} 0
    couch2pg{db="medic-logs",job="db_targets",target="moh_mali_chw"} 91042
    couch2pg{db="medic-sentinel",job="db_targets",target="cmmb-kenya-app"} 1.94765e+06
    couch2pg{db="medic-sentinel",job="db_targets",target="moh_mali_chw"} 6.566893e+06
    couch2pg{db="medic-users-meta",job="db_targets",target="cmmb-kenya-app"} 7018
    couch2pg{db="medic-users-meta",job="db_targets",target="moh_mali_chw"} 12386
    # HELP scrape_duration_seconds How long it took to scrape the target in seconds
    # TYPE scrape_duration_seconds gauge
    scrape_duration_seconds{job="db_targets",target="cmmb-kenya-app"} 0.297466424
    scrape_duration_seconds{job="db_targets",target="moh_mali_chw"} 0.299140411
    # HELP up 1 if the target is reachable, or 0 if the scrape failed
    # TYPE up gauge
    up{job="db_targets",target="cmmb-kenya-app"} 1
    up{job="db_targets",target="moh_mali_chw"} 1
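
The arithmetic from step 5 (1 metric * 5 databases * 2 instances = 10 samples) can be checked against this output. The following is a minimal Python sketch that counts the distinct db and target labels on the couch2pg lines copied from above. The regular expression assumes the label order shown here (db, job, target), which is an assumption about this particular output rather than a guarantee of the exporter.

```python
import re

# couch2pg sample lines copied from the demo output above.
DEMO_OUTPUT = """\
couch2pg{db="_users",job="db_targets",target="cmmb-kenya-app"} 1329
couch2pg{db="_users",job="db_targets",target="moh_mali_chw"} 3414
couch2pg{db="medic",job="db_targets",target="cmmb-kenya-app"} 1.44244e+06
couch2pg{db="medic",job="db_targets",target="moh_mali_chw"} 3.308697e+06
couch2pg{db="medic-logs",job="db_targets",target="cmmb-kenya-app"} 0
couch2pg{db="medic-logs",job="db_targets",target="moh_mali_chw"} 91042
couch2pg{db="medic-sentinel",job="db_targets",target="cmmb-kenya-app"} 1.94765e+06
couch2pg{db="medic-sentinel",job="db_targets",target="moh_mali_chw"} 6.566893e+06
couch2pg{db="medic-users-meta",job="db_targets",target="cmmb-kenya-app"} 7018
couch2pg{db="medic-users-meta",job="db_targets",target="moh_mali_chw"} 12386
"""

# Capture the (db, target) pair from every couch2pg sample line.
PAIR_RE = re.compile(
    r'^couch2pg\{db="([^"]+)",job="[^"]+",target="([^"]+)"\}',
    flags=re.MULTILINE,
)

pairs = PAIR_RE.findall(DEMO_OUTPUT)
databases = {db for db, _ in pairs}
targets = {target for _, target in pairs}

# One metric, so the sample count should equal databases * targets.
expected_total = 1 * len(databases) * len(targets)

if __name__ == "__main__":
    print(f"databases={len(databases)} targets={len(targets)} "
          f"samples={len(pairs)} expected={expected_total}")
```

If a target or database is missing from the output, the sample count and the product no longer match, which makes this a cheap sanity check when adding targets to sql_servers.yml.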

Demo video

https://github.com/medic/cht-watchdog/assets/8253488/5805ed2f-dd19-4314-9d1f-c0df1dd185d0

mrjones-plip commented 6 months ago

once https://github.com/medic/cht-docs/pull/1367 is merged I'll close this ticket

lorerod commented 6 months ago

Tested this locally: macOS Sonoma, Docker Desktop 4.29.0, Docker Compose v1.29.2.

I followed these steps with a couple of tweaks for Mac.

  1. Followed these steps and changed the dev script's first line to #!/usr/local/bin/bash

  2. Got the IP for my computer on the WiFi and then used that everywhere the 172.17.0.1 IP was used in the test steps.

  3. http://127.0.0.1:9399/metrics looks like this:

    # HELP couch2pg_progress_sequence couch2pg backlog.
    # TYPE couch2pg_progress_sequence counter
    couch2pg_progress_sequence{db="_users",job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 1
    couch2pg_progress_sequence{db="medic",job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 182
    couch2pg_progress_sequence{db="medic-logs",job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 10
    couch2pg_progress_sequence{db="medic-sentinel",job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 79
    couch2pg_progress_sequence{db="medic-users-meta",job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 3
    # HELP scrape_duration_seconds How long it took to scrape the target in seconds
    # TYPE scrape_duration_seconds gauge
    scrape_duration_seconds{job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 0.0011491000000000001
    # HELP up 1 if the target is reachable, or 0 if the scrape failed
    # TYPE up gauge
    up{job="db_targets",target="192-168-100-62.local-ip.medicmobile.org:10456"} 1
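
Swapping in a different IP, as in tweak 2, means rebuilding the local-ip.medicmobile.org hostname each time. The following is a minimal Python sketch, assuming the hostname is simply the IPv4 address with dots replaced by dashes. That pattern is inferred from the examples in this thread (172.17.0.1 and 172-17-0-1.local-ip.medicmobile.org), and `local_ip_host` is a hypothetical helper.

```python
import ipaddress


def local_ip_host(ip: str, port: int) -> str:
    """Build a '<dashed-ip>.local-ip.medicmobile.org:<port>' host string."""
    # IPv4Address raises a ValueError subclass on malformed input.
    address = ipaddress.IPv4Address(ip)
    dashed = str(address).replace(".", "-")
    return f"{dashed}.local-ip.medicmobile.org:{port}"


if __name__ == "__main__":
    # IP and port taken from the target label in the output above.
    print(local_ip_host("192.168.100.62", 10456))
```

The resulting string is the value used as the target label in the output above, and it is what would go into cht-instances.yml and sql_servers.yml.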

Thanks!

mrjones-plip commented 6 months ago

Thanks @lorerod for all the testing! Much appreciated.

Now that your testing went well and the docs PR is merged, closing out this ticket.