[Feat]: Add more metrics to PostgreSQL Go collector

shyamvalsan commented 2 years ago

Problem

PostgreSQL collector is missing some important metrics

Description

Add following metrics to Go collector:

1. Database count 2. Backend count 3. Dead Row Ratio 4. Number of active queries 5. Number of idle queries 6. Number of idle in transaction queries 7. Number of slow queries (>10s) 8. Number of slow queries (>1s) 9. Number of slow queries (>100ms) 10. Index cache hit ratio 11. Hot updated rows 12. Last vacuum age 13. Last autovacuum age 14. Last analyze age 15. Last autoanalyze age 16. Long running transactions count 17. Count of table columns which are entirely NULL 18. Count of table columns with only one value 19. Age of oldest table vacuum 20. Age of oldest table analyze 21. Count of tables not vacuumed in last 7 days 22. Count of tables not analyzed in last 7 days 23. Count of unused indexes 24. Count of blocked queries 25. Percentage of tables with bloat 26. Total size of table bloat 27. Percentage of total table size that is bloated 28. Percentage of indexes with bloat 29. Total size of index bloat 30. Percentage of total index size that is bloated 31. Number of sequential scans

Importance

must have

Value proposition

Make PostgreSQL monitoring use-case more useful

Proposed implementation

[x] Database count

SELECT COUNT(*) datname from pg_database;

[x] Backend count
```
SELECT count(*) FROM pg_stat_activity;
```

[x] Dead Row Ratio

SELECT relname AS TableName, n_live_tup AS LiveTuples, n_dead_tup AS DeadTuples, n_dead_tup/NULLIF(n_live_tup,0) AS DeadRowRatio FROM pg_stat_user_tables;

[x] Number of active queries

SELECT COUNT(*) FROM pg_stat_activity WHERE state='active';

[x] Number of idle queries

SELECT COUNT(*) FROM pg_stat_activity WHERE state='idle';

[x] Number of idle in transaction queries

SELECT COUNT(*) FROM pg_stat_activity WHERE state='idle in transaction';

[x] Number of slow queries (>10s)

SELECT COUNT(*) query FROM pg_stat_activity WHERE now() - query_start > '10 seconds'::interval;

[x] Number of slow queries (>1s)

SELECT COUNT(*) query FROM pg_stat_activity WHERE now() - query_start > '1 seconds'::interval;

[x] Number of slow queries (>100ms)

SELECT COUNT(*) query FROM pg_stat_activity WHERE now() - query_start > '0.1 seconds'::interval;

[x] Index cache hit ratio

SELECT SUM(idx_blks_read) AS idx_blks_read, SUM(idx_blks_hit)  AS idx_blks_hit, SUM(idx_blks_hit) / GREATEST(SUM(idx_blks_hit) + SUM(idx_blks_read), 1)::float AS ratio FROM pg_statio_user_indexes;

[x] Hot updated rows

SELECT relname AS TableName, n_tup_hot_upd AS HotUpdated FROM pg_stat_user_tables;

[x] Last vacuum age

SELECT relname AS TableName, now() - last_vacuum AS LastVacuumAge FROM pg_stat_user_tables ORDER by now() - last_vacuum DESC;

[x] Last autovacuum age

SELECT relname AS TableName, now() - last_autovacuum AS LastAutovacuumAge FROM pg_stat_user_tables ORDER by now() - last_autovacuum DESC;

[x] Last analyze age

SELECT relname AS TableName, now() - last_analyze AS LastAnalyzeAge FROM pg_stat_user_tables ORDER by now() - last_analyze DESC;

[x] Last autoanalyze age

SELECT relname AS TableName, now() - last_autoanalyze AS LastAutoAnalyzeAge FROM pg_stat_user_tables ORDER by now() - last_autoanalyze DESC;

[x] Long running transactions count (older than 24 hours)

SELECT count(*) FROM pg_stat_activity WHERE xact_start < now() - '24 hours'::interval;

[x] Count of tables not vacuumed in last 7 days

SELECT schemaname || '.' || relname, now()-last_vacuum FROM pg_stat_all_tables WHERE last_vacuum < now() - '7 days'::interval;

[x] Count of tables not analyzed in last 7 days

SELECT schemaname || '.' || relname, now()-last_analyze FROM pg_stat_all_tables WHERE last_analyze < now() - '7 days'::interval;

[x] Age of oldest table vacuum

SELECT schemaname || '.' || relname, now()-last_vacuum FROM pg_stat_all_tables ORDER BY last_vacuum ASC NULLS LAST LIMIT 1;

[x] Age of oldest table analyze

SELECT schemaname || '.' || relname, now()-last_analyze FROM pg_stat_all_tables ORDER BY last_analyze ASC NULLS LAST LIMIT 1;

[x] Number of sequential scans

SELECT SUM(seq_scan) as seq_scans FROM pg_stat_user_tables;

shyamvalsan commented 2 years ago

cc: @thiagoftsm @vlvkobal @ilyam8

shyamvalsan commented 2 years ago

There's a few more remaining, but those need some more work, will create a separate ticket for those. Also need to spec out how we'll visualize queries/tables/indexes in tabular form.

cc: @cakrit @sashwathn @amalkov

thiagoftsm commented 2 years ago

Hello @shyamvalsan,

Some questions and suggestions:

why do you consider that > 100ms is slow?

Are we going to show datnames without a name?

postgres=# SELECT datname,numbackends FROM pg_catalog.pg_stat_database ORDER BY numbackends DESC, datname ASC;
datname  | numbackends 
-----------+-------------
postgres  |           1
template0 |           0
template1 |           0
       |           0

Let us avoid one run more than one query to get data from one table, I suggest to use:

> SELECT COUNT(*),state FROM pg_stat_activity WHERE state in ('idle', 'idle in transaction', 'active') group by state;

and create charts for the queries.

Probably we want to use width_bucket to work with timestamp and aggregate data. :thinking:

ilyam8 commented 2 years ago

Backends per database

We have it

shyamvalsan commented 2 years ago

@ilyam8 - I've removed "Backend per database" from the description. I missed it somehow 😄

@thiagoftsm - good suggestions.

For queries I would have ideally liked a histogram chart and we could put queries into 0-100ms, 100-1000ms, 1000-10000ms and 10000+ buckets. Is that feasible? 100ms is not necessarily slow, but a 900ms query is slow.. a histogram would be the best way to display all of this in one chart.
👍 for combining the three queries into one while retrieving data

thiagoftsm commented 2 years ago

@shyamvalsan this was my thought when I suggested to use width_bucket, probably we could create a chart like we have now for filesystem latencies.

shyamvalsan commented 2 years ago

Hmm - if we can do histogram view, then there might be a quick win to show:

Table bloat
Index bloat

As histograms, so we can see, for example, what % of tables have a bloat > 10%

I was previously considering we needed the new table charts to show this data - and we still do, to be able to pin point which table or index has how much bloat. But a quick indication of whether or not you have a bloat problem can be done with a histogram.

thiagoftsm commented 2 years ago

@shyamvalsan I agree that we still need a chart type for this, but for while we are using stack charts.

shyamvalsan commented 2 years ago

@ilyam8 @vlvkobal I have updated the ticket description to add more metrics (metrics 17 to 30 are newly added, quoting them below) - have also updated the proposed implementation section to list the queries that will give the values for these metrics OR links to the queries in GH.

Please let me know in case something doesn't make sense.

Count of table columns which are entirely NULL

Count of table columns with only one value

Age of oldest table vacuum

Age of oldest table analyze

Count of tables not vacuumed in last 7 days

Count of tables not analyzed in last 7 days

Count of unused indexes

Count of blocked queries

Percentage of tables with bloat

Total size of table bloat

Percentage of total table size that is bloated

Percentage of indexes with bloat

Total size of index bloat

Percentage of total index size that is bloated

Number of sequential scans

cc: @amalkov @sashwathn

ilyam8 commented 2 years ago

30 queries. We need to measure how much doing so many queries a second affects Postgres. In addition, all of them should be tested with Postgres starting from version 10 (100% some of the queries are version specific).

thiagoftsm commented 2 years ago

30 queries. We need to measure how much doing so many queries a second affects Postgres. In addition, all of them should be tested with Postgres starting from version 10 (100% some of the queries are version specific).

I agree with every word written here. We cannot run this every second, else we could become the client consuming more resources from server (Of course this would happen on small servers), probably this needs to be at least every 10 seconds.

shyamvalsan commented 2 years ago

@ilyam8 @thiagoftsm Agree.

We do not really need every metric listed here at 1 second granularity, that will be overkill. For some of these metrics a 30s or even longer granularity is perfectly acceptable. Will review the list from that perspective.
While some of the queries (and metrics) are version specific, if my memory serves, most (all?) should work with version 10+ (but definitely should be tested, I have only tested these on the latest postgres) The GH links to queries explicitly lists versions tested with.

thiagoftsm commented 2 years ago

@shyamvalsan only to confirm, when you say latest postgresql, are you talking about lateststable or the beta? The mentioned versions are here.

shyamvalsan commented 2 years ago

Here's what I used. Looks like this isn't latest anymore :)

shyam_db=# SELECT version();
                                                           version
-----------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.4 (Debian 14.4-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

ilyam8 commented 2 years ago

1 second granularity, that will be overkill. For some of these metrics a 30s or even longer

Great, but it is not supported by go.d.plugin. Current implementation - all the data collection job charts have the same update every.

shyamvalsan commented 2 years ago

Ah, I remember hearing about this limitation. But what is the reason behind it? Just the way things have always been or any technical blockers?

sashwathn commented 2 years ago

@shyamvalsan : I had identified a couple of alerts and I am not sure if we have the necessary metrics for these:

[x] Too many sequential scans on the database (Suggestion to create indexes on frequently accessed tables)
[ ] Resource intensive queries > 10% (Amount of data written temporarily to disk to execute queries. Suggestion to check the work_mem)

Do we have a way of identifying the sequential scans and data written to disk by queries?

cc: @ilyam8 @thiagoftsm @vlvkobal

thiagoftsm commented 2 years ago

Hello @shyamvalsan,

Probably we can get them using pg_stat tables, but I am not sure what whether what we need is enabled by default.

shyamvalsan commented 2 years ago

@sashwathn @thiagoftsm I thought we already had sequential vs index scan ratio :) good catch that we missed it.

It's available in pg_stat_user_tablesat a per table level

postgres=# select relname,idx_scan, seq_scan from pg_stat_user_tables;
     relname      | idx_scan | seq_scan
------------------+----------+----------
 pgbench_tellers  |        0 |        1
 pgbench_accounts | 43884983 |        2
 pgbench_history  |          |        0
 pgbench_branches |        0 |     4391

If we just need the count of sequential scans then

postgres=# SELECT SUM(seq_scan) as seq_scans FROM pg_stat_user_tables;
 seq_scans
-----------
      4394
(1 row)

@sashwathn added number of sequential scans to the list of metrics to be added in description 👍

shyamvalsan commented 2 years ago

@sashwathn as for resource intensive queries, would this be what you are looking for?

anayrat commented 2 years ago

You shouldn't put bloat queries in this. They are pretty expensive and could overload big databases.

sashwathn commented 2 years ago

@sashwathn as for resource intensive queries, would this be what you are looking for?

@shyamvalsan Yes this is what I was looking for and it will be a good add on to our list of metrics where we can point to specific queries that cannot work with the current work_mem.

shyamvalsan commented 2 years ago

@anayrat do you mean we shouldn't query for bloat very often or not at all? we were thinking of allowing users to do a live query for top tables by bloat if they wanted to see it.

anayrat commented 2 years ago

@anayrat do you mean we shouldn't query for bloat very often or not at all? we were thinking of allowing users to do a live query for top tables by bloat if they wanted to see it.

Not often should be fine.

shyamvalsan commented 2 years ago

Ah, I remember hearing about this limitation. But what is the reason behind it? Just the way things have always been or any technical blockers?

Should we create a feature request to support multi granularity metrics per collector?

@ilyam8 ?

ilyam8 commented 2 years ago

pg_stat_user_tables contains data for the current database (the one we connected to). So, two questions:

Do we need per table charts (metrics from pg_stat_user_tables, labels "database", "table", "schema") ?
Do we need to collect table metrics from all databases? It means we need num of connection == num of databases.

Ah, I think I found the answer. It is yes.

ilyam8 commented 2 years ago

I added metrics that were kind of easy to understand for me. The rest metrics are questionable and I will postpone adding them until additional explanation/confirmation.

Number of slow queries (>10s, etc)

Not clear what we need here. pg_stat_activity contains both ended and running queries. I guess running?

Long running transactions count (older than 24 hours)

I don't understand where 24 come from, why not 12, and if that is even possible (24hr + transactions). Can we see a reason behind this request? Who does use it?

Count of table columns which are entirely NULL Count of table columns with only one value Count of tables not vacuumed in last 7 days

the rest.

Same question. Basically, why is this so important?

I am trying to avoid additional queries if we don't really need them.

shyamvalsan commented 2 years ago

@ilyam8 OK, will try to go metric by metric for the pending ones. (@sashwathn please add your thoughts as well)

Number of slow queries

The intent is to discover the prevalence of slow queries and to potentially alert users if there are a lot of slow queries so they can investigate further and take action.

As @thiagoftsm suggested above, a histogram view is the best option - we do NOT need separate metrics for slow queries of different duration. So in a single chart if we can show the following counts (queries <=10ms, queries >10ms && <=100ms, queries >100ms && <=1s, queries >1s && <=10s, queries>10s) this will cover our use-case.

Index cache hit ratio

Any confusion or questions here? this is distinct from overall buffer cache hit ratio.

Long running transactions count (older than 24 hours)

The 24 hours was an example/placeholder I used (my bad) - for a more reasonable number here we could look to our production database?

But it is important to monitor long running transactions since any rows updated in a transaction may cause other transactions to wait. Those transactions may also be holding locks, and now they are stalled, which may cause them to block others and degrading performance.

Count of table columns which are entirely NULL

Count of table columns with only one value

Count of unused indexes

All of the above are important to tell the user if there are potential optimizations that could be done to their databases. Are there columns that are not really needed? Are tables structured incorrectly? Have we created indexes we don't really need? This is a quick way to answer these questions.

Count of tables not vacuumed in last 7 days

Count of tables not analyzed in last 7 days

Age of oldest table vacuum

Age of oldest table analyze

We need some indication regarding whether vacuum and analyze are functioning as they are supposed to on this postgres cluster. These metrics will help us get an idea about this. We could potentially alert if age of oldest vacuum/analyze grow beyond a threshold.

Count of blocked queries

I think this was something that was present in the old Python collector but went away when we migrated to Go. We should monitor it if we can. As part of interactive queries we can show a tabular view of which pids are blocking which queries as well - but until we get there we can at least show a count.

Number of sequential scans

Sequential scans are slow, and especially for large databases we shouldn't be seeing a lot of them. A large spike or constant increase in sequential scans should be something the user should be concerned about.

Percentage of tables with bloat

Total size of table bloat

Percentage of total table size that is bloated

Percentage of indexes with bloat

Total size of index bloat

Percentage of total index size that is bloated

All of the above are related to bloat statistics - at a table level and at an index level. This is really important from a PostgreSQL monitoring perspective - the Percona guys found that we had serious bloat issues on our dbs while monitoring them. we definitely need a way to monitor this. And while we need to wait for interactive querying and tabular charts to show which tables have how much bloat - the very minimum we should be doing is telling the user how much overall bloat they have, how many of their tables and indexes are bloated.

ilyam8 commented 2 years ago

Number of slow queries

As I mentioned, there are running and ended queries. A query could have already ended. But the query_start is still in view. I don't know how to not count the same "slow" query several times.

Long running transactions count (older than 24 hours)

Ok, not needed then according to your explanation. I am not sure metrics with placeholders fall into the important category.

Count of table columns which are entirely NULL Count of table columns with only one value

According to the description that is something that shouldn't be collected every second, so skipping it.

Count of tables not vacuumed in last 7 days Count of tables not analyzed in last 7 days

We have per table last auto/manual vacuum, auto/manual analyze.

Age of oldest table vacuum Age of oldest table analyze

Still not clear why this is needed as a metric. If we need an alert - we apply it to every table.

Number of sequential scans

Been added.

Bloat stuff

Not added as per comment.

@shyamvalsan, is not really required, but grouping tasks and splitting them into several issues (e.g. issue1: all metrics that come from pg_stat_user_tables, issue2: metrics from pg_stat_user_indexes.) would help in closing issues. Otherwise, it may hang for a long time (like this one).

shyamvalsan commented 2 years ago

@ilyam8 fair point - will keep in mind when creating tickets next time. For now maybe we can break out certain metrics to a separate ticket?

As I mentioned, there are running and ended queries. A query could have already ended. But the query_start is still in view. I don't know how to not count the same "slow" query several times.

We shouldn't filter them out - we should show all of them, including ended queries, as long as there is no double counting.

As for the metrics we can't (or shouldn't) collect every 1 second let's create another ticket and move them there. We need to support multi granularity soon.

ilyam8 commented 2 years ago

as long as there is no double counting.

I don't know how to do that. For idle/idle-in-transaction – this is the last query that started at query_start and finished at state_change. Next time the query will return it again. So now() - query_start doesn't work for those states, only for active.

ilyam8 commented 2 years ago

SELECT SUM(idx_blks_read) AS idx_blks_read, SUM(idx_blks_hit) AS idx_blks_hit, SUM(idx_blks_hit) / GREATEST(SUM(idx_blks_hit) + SUM(idx_blks_read), 1)::float AS ratio FROM pg_statio_user_indexes;

Want to highlight this again - this query returns the overall cache hit rate. I believe you didn't mean it, but the data collection (per second or whatever update every is) hit rate.

shyamvalsan commented 2 years ago

@ilyam8 Now I am confused. These two are different. I assume the first one is what we are collecting today, I was asking for 2nd one to be added too.

Postgres cache hit ratio SELECT sum(heap_blks_read) as heap_read, sum(heap_blks_hit) as heap_hit, sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as ratio FROM pg_statio_user_tables;

Postgres index cache hit ratio SELECT sum(idx_blks_read) as idx_read, sum(idx_blks_hit) as idx_hit, (sum(idx_blks_hit) - sum(idx_blks_read)) / sum(idx_blks_hit) as ratio FROM pg_statio_user_indexes;

ilyam8 commented 2 years ago

What is 1st, and what is 2nd? All ratios we collect are not overall ratios but per update_every.

shyamvalsan commented 2 years ago

I don't know how to do that. For idle/idle-in-transaction – this is the last query that started at query_start and finished at state_change. Next time the query will return it again. So now() - query_start doesn't work for those states, only for active.

SELECT COUNT(*) query FROM pg_stat_activity WHERE now() - query_start > '0.1 seconds'::interval;

A query like this returns the current number of queries that matched the condition at any given instant - so why does query_start or state_change matter? Maybe I am not getting your concern with this one.

shyamvalsan commented 2 years ago

What is 1st, and what is 2nd? All ratios we collect are not overall ratios but per update_every.

1st = data you get from pg_statio_user_tables(db buffer cache hit) 2nd = data you get from pg_statio_user_indexes (index cache hit)

ilyam8 commented 2 years ago

SELECT COUNT(*) query FROM pg_stat_activity WHERE now() - query_start > '0.1 seconds'::interval;

The query doesn't work because of the reasons I mentioned above. We need an additional condition - the state should be active. Or should be a different metric.

ilyam8 commented 2 years ago

@shyamvalsan

this is the first time you mention pg_statio_user_tables. It is not in the OP.
my point was different: overall cache hit vs per update every cache hit.

ilyam8 commented 2 years ago

Also, both pg_statio_user_tables and pg_statio_user_indexes require addition connection - will be disabled by default, and collect from specific databases when enabled.

ilyam8 commented 2 years ago

For active queries running time I use the following query:

SELECT datname,
       EXTRACT(epoch from now() - query_start) as active_query_running_time
FROM pg_stat_activity
WHERE datname IS NOT NULL
  AND state = 'active'
  AND backend_type = 'client backend';

The default buckets are:

.1, .5, 1, 2.5, 5, 10

The buckets are user configurable.

shyamvalsan commented 2 years ago

@ilyam8 should the default aggregation be something other than avg for the active queries running time? 3.38 queries/second sounds a little odd.

Maybe SUM?

ilyam8 commented 2 years ago

It doesn't do any aggregation, it is Netdata interpolation for incremental metrics.

3.38 queries/second sounds a little odd

What is odd? Netdata does interpolation (a second boundary).

shyamvalsan commented 2 years ago

Yes, I meant the chart option for interpolation, is this something that needs to be set on FE or collector for a particular chart?

ilyam8 commented 2 years ago

@shyamvalsan it is how Netdata works. It calculates per second (0, 1, 2, 3, ...) rate for incremental metrics (all charts with */s prefix).

ilyam8 commented 2 years ago

SELECT SUM(idx_blks_read) AS idx_blks_read, SUM(idx_blks_hit) AS idx_blks_hit, SUM(idx_blks_hit) / GREATEST(SUM(idx_blks_hit) + SUM(idx_blks_read), 1)::float AS ratio FROM pg_statio_user_indexes;

The query is per table, schema, and index. I am not adding it, because I am not sure we need such details (per db, per schema, per table, per index).

Added collecting pg_statio_user_tables in netdata/go.d.plugin#808 that contains disk/memory reads from all indexes on a table. The view (per table) contains:

disk/memory reads.
index disk/memory reads.
toast disk/memory reads.
toast index disk/memory reads.

ilyam8 commented 2 years ago

Sharing this in case I made a mistake (not as a specialist).

Updated query running time in netdata/go.d.plugin#810 to

SELECT datname,
       EXTRACT(epoch from now() - xact_start)  as active_xact_running_time,
       EXTRACT(epoch from now() - query_start) as active_query_running_time
FROM pg_stat_activity
WHERE datname IS NOT NULL
  AND state = 'active'
  AND backend_type = 'client backend';

So we have both transactions and queries running time histogram charts. The default buckets for both are the same (.1, .5, 1, 2.5, 5, 10).

shyamvalsan commented 2 years ago

@ilyam8 Looks ok to me but need to see the values in an actually operational DB, was wondering why the values are same but it is expected in this scenario I think https://pgstats.dev/pg_stat_activity

ilyam8 commented 2 years ago

@shyamvalsan probably the query is wrong and we need to take into account not only active state for transactions, but idle in transaction and idle in transaction (aborted) too. At the moment I don't have enough understanding to confirm/deny that.

ilyam8 commented 2 years ago

@shyamvalsan we see only the snapshot but not the history between data collections - the number of transactions will be equal to the number of queries and that is expected (when using only active). And I just understand that units are misleading - it is not the number of queries/s or transactions/s (looks like N queries/transactions have been done) but the number of observations, because if the query/transaction is long-running we will be seeing it for N number of data collections.

ilyam8 commented 2 years ago

Updated the query in netdata/go.d.plugin#812 and now the number of observed transactions > queries is expected (because we track "idle in transaction" state).

netdata / netdata