Recent PoWA loses statistics on PostgreSQL 13 (but not 12)

Natureshadow commented 3 years ago

Hi,

on several independent setups, we are facing an issue with the statistics collected by PoWA

Setup

OS: Debian 10 (buster), amd64
PostgreSQL 13.1
PoWA 4.1.2
hypopg 1.1.4
qualstats 2.0.2
kcache 2.2.0

All packages installed from the pgdg (apt.postgresql.org) apt repositories.

Special configuration

Retention set to one year

Issue

We have been collecting data on systems with differing loads and reproducibly see statistics making unexpected changes. In particular:

Values that should only increase (e.g. call counts, run time) decrease randomly, sometimes after a few hours, sometimes on next page load in PoWA-Web
Predicates and most filtered/least filtered values in query optimisation are missing values (like, a query that definitely ran a few thousand time swith different predicates shows only one filtering value, and that one for both most and least filtering values)

Verification

The issue does not occur in PostgreSQL 12, on the same system with the same extension versions. Both PostgreSQL instances were tracked with the same PoWA-Web instance, so PoWA-Web does not seem to be the cause.

Cc: @MathisRudolf, @hansegucker, @debdolph

rjuju commented 3 years ago

Hi,

That's interesting. We did fix a similar bug recently, but that was on pg_wait_sampling data (see https://github.com/powa-team/powa-archivist/commit/8057613f345c03f6b9b6ab34506fe78e6fa6fea0). It could be something similar (erroneous JOIN condition somewhere), although it seem unlikely as this behavior was only seen for pg_wait_sampling. However, just in case do you have multiple postgres roles running the same queries?

According to you description, it happens on many data sources right? I believe it's happening on pg_stat_statements, is that frequent? Is the problem visible on all pages (like database-wide and per-query)? Is it constrained to some databases / queries or happening randomly?

Do you know if you have frequent eviction on pg_stat_statements entries? It's a bit hard to figure out right now (there was a patch committed recently that will help, but only starting on pg 14), but it the number of entries is close to pg_stat_statements.max this could be a sign.

We should first try to know if the problem is on the stored data or not. If you could find one occurrence of that issue relatively narrowed down (ideally a single query on a few hours interval) I could provide some query to verify that.

MathisRudolf commented 3 years ago

Hi,

However, just in case do you have multiple postgres roles running the same queries?

We tested with one role (postgres) and pgbench.

According to you description, it happens on many data sources right? I believe it's happening on pg_stat_statements, is that frequent? Is the problem visible on all pages (like database-wide and per-query)? Is it constrained to some databases / queries or happening randomly?

It seems randomly and database-wide. Do you know a way to confirm that it's happening on pg_stat_statements?

Do you know if you have frequent eviction on pg_stat_statements entries? It's a bit hard to figure out right now (there was a patch committed recently that will help, but only starting on pg 14), but it the number of entries is close to pg_stat_statements.max this could be a sign.

I doubt it, with just a few pgbench runs we didn't hit the limit of 5k and all statements are correctly tracked. With 130k calls we got 143 diffrent queries in pg_stat_statements.

We should first try to know if the problem is on the stored data or not. If you could find one occurrence of that issue relatively narrowed down (ideally a single query on a few hours interval) I could provide some query to verify that.

I did one test with pgbench yesterday. Maybe you can spot something?

2021-01-12 13-00-51:

2021-01-12 13-00-51

2021-01-12 14-33-24:

2021-01-12 14-33-24

rjuju commented 3 years ago

We tested with one role (postgres) and pgbench.

Ok, so one less possibility.

It seems randomly and database-wide.

ok

Do you know a way to confirm that it's happening on pg_stat_statements?

if any of the metrics reported by that extensions appear to be broken. Could be number of calls, shared block counters, local block counters... Some counters are kind of mixed with pg_stat_kcache though (for the hit ratio for instance), but that's done at the general query level, so if there's an issue with that extension it could probably affect pg_stat_statements counters too.

Does the issue happens quite frequently, or do you have to wait hours? Can you share some details about your server (number of cpu, memory, disk setup), and the exact command line you used with pgbench (and the scale factor for the init)? With a bit of luck I could reproduce the issue locally.

I did one test with pgbench yesterday. Maybe you can spot something?

Is the issue only happening on the grids, or also on the graphs? As far as I can tell you're using the default time interval (last hour), so depending on when the pgbench (and other queries activity) finished, the grid general counters can entirely change 1h30 apart.

Natureshadow commented 3 years ago

OK, let's write the story down as "@rjuju live-patched that selector into our PoWA instances", that reads a bit less dumb ;).

Really, that (obviously false) observation that the behaviour was different in PG 12 and 13 completely distracted us from such a simple mistake. I still do not know how that came to be — maybe just a nasty coincidence with the number of queries pgbench got through, or us focussing on PG 13 and spending more time there… I don't know.

In any case, forget it, and thanks for PoWA!

rjuju commented 3 years ago

Heh :)

powa-team / powa