Slow Performance for Busy Database

mikealbert commented 8 years ago

Hi,

Thanks for the great tool. We've recently deployed it to our Production environment and it has definitely helped us prioritize sql statements for tuning.

We've noticed that with one of our larger (20GB) and most active databases, performance of the POWA app suffers. In particular, the "Details for all queries" widget takes a very long time to display after an adjustment is made to the timeframe in the top right corner. In some cases it can take a couple minutes or longer for that data to display.

Let me know what other details I can provide to help troubleshoot. This only seems to be a problem with our most busiest database. We've had POWA deployed for only 5 days and the POWA database is 400MB.

Thanks!

rjuju commented 8 years ago

Hello,

Thank you very much.

First, can you confirm the version of powa that's installed ? Can you also specify which additional extensions are installed, with the version? Did you change any specific powa configuration? (powa.coalesce, powa.frequency...)

Can you also show what this query is returning:

select datname, count(*) from powa_statements s join pg_database d on d.oid = s.dbid group by 1 order by 2 desc;

Finally, is the display of the "Details for all queries" always slow, or only if you choose a bigger time interval? If the latter, what interval are you choosing?

Regards

mikealbert commented 8 years ago

The version of powa-web is 3.0.2. Here are the installed extensions and versions:

        name        | default_version | installed_version 
--------------------+-----------------+-------------------
 powa               | 3.0.0           | 3.0.0             
 plpgsql            | 1.0             | 1.0               
 pg_stat_kcache     | 2.0.2           | 2.0.2             
 btree_gist         | 1.0             | 1.0               
 pg_stat_statements | 1.2             | 1.2               
 pg_qualstats       | 0.0.7           | 0.0.7

None of the powa parameters are set in our postgresql.conf (other than shared_preload_libraries). Output of the query is below:

 powa=# select datname, count(*) from powa_statements s join pg_database d on d.oid = s.dbid group by 1 order by 2 desc;
 datname  | count
----------+-------
 foo      | 18245
 powa     |    55
 postgres |    41

So far I've mostly been playing with the default intervals. Here's how long each interval took to load:

1 hour  - 27 seconds
1 day   - 2 mins and 30 seconds
1 week  - 7 mins and 5 seconds

Thanks.

rjuju commented 8 years ago

So your database foo has 18k different normalized queries, which explains why it's slow. Does your application use 2PC (prepared transactions, Java XA...)?

mikealbert commented 8 years ago

I'll have to do some more investigation to see how/if we are using transactions for our Rails applications. One thing I noticed is that the bind values don't appear to be getting picked up by POWA. As an example, I see where foo = $1 and the substituted value in our postgres server log, but in POWA it come's across as where foo = ?

mikealbert commented 8 years ago

On second glance, it looks like some binds are showing up, but there are also ? too. Does that represent a hardcoded value in the statement?

mikealbert commented 8 years ago

I double checked and nearly every query from our app is executed as a prepared statement, and I found that the ? is indicative of a parameterized query coming from Ruby on Rails. I did some spot checking in the table powa_statements and confirmed we aren't seeing any literals passed in.

However, what I am seeing is several statements with slight variations which are therefore flagged as unique. As an example,

select foo from bar where x in (?);
select foo from bar where x in (?, ?, ?);
select foo from bar where x in (?, ?);

Providing an example from our environment, the query below for one of our most common tables (I removed the name) returns 1839 rows in powa_statements

select count(*) from powa_statements 
where dbid = '576782'
and query like 'SELECT  "t1".* FROM "t1"  WHERE "t1"."id" IN (%';

rjuju commented 8 years ago

Sorry for late answer.

Yes, the "?" are hardcoded values (normalized by pg_stat_statements), and the $x are the parameters of prepared queries.

So it seems that the root cause of slowness is the large amount of distinct normalized queries, caused by using "IN (...)" with different number of values. Unfortunately, this is an issue in pg_stat_statements, so there's nothing we can do for it in powa, except ignoring these statements :(

FTR, this has already been discussed on -hackers ML, see http://www.postgresql.org/message-id/CAM3SWZSpdPB3uErnXWMt3q74y0r+84ZsOt2U3HKKes_V7O+0Qg@mail.gmail.com (this is also an open TODO item).

mikealbert commented 8 years ago

Thanks for the response and pointing me to that thread on the mailing list. I'll definitely keep an eye on it.

Should the unique statements be purged as part of the config powa.retention? I've noticed that our powa database continues to grow in size even though we have none of the powa parameters set (and thus should be seeing only 1 day of retention).

Thanks again for the help.

rjuju commented 8 years ago

We already thought about cleaning powa_statements table, but it would probably cost more resource than it'd save. This table should always be pretty light, unless you run a lot of different normalized queries everyday. Every 100k saved distinct normalized queries should cost something like 20MB, indexes included. I can provide a query to do such a cleanup manually though.

About the general powa database size, I just checked the default settings, and it appears that a data purge is only performed every 3.4 days (wich will in this case purge 2.4 day of data). There's clearly a problem with this default settings, and we'll fix that soon. In the mean time, you can lower the coalesce parameter to 30, that should trigger a purge every ~ 24h.

For the technical details :

a snapshot is performed every powa.frequency interval (default 5 min)
these snapshots are aggregated every powa.coalesce time (default 100, so every 8 and a few hours)
data older than powa.retention are purged every 10 aggregation (hardcoded to 10).

mikealbert commented 8 years ago

I just checked our powa db and you're right, the powa_statements table is pretty small, so definitely not worth purging.

Looking at growth for the last couple weeks, we are definitely seeing an uptrend so any default purge processes don't seem to be making much impact.

Here are the largest objects in our powa db:

                                relation                                |  size
------------------------------------------------------------------------+---------
 public.powa_statements_history                                         | 295 MB
 public.powa_kcache_metrics                                             | 248 MB
 public.powa_all_relations_history                                      | 161 MB
 public.powa_qualstats_constvalues_history                              | 92 MB
 public.powa_qualstats_quals_history                                    | 51 MB

I'll adjust the coalesce parameter to see if that helps, but I wasn't sure if we were encountering something else. I realize this is a separate issue, so let me know if you'd like me to close out this issue and open a new one.

Thanks for all the help!

rjuju commented 8 years ago

Can you check the date of last maintenance on these tables ? For instance SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables WHERE relname like 'powa%';

Can you try a VACUUM FULL on powa_statements_history, and check the size before and after ? With default settings, all these tables should purged every few days, the autovacuum with default settings should be enough to avoid growing bloat.

mikealbert commented 8 years ago

I wasn't seeing any maintenance listed in pg_stat_user_tables. VACUUM FULL dramatically shrunk the tables. I noticed just a regular VACUUM didn't reclaim much space. Do you think it would make sense to just schedule the full vacuum to run against all the powa objects?

rjuju commented 8 years ago

Both last_vacuum and last_autovacuum was NULL ?

You can do VACUUM FULL on powa tables, but I'll investigate why autovacuum didn't trigger.

Did you change any powa setting? I assume you at least raised powa.retention.

mikealbert commented 8 years ago

Yeah, both columns were null when I checked. In the interim, we've added a cronjob to our POWA instances to VACUUM FULL the largest tables.

We are still using the default powa settings, so the only item set in our postgresql.conf is shared_preload_libraries.

rjuju commented 8 years ago

Ok, thanks for the feedback. The cron will work but I hope I'll have a better solution for you!

powa-team / powa

Slow Performance for Busy Database #68