superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[bug/performance] Slow Postgres `COUNT(*)` queries #1887

Closed: mirabilos closed this issue 6 months ago

mirabilos commented 1 year ago

My instance is now up for over a year, but some things have become really slow. For example, the Pinafore frontend times out 3–5 times trying to load my followers (~160) or followees (~200). These classes of slowness seem to be database-related.

I notice from the GtS log that you use many COUNT(*) calls, which on PostgreSQL are necessarily slow due to its data consistency guarantees.

Looking at the slow queries, many of them are…

SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '…')

… which appears to count the number of posts made by each account.

Which Pinafore doesn’t even display.

Which GtS cannot even get right, because their old toots from before I followed them aren’t backfilled.

So please just use one of the estimate alternatives for that. The number will be slightly off, but not by much, and given the above I doubt anyone would complain overly seriously.

(My statuses table is 375k rows, and indices help somewhat, but not for this.)
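
For reference, the usual estimate trick reads the planner statistics instead of scanning anything. A minimal sketch, which admittedly only covers a whole-table count (not a per-account one) and whose accuracy depends on how recently ANALYZE or autovacuum ran:

SELECT reltuples::bigint AS estimated_rows  -- planner's row estimate, no scan at all
FROM pg_class
WHERE relname = 'statuses';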

darrinsmart commented 1 year ago

I would have thought PostgreSQL would still be quite fast for that query and row count.

My instance uses sqlite, but as a test I copied the database to PostgreSQL to try the same type of query. It was quick - less than 1ms.

darrin=# explain analyze select count(*) from statuses as status where status.account_id='012345';
                                                                        QUERY PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=75.12..75.13 rows=1 width=8) (actual time=0.634..0.634 rows=1 loops=1)
   ->  Index Only Scan using statuses_account_id_idx on statuses status  (cost=0.42..68.29 rows=2735 width=0) (actual time=0.036..0.376 rows=2855 loops=1)
         Index Cond: (account_id = '012345'::bpchar)
         Heap Fetches: 0
 Planning Time: 0.132 ms
 Execution Time: 0.654 ms
(6 rows)

(I've obfuscated the actual account_id, but I picked one with a lot of statuses. My "statuses" table has 512k rows.)

Are you able to connect to your database with psql and run the query with "explain analyze"?

tsmethurst commented 1 year ago

Which Pinafore doesn’t even display.

Unfortunately it's part of the Mastodon API, so we don't really have much of a choice, since some clients do display it... Personally I'd prefer to only show a 'detailed' view of that account when the account itself is viewed specifically (rather than just the account model that's pinned to statuses in timelines etc.), but diverging from the Mastodon API like that will doubtless break client expectations in ways we can't predict.

We could look at caching those account stats somewhere and invalidating them occasionally so that they remain somewhat up to date (@NyaaaWhatsUpDoc what do you think?), but you'd still have the same issue with slow queries from a cold cache.

Out of interest, which version of Postgres are you using? And which version of GtS?

mirabilos commented 1 year ago

Unfortunately it's part of the Mastodon API

Yeah, grml… ok. But please put in an estimate instead. The linked document explains both why COUNT(*) is necessarily slow in PostgreSQL and ways to estimate.

Out of interest, which version of Postgres are you using? And which version of GtS?

PostgreSQL 13.11 (Debian bullseye), GtS 0.9.0 git-282be6f

Are you able to connect to your database with psql and run the query with "explain analyze"?

Sure.

 Aggregate  (cost=18763.35..18763.36 rows=1 width=8) (actual time=709.544..709.549 rows=1 loops=1)
   ->  Bitmap Heap Scan on statuses status  (cost=101.31..18745.00 rows=7340 width=0) (actual time=7.790..705.969 rows=7435 loops=1)
         Recheck Cond: (account_id = '…'::bpchar)
         Heap Blocks: exact=3339
         ->  Bitmap Index Scan on statuses_account_id_idx  (cost=0.00..99.47 rows=7340 width=0) (actual time=4.000..4.002 rows=7437 loops=1)
               Index Cond: (account_id = '…'::bpchar)
 Planning Time: 4.946 ms
 Execution Time: 709.847 ms

This is of course faster than the first time it was attempted, by GtS itself just seconds before:

@40000000648886540e333f0c timestamp="13/06/2023 15:07:54.237" func=bundb.queryHook.AfterQuery level=WARN duration=3.991218108s query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '…')" requestID=… msg="SLOW DATABASE QUERY"

Of course, now I can’t easily reproduce a 3000+ ms query in psql… but I can retry the exact same one later, after I haven’t been using GtS for a bit (or have been using it a lot without looking at accounts, or something).

mirabilos commented 1 year ago

Oh, hmm. Estimating with a filter (on account) is of course not as easily possible.

I’d settle for a hack: a config option that makes it just return 0 posts for everyone.

Can GtS distinguish between clients enumerating my followers/followees vs. visiting an individual account’s profile? (Probably not if I know my luck.)

Or maybe cache an estimated count with the profile, updated only every once in a while (maybe daily)… (Unsure whether that would be better: do rows in the accounts table get updated often, or do they stay frozen? I run a daily VACUUM FULL FREEZE ANALYZE.)

mirabilos commented 1 year ago

OK, here it is again, this time slow:

psql (13.11 (Debian 13.11-0+deb11u1))
Type "help" for help.

gotosocial=# EXPLAIN ANALYZE SELECT count(*) FROM "statuses" AS "status" WHERE ("status"."account_id" = '…');
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=19255.10..19255.11 rows=1 width=8) (actual time=3293.855..3293.859 rows=1 loops=1)
   ->  Bitmap Heap Scan on statuses status  (cost=107.22..19236.13 rows=7587 width=0) (actual time=8.043..3289.198 rows=7449 loops=1)
         Recheck Cond: (account_id = '…'::bpchar)
         Heap Blocks: exact=3349
         ->  Bitmap Index Scan on statuses_account_id_idx  (cost=0.00..105.33 rows=7587 width=0) (actual time=6.036..6.037 rows=7450 loops=1)
               Index Cond: (account_id = '…'::bpchar)
 Planning Time: 4.214 ms
 Execution Time: 3294.148 ms
(8 rows)

Hrrrm. It does use the index, but it’s still slow, given that it has to recheck whether the rows are actually visible. And I guess the 1 GiB RAM VM, even though it runs almost nothing but GtS, is reaching its limits.

Estimating is probably not possible. Caching the number… unsure. Reporting just 0 toots for everyone as a workaround is still something I’d use on my instance, even if it’s technically wrong, because the numbers usually are wrong anyway.

tsmethurst commented 1 year ago

Ah thanks for catching it again! I still feel like 1GiB should be more than enough, so I reckon we've gotta do some creative thinking about performance improvements for this one.

mirabilos commented 1 year ago

tobi dixit:

Ah thanks for catching it again! I still feel like 1GiB should be more than enough, so I reckon we've gotta do some creative thinking about performance improvements for this one.

Aye. I was thinking up a few (see earlier posts) plus a hack allowing GtS to just lie about it, as a config option.

The problem with COUNT(*) is that it still has to check the table pages after the index scan, to see whether the rows returned by the index are indeed still visible to the current transaction. (This even came up recently on a PostgreSQL mailing list.) If the rows are spread widely across the statuses table, this means a lot of I/O and cache thrashing.
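
As an illustration only, the catalogs show how much of the table the visibility map covers; only pages marked all-visible let an index-only count skip the heap, and the "Heap Fetches" counts in the earlier plans come from rows whose pages were not:

SELECT relpages, relallvisible,  -- pages in the table vs. pages marked all-visible
       round(100.0 * relallvisible / greatest(relpages, 1), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'statuses';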

bye, //mirabilos

barrowsys commented 8 months ago

Having the same issue, but when requesting v1/timelines/home and v1/notifications. I've got 988k rows in my statuses table for a single-user instance that's been running for a year and a half. 30-40 seconds to pull notifications obviously times out and makes the server unusable. The DB is on spinning rust, but that shouldn't cause this much of an issue.

Relevant logs:

level=WARN duration="6.01489226s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01SQD2')" requestID=29 msg="SLOW DATABASE QUERY"
level=WARN duration="9.54336619s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01CS73')" requestID=29 msg="SLOW DATABASE QUERY"
level=WARN duration="5.73583652s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01R4SA')" requestID=29 msg="SLOW DATABASE QUERY"
level=WARN duration="6.52737982s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01CS73')" requestID=29 msg="SLOW DATABASE QUERY"
level=WARN duration="6.16852539s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01W1YB')" requestID=29 msg="SLOW DATABASE QUERY"
level=WARN duration="1.83213440s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01HH26')" requestID=29 msg="SLOW DATABASE QUERY"
level=INFO latency="38.09689888s" userAgent="semaphore firefox ubuntu" method=GET statusCode=200 path=/api/v1/notifications requestID=29 msg="OK: wrote 4.00kiB"

level=WARN duration="2.69849107s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01SQD2')" requestID=24 msg="SLOW DATABASE QUERY"
level=WARN duration="3.83362171s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01SFWT')" requestID=24 msg="SLOW DATABASE QUERY"
level=WARN duration="18.1504706s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01CS73')" requestID=24 msg="SLOW DATABASE QUERY"
level=WARN duration="1.07998329s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01SQD2')" requestID=24 msg="SLOW DATABASE QUERY"
level=WARN duration="1.82932858s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01CS73')" requestID=24 msg="SLOW DATABASE QUERY"
level=INFO latency="29.98062525s" userAgent="tusky android" method=GET statusCode=200 path=/api/v1/notifications requestID=24 msg="OK: wrote 4.39kiB"

level=WARN duration="3.55005254s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01W1YB')" requestID=7z msg="SLOW DATABASE QUERY"
level=WARN duration="2.80329376s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01HH26')" requestID=7z msg="SLOW DATABASE QUERY"
level=WARN duration="21.3455278s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01R4SA')" requestID=7z msg="SLOW DATABASE QUERY"
level=INFO latency="30.01671127s" userAgent="tusky android" method=GET statusCode=200 path=/api/v1/notifications requestID=7z msg="OK: wrote 4.00kiB"

And for good measure, I EXPLAIN ANALYZEd one of the account IDs that has shown up repeatedly:

gotosocial=> explain analyze SELECT count(*) FROM statuses AS status WHERE status.account_id='01SQD2';
                                                                           QUERY PLAN                               
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4827.90..4827.91 rows=1 width=8) (actual time=2846.374..2846.375 rows=1 loops=1)
   ->  Index Only Scan using statuses_account_id_idx on statuses status  (cost=0.42..4823.45 rows=1780 width=0) (actual time=32.259..2845.725 rows=1520 loops=1)
         Index Cond: (account_id = '01SQD2'::bpchar)
         Heap Fetches: 885
 Planning Time: 0.236 ms
 Execution Time: 2846.430 ms
(6 rows)

gotosocial=> SELECT count(*) FROM statuses AS status WHERE status.account_id='01SQD2';
 count 
-------
  1520
(1 row)

gotosocial=> explain analyze SELECT count(*) FROM statuses;
                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=90399.94..90399.95 rows=1 width=8) (actual time=19776.636..19786.186 rows=1 loops=1)
   ->  Gather  (cost=90399.72..90399.93 rows=2 width=8) (actual time=16679.696..19786.165 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=89399.72..89399.73 rows=1 width=8) (actual time=17446.534..17446.537 rows=1 loops=3)
               ->  Parallel Index Only Scan using statuses_local_idx on statuses  (cost=0.42..88369.49 rows=412093 width=0) (actual time=15.383..17409.784 rows=329383 loops=3)
                     Heap Fetches: 755086
 Planning Time: 0.231 ms
 Execution Time: 19804.337 ms
(9 rows)

gotosocial=> SELECT count(*) FROM statuses;
 count  
--------
 988150
(1 row)

The queries do get faster if they've just been run, but this Postgres DB does enough other stuff that the results usually don't stay cached for very long. Occasionally, if the rest of the server settles, I can manage to grab notifications on the second or third try… which is still a poor experience. I could fiddle with the database to try to speed this up, but that seems like a fool's errand if this is a case of Postgres being asked to do something it isn't designed to do.

tsmethurst commented 8 months ago

It's berserk to me that postgres struggles so hard with this when sqlite handles it no problem. Very frustrating.

As stated above, we can't use the estimate function because we need to add a WHERE clause to select the appropriate account ID. And besides, the estimate function apparently relies on ANALYZE or VACUUM having been run recently, which we can't guarantee, since some DB admins might have it turned off for whatever reason. And having a config option that just ignores counts is hacky and bad; I don't think that's worth considering.

I'll have another look around when time permits and see if there's another approach that would work. Other sensible suggestions welcome in the meantime.

mirabilos commented 8 months ago

Ezra Barrow dixit:

Execution Time: 19804.337 ms

Huh, that’s impressive.

I assume you VACUUM ANALYZE regularly.

What saved my instance was a move to a differently spec’d VM. It’s got 4 GiB RAM instead of just one, which I assume makes all the difference, but it’s also got SSD-backed storage instead of a rather slow NAS, and on the original one there was too much background activity after all (including an I/O-heavy daily cron job that tar’d up gigabytes).

Not sure if that’s a possibility for you.

From the application-developer side, I’d probably put the counts into a table of their own and either update them as statuses come in or, preferably, update them only occasionally and serve “incorrect” (too low) counts. (Then also save the max ID at the time of the last update, so the next refresh only needs to count rows added since then, and, to account for deletions, only do full recounts very occasionally or as an idle task when there is currently no activity.) That needs quite some dev effort, of course… so I can understand this not getting done any time soon. Just faking the numbers, as a config option for instances whose users are OK with just getting 0, is probably easier to do.
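
A rough sketch of that idea in SQL, with all names made up purely for illustration (this is not a schema GtS actually has), and assuming the string IDs sort chronologically so a “newer than” comparison works:

-- Hypothetical per-account stats table.
CREATE TABLE account_status_counts (
    account_id              TEXT PRIMARY KEY,
    status_count            BIGINT NOT NULL DEFAULT 0,
    last_counted_status_id  TEXT,              -- high-water mark of the last refresh
    refreshed_at            TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Occasional incremental refresh for one account: count only statuses added
-- since the high-water mark; deletions get picked up by a rarer full recount.
UPDATE account_status_counts AS c
SET status_count           = c.status_count + sub.added,
    last_counted_status_id = sub.max_id,
    refreshed_at           = now()
FROM (
    SELECT count(*) AS added, max(id) AS max_id
    FROM statuses
    WHERE account_id = '…'
      AND id > coalesce((SELECT last_counted_status_id
                         FROM account_status_counts
                         WHERE account_id = '…'), '')
) AS sub
WHERE c.account_id = '…'
  AND sub.max_id IS NOT NULL;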

bye, //mirabilos

mirabilos commented 8 months ago

tobi dixit:

It's berserk to me that postgres struggles so hard with this when sqlite handles it no problem. Very frustrating.

Different base design choices.

The problem here is that when you ask PostgreSQL for an exact number, it goes through all the pain of giving you that exact number, precisely and correctly. This means not only scanning the index, but then also checking the pages on disc to see whether the tuples are indeed still alive, since a different transaction from another session could have committed a deletion recently. That all comes from being a proper database server with MVCC and a historical focus on correctness over speed.

SQLite does not even have a concept of multiple sessions, or of clients in general, because it runs embedded in one process. (I’ve seen webapp devs use it, become curious about long wait times, and find out they were waiting for another process to unlock (by closing) the database file.)

If you could keep these counts as a simple numeric field associated with the relevant user, that would probably work best. (While not an experienced DBA, I’d be leaning towards putting it into a different table than accounts because that data changes vastly more often than the fields in the accounts table.)

bye, //mirabilos

tsmethurst commented 8 months ago

Multiple processes can open an sqlite db file at the same time, and indeed this is what we do in GtS with WAL mode and multiple open connections. https://www.sqlite.org/faq.html#q5

But anyway that's beside the point of this issue.

@barrowsys Perhaps you could give more information about your setup? Is the database running on a separate machine from GtS, for instance? Is it possible that the Postgres in-memory caches are getting put into swap memory?

barrowsys commented 8 months ago

Postgres and GtS are on the same machine, along with lots of other DB clients. It's a mildly busy box; the caches likely get cleared by nature of that. We do have plenty of memory free: we only use a portion of the 16G for applications, and the OS does as it wishes with the rest.

It seems silly to need a dedicated Postgres install for a single-user instance, especially when the only reason is a field rarely used by clients.

Another potential solution: provide the post counts on a best-effort basis? Add a timeout to the count queries and return 0 or -1 if they don't come back in time, so the user still knows there's an error, but their basic requests don't time out.
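
For illustration at the database level only (in GtS itself this would presumably be a deadline in the query layer rather than a Postgres setting), a per-statement cap could look like:

BEGIN;
SET LOCAL statement_timeout = '500ms';  -- this count either returns quickly or errors out
SELECT count(*) FROM statuses AS status WHERE status.account_id = '…';
COMMIT;  -- on timeout the error aborts the transaction and the app could fall back to 0 / -1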

@mirabilos

I assume you VACUUM ANALYZE regularly.

Nothing beyond the default Postgres configuration (which should autovacuum?). Just a big dedicated server running nowhere near capacity. Trying a VACUUM ANALYZE right now; I'll report back if that changes anything.
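
A quick way to check whether autovacuum/autoanalyze have actually been touching the table, and how many dead tuples are currently lying around:

SELECT last_vacuum, last_autovacuum, last_analyze, last_autoanalyze,
       n_live_tup, n_dead_tup            -- live vs. dead tuple estimates
FROM pg_stat_user_tables
WHERE relname = 'statuses';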

tsmethurst commented 8 months ago

Mmm, okay, that all sounds fine indeed. That's similar to the gts.superseriousbusiness.org setup, which also has a bunch of stuff running on it (including other DB clients). And the drive itself? Is that an SSD? And what's the swappiness of the machine? Normally for a server running a database you'll want a swappiness of 10 or less. We've seen in the past that people running with the default swappiness of some OSes get their database performance tanked as the OS swaps database caches from memory into the swap file, which is much, much slower.

Just to clarify btw, I'm not saying we don't also have work to do to tune these queries and stuff, just looking to see if maybe there's a bottleneck on the deployment that can also be eased in the meantime.

barrowsys commented 8 months ago

Drive is two HDDs in RAID 0. Swappiness is 10. The HDDs are somewhat old, but we're still a few months out from being able to replace them.

Running a VACUUM ANALYZE statuses; does seem to have alleviated the issue enough to be usable. Timeline and notification requests generally return in reasonable time now. No clue whether it was the VACUUM or the ANALYZE that did it, but IIUC the default autovacuum settings should be keeping the table vacuumed, so presumably the ANALYZE? Even a straight count(*) FROM statuses completes in 800 ms when not recently cached.

There are still several related log warnings from the last 12 hours, but their frequency and severity are reduced. Still over a second for these single queries, but this is every warning in the logs from that time period rather than a small selection.

timestamp="09/02/2024 00:42:34.122" duration="1.86s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"local\" = TRUE)" requestID=r30 msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 01:10:21.846" duration="1.12s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01Q3T1')" requestID=acg msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 01:14:43.673" duration="1.20s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01R4SA')" requestID=s4g msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 01:50:29.739" duration="1.02s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"local\" = TRUE)" requestID=830 msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 02:01:12.212" duration="1.20s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01A12Q')" requestID=jhg msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 02:30:16.433" duration="1.32s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01VEHR')" requestID=heg msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 07:25:41.718" duration="1.14s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01CS73')" requestID=81g msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 09:38:06.392" duration="1.03s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01FKNM')" requestID=2pg msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 11:42:38.726" duration="1.30s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01FKNM')" requestID=5qg msg="SLOW DATABASE QUERY"
timestamp="09/02/2024 12:48:36.117" duration="1.35s" query="SELECT count(*) FROM \"statuses\" AS \"status\" WHERE (\"status\".\"account_id\" = '01YYBP')" requestID=1q0 msg="SLOW DATABASE QUERY"

Thanks for the clarification, and sorry for my hostility.

mirabilos commented 8 months ago

Ezra Barrow dixit:

Running a VACUUM ANALYZE statuses; does seem to have alleviated the issue enough to be usable. Timeline and notification requests generally return in reasonable time now. No clue whether it was the VACUUM or the ANALYZE that did it

Hmm hmm.

, but IIUC the default autovacuum settings should be keeping the table

I’m not sure how many people trust the autovacuum though.

vacuumed, so presumably the ANALYZE? Even a straight count(*) FROM statuses completes in 800 ms when not recently cached.

Maybe the compaction.

AIUI, the index tells it which rows to consider, and then it needs to check the rows themselves to see whether they are still alive; if they are packed together more tightly, that’s fewer pages to check.

Which brings me to another idea: have a table with just an FK on accounts and an FK on statuses, and run the count queries on that. (Unfortunately these can’t be BIGINTs, because GtS uses strings as PKs; otherwise the effect would be even better.)

But that would be even more effort to keep in sync than storing a per-account number or two. OK, so forget it…

Meow, //mirabilos

NyaaaWhatsUpDoc commented 8 months ago

The PR I pushed should help, but beyond a certain point it feels like we're just providing workarounds for suboptimal hosting conditions for a GoToSocial Postgres setup. Due to the nature of it being out-of-process, and being optimized more for scaling up / horizontally (as evident from how it does count queries), and with how important database latency is to GoToSocial, Postgres installations will naturally require more CPU grunt and/or more connection parameter tuning to get it working optimally.

tsmethurst commented 8 months ago

Ah it's cool that the query times are reduced even on HDDs with the ANALYZE :)

What we found with SQLite recently is that unless you do regular ANALYZEs, the query planner sometimes struggles to work out the best way to perform a certain query, because it simply doesn't have enough information to form a good picture of the table. We've now implemented regular ANALYZE calls in SQLite to alleviate this.

But of course that doesn't solve the Postgres-specific problem; running ANALYZE regularly on Postgres is something that DB operators need to schedule themselves, I believe. (Can't remember where we talked about this last, but iirc that's the conclusion we came to.)

So yes anyway, I'm with @NyaaaWhatsUpDoc in the sense that at some point we can't really do anything else but implement workarounds. Inspired by this issue we have a pull request open now that should alleviate some of these problems: https://github.com/superseriousbusiness/gotosocial/pull/2620

But beyond that, the only sensible step (as @mirabilos suggested) would be to rearchitect the database schemas in order to store counts and stuff in an account_stats table or something like that, to avoid running into these slow COUNT query issues. That seems like a decent approach, but it's not a priority immediately.
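
One way such a stats table could be kept updated inline, sketched with database triggers purely for illustration (the names reuse the hypothetical account_status_counts table from earlier in the thread; a real implementation would more likely update the counts from GtS's own query layer):

-- Illustrative only: keep the hypothetical per-account count in step with the
-- statuses table via an AFTER trigger, instead of periodic refreshes.
CREATE OR REPLACE FUNCTION bump_account_status_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO account_status_counts (account_id, status_count)
        VALUES (NEW.account_id, 1)
        ON CONFLICT (account_id)
        DO UPDATE SET status_count = account_status_counts.status_count + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE account_status_counts
        SET status_count = greatest(status_count - 1, 0)
        WHERE account_id = OLD.account_id;
    END IF;
    RETURN NULL;  -- AFTER triggers ignore the return value
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER statuses_count_trigger
    AFTER INSERT OR DELETE ON statuses
    FOR EACH ROW EXECUTE FUNCTION bump_account_status_count();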

mirabilos commented 8 months ago

kim dixit:

The PR I pushed should help, but beyond a certain point it feels like we're just providing workarounds for suboptimal hosting conditions for a GoToSocial Postgres setup. Due to the nature of it being

Yes and no.

out-of-process, and being optimized more for scaling up / horizontally (as evident from how it does count queries), and with how important database latency is to GoToSocial, Postgres installations will naturally require more CPU grunt and/or more connection parameter tuning to get it working optimally.

Developers always have to program to the database, up to a point. You couldn’t use this schema with, say, a column-based DB or a document store either. So accounting for the fact that COUNT is naturally slow on PostgreSQL at the sizes GtS reaches needs to be done, ideally by keeping the counts as a data field and then perhaps lazily updating it.

tsmethurst commented 8 months ago

@mirabilos I believe that's the second or third time in this thread now that you've suggested the same thing. The point has been taken already.

mirabilos commented 8 months ago

tobi dixit:

But of course that doesn't necessarily solve the Postgres problem; running ANALYZE regularly on Postgres is something that DB operators need to schedule themselves, I believe.

I think so as well. I even do FREEZE vacuums so the blocks get packed better (though I stopped regularly doing FULL, as that locks the tables for a long time and is not necessary that often):

$ cat /etc/cron.d/zzzlocal
2 7 * * *  postgres  /usr/bin/env LANG=C LC_CTYPE=C LC_ALL=C psql -c 'VACUUM FREEZE ANALYZE;' gotosocial 2>&1 | logger -t gtscron-vacuum

That seems like a decent approach, but it's not a priority immediately.

Fully agreed.

In the longer term, several storage-related cleanups would be welcome, as we’ll hit scaling issues eventually. But for now, and with the duplicate indexes having been removed, things can work, just not with only a GiB or so of RAM and little concurrent I/O, and not with the DB on slow storage (you’ve already added a warning about that Hetzner thing, but some VM providers have naturally slower backing storage for VM HDDs than others).

barrowsys commented 8 months ago

After giving it some more time, the issue has managed to come up again even with the VACUUM ANALYZE. It can still take a few tries to get anything back after a while offline, but once the various caches get hot they stay hot long enough to be usable. I appreciate the help; let me know if there's any other data I can gather to help get this solved.

tsmethurst commented 8 months ago

the issue has managed to come up again

Ah that's a pity. I'm not sure what else could be done at this point, though this will be alleviated somewhat in 0.14.0 with the PR mentioned above.

Anyway, I'll make a separate issue (tomorrow, probably) to remind us to -- at some point -- move account stats into a separate table, and will link back to this issue there.