superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0
3.82k stars 331 forks source link

[feature] split off database columns with low cardinality #3345

Open mirabilos opened 1 month ago

mirabilos commented 1 month ago

Is your feature request related to a problem ?

I’ve just seen this in the migration:

timestamp="2024-09-24T19:16:19.197Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'interaction_policy' to 'statuses'..."
timestamp="2024-09-24T19:16:19.201Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'pending_approval' to 'statuses'..."
timestamp="2024-09-24T19:16:19.203Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'approved_by_uri' to 'statuses'..."
timestamp="2024-09-24T19:16:19.205Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'pending_approval' to 'status_faves'..."
timestamp="2024-09-24T19:16:19.208Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'approved_by_uri' to 'status_faves'..."
timestamp="2024-09-24T19:16:19.210Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'interaction_policy_direct' to 'account_settings'..."
timestamp="2024-09-24T19:16:19.213Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'interaction_policy_mutuals_only' to 'account_settings'..."
timestamp="2024-09-24T19:16:19.215Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'interaction_policy_followers_only' to 'account_settings'..."
timestamp="2024-09-24T19:16:19.217Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'interaction_policy_unlocked' to 'account_settings'..."
timestamp="2024-09-24T19:16:19.219Z" func=migrations.init.73.func1.1 level=INFO msg="adding column 'interaction_policy_public' to 'account_settings'..."

The row length of the statuses table was already off the roof before, it’s going to be even more now even if four older columns were then dropped.

I naïvely estimate the cardinality of these very low: only a very small percentage of all posts from all instances will have interaction policies (other than the default) set, and for those who have it set, perhaps there’s even a common set (e.g. for incoming posts, the approvals columns will likely always be empty(?), and perhaps for one user, all of their posts will have the same interaction policy).

So maybe move them into a separate table and reference that table’s rows where needed, and perhaps even deduplicate (i.e. if the split-off table already has a matching row, use that instead of adding an identical one).

Perhaps other columns from all the tables in here could also be candidates for such splitting. Interaction policies are possibly a more interesting target because they wouldn’t need to be loaded for when just displaying a toot, I think… but even if things are going to be loaded, the DB can JOIN them in for you easily, so no extra queries, and with smaller rowlen the DB’s job gets easier in general as well.

Describe the solution you'd like.

See above.

Describe alternatives you've considered.

NONE

Additional context.

No response

tsmethurst commented 1 month ago

Ah yeah there's a few such tables we could probably split off in this way (as we did recently for the account settings). Will keep this in mind as an optimization.

NyaaaWhatsUpDoc commented 1 month ago

Yep definitely a good idea, thank you @mirabilos :). I already have some other optimizations in mind post v0.17.0 release cycle, as we still have a number of enum types being stored as strings in the database.

mirabilos commented 1 month ago

kim dixit:

Yep definitely a good idea, thank you @mirabilos :). I already have some other optimizations in mind post v0.17.0 release cycle, as we

Cool!

The removal of the post body content from boosts has (after a full vacuum) freed up about 1 GiB as well. Not sure whether moving the content-related columns to a different table would be worth it though, as PostgreSQL stores those TEXT fields out of line anyway.

still have a number of enum types being stored as strings in the database.

As to strings, I also have an idea for the future. I am considering writing a PostgreSQL extension to add a field type ULID, in a way that could be completely transparent to GtS, then I could switch all the PK and FK fields from CHAR(26) to ULID, which could be stored as 16 inline bytes, but GtS would not need to change anything. I am not entirely sure whether this is even possible, but if it is, both tables and indicēs would profit. I will let you know, but I don’t have a timeframe for this yet, either. It would be something the DB admin has to install separately, anyway, as C extension, so GtS would need to work without, too.

tsmethurst commented 1 month ago

Oooh that's exciting! 👀