superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0
3.67k stars 310 forks source link

[performance] Experiment to see if it's possible to optimize database queries using views #374

Closed ForestJohnson closed 1 year ago

ForestJohnson commented 2 years ago

Any database query that has joins between tables might be a good candidate for a database view.

A database view is like a read-only de-normalized table that gets automatically updated when one of the tables that the view is generated from changes.

Here is an example for notifications:

Tobi described the process of selecting notifications like this:

https://github.com/superseriousbusiness/gotosocial/blob/main/internal/db/bundb/notification.go#L57-L108

[paraphrasing the linked code: select * from notifications where target_account_id = ? and id > ? and id < ?]

the table indexes on id, but not on target_account_id, so that would probably help i guess it's also just a bit slow because serializing a notification takes quite a few calls to the db as well you need to fetch the owning account, the target account, and the status the notification pertains to (if it's a status)

Forest sez:

This sounds insane to me. IMO the serialization of the entities should be stateless "pure function", the input to serialization should be DB rows and the output should be JSON or objects or whatever.

The problem is twofold:

  1. its doing a table scan; the where clause includes target_account_id but there's no index along target_account_id on the notifications table.
  2. we need information from the accounts table in order to display the notification (${owning_account.Username} liked your post!) and that information is not on the notification row
    • if it was, that'd mean the same information is written twice in different places, or "de-normalized"
    • de-normalizing by hand can bring problems to developers. for example, what happens if someone changes their username? do we have to update every notification related to them now?

We can fix part 1 by simply adding an index along target_account_id, id to the notifications table. But we can hit both birds with 1 stone with a view.
A view is a nice way to de-normalize because you can think of it as a glorified index, it does not need to be updated by the programmer, it should be updated automatically any time one of its underlying tables is changed.

https://www.tutorialspoint.com/sqlite/sqlite_views.htm

A view is also nice because views can be created, deleted, and updated without interfering with the data/schema in the tables at all, they are derived by the database engine from the view query + the data in the tables.

The query used to generate the view would look something like this:

select 
  notification.id
  notification.target_account_id,
  notification.owning_account_id,
  owning_account.username, owning_account.etc, # include all rows needed to display the notification.
  notification.etc # include all rows needed to display the notification.
from notification
join accounts as owning_account on notification.owning_account_id = owning_account.id
join accounts as target_account on notification.target_account_id = target_account.id
PRIMARY KEY [notification.target_account_id, notification.id]

this way when we run

select * from notification_view where target_account_id = ? and id > ? and id < ?

that where clause will map to a single region of bytes stored on disk in the notification_view the the returned rows will have everything needed to serialize the notification well, except for the status details. But it should be fine to do another request for the status details.

It will probably be better/faster to:

  1. grab all the notification_view rows, then
  2. make a list of the status IDs that you need, then
  3. send a single query to get all those statuses:

select * from status where (id = ? or id = ? or id = ? or id = ? or id = ? or ...)

or maybe

select * from status where id in ?

where ? is a list of IDs

Alternatively if the info needed from the status is not very much / not very large, it might be better to simply add it to the notification view:

select 
  notification.id
  notification.target_account_id,
  notification.owning_account_id,
  owning_account.username, owning_account.etc, # include all account rows needed to display the notification.
  notification.etc, # include all rows needed to display the notification.
  status.url, status.short_preview_text  # include all status rows needed to display the notification.
from notification
join accounts as owning_account on notification.owning_account_id = owning_account.id
join accounts as target_account on notification.target_account_id = target_account.id
left outer join status on notification.status_id = status.id
PRIMARY KEY [notification.target_account_id, notification.id]

IIRC the left outer join means that if notification.status_id is null, it wont omit that notification row, it will just select null for the rows on the status table.

tsmethurst commented 2 years ago

For now I managed to resolve some of the issues with database slowness by adding indexes in appropriate places (see #419), so that's part 1 of the fix. Still didn't experiment with views though so best to leave this issue open and come back to it if/when we need it.

goodfoo commented 1 year ago

I believe you're thinking of materialized views.

SQL light does not support materialized views, only "virtual" views which resolved to normal SQL on the base tables.

https://techdifferences.com/difference-between-view-and-materialized-view.html

tsmethurst commented 1 year ago

Since we haven't really needed to do this, and proper indexing solved a lot of issues we were having, i'm gonna close this (just doing some spring cleaning!)