superseriousbusiness / gotosocial

Fast, fun, small ActivityPub server.
https://docs.gotosocial.org
GNU Affero General Public License v3.0

[feature] Optimize S3 cache fixes by using `ListObjects` instead of `StatObject` #3482

Open tsmethurst opened 1 month ago

tsmethurst commented 1 month ago

Specifically for the media and emoji `fixCacheState` function called during cleanup, we call `c.state.Storage.Has` to check if each file that we should have stored actually exists on the storage backend, to do automatic healing of cache state.

This works great for filesystem storage where a stat call is relatively cheap, but for S3 it can result in many HTTP calls to the S3 backend in a row, which can be expensive both network-wise, and money-wise (as some S3 providers charge for the volume of such calls).

To make this process a little less resource-consuming, we should consider ways to optimize this, perhaps by making a small number of `ListObjects` calls to get the names of the files in the bucket, caching that list somewhere (in a temporary db table or in a file in `/tmp`), and using it as a reference to check whether an object exists in storage.

tsmethurst commented 1 month ago

From chat:

nki: Perhaps you can do the reverse of the attachment-on-db -> s3 lookup: list the objects and query for the media_attachment on the db instead? Might be cheaper, and since we have sql on the db side we can be more flexible

kim: yeah that's what i was thinking essentially, doing batches of list-objects and querying the database for them

nki: iirc you can only list like 1000 objects or something each call anyway, so it probably works out for batching