Open sorbaugh opened 6 months ago
The big obvious candidate for sharding would be filecache, as it's both a very big table and used for almost any operation. It is also one of the more difficult tables to add sharding to, as various tables get joined onto it in queries, with operations like "get all files shared by a user" selecting a "random" set of rows without any regard for potential sharding keys (the storage id).
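To make the difficulty concrete, here is a minimal sketch of in-app routing keyed on the storage id. The modulo scheme, the shard count, and all names (`NUM_SHARDS`, `shard_for_storage`, `shards_for_storages`) are assumptions for illustration only, not something Nextcloud implements.

```python
NUM_SHARDS = 4  # assumed shard count, purely illustrative


def shard_for_storage(storage_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a storage id to a shard index (hypothetical modulo scheme)."""
    return storage_id % num_shards


def shards_for_storages(storage_ids: list[int], num_shards: int = NUM_SHARDS) -> set[int]:
    """Return every shard a query over these storages would have to visit."""
    return {shard_for_storage(s, num_shards) for s in storage_ids}


# A query scoped to one storage touches exactly one shard.
single = shards_for_storages([42])

# "All files shared by a user" can span many storages, so it fans out.
shared = shards_for_storages([3, 8, 13, 42])
```

A query that knows its storage id stays on one shard, while a query that selects rows across arbitrary storages may end up visiting all of them.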
I'm assuming that we want to do the sharding logic in-app without relying on any special sharding "magic" that individual databases might provide?
From my initial understanding we would essentially need to forbid any query from directly touching the filecache and instead come up with some other system that gives apps a way to load data from it in a way that allows sharding.
(There could be some leeway for querying the filecache directly as long as the query does an explicit filter on storage or perhaps on fileid.)
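One way such a rule could be enforced is a guard that inspects a query's equality filters before it is allowed to run. This is only a sketch: `SHARD_KEY_COLUMNS`, `has_shard_filter`, and `check_filecache_query` are hypothetical names, and whether a filter on fileid alone is enough depends on how a shard could be derived from it, which is still open.

```python
# Columns that could locate a shard (storage, or perhaps fileid).
SHARD_KEY_COLUMNS = {"storage", "fileid"}


def has_shard_filter(equality_filters: dict[str, object]) -> bool:
    """True if the query pins at least one column that can locate a shard."""
    return any(col in SHARD_KEY_COLUMNS for col in equality_filters)


def check_filecache_query(equality_filters: dict[str, object]) -> None:
    """Reject direct filecache queries that lack an explicit shard filter."""
    if not has_shard_filter(equality_filters):
        raise ValueError(
            "filecache query needs an explicit filter on storage or fileid"
        )
```

For example, `check_filecache_query({"storage": 5, "path_hash": "abc"})` passes silently, while `check_filecache_query({"mimetype": 3})` raises.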
Given the scope and difficulty of having to come up with a system for sharded access to the filecache that still allows "join like querying", and of moving over all apps, I'm not sure if the ~3 months is doable, especially since we probably don't want such a fundamental low-level change to land late in the dev cycle.
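One possible shape for "join like querying" is to resolve the file ids from the other table first, then fetch the matching rows from each shard and merge them in the app. The sketch below models shards as in-memory dicts; `Shard`, `load_by_fileids`, and the sample rows are hypothetical and only illustrate the fan-out-and-merge idea.

```python
# Each shard is modelled as {fileid: row}; a real shard would be a database.
Shard = dict[int, dict]


def load_by_fileids(shards: list[Shard], fileids: list[int]) -> list[dict]:
    """Emulate a join on fileid by asking every shard for the wanted ids."""
    wanted = set(fileids)
    rows = []
    for shard in shards:
        # Equivalent to "WHERE fileid IN (...)" executed per shard.
        rows.extend(row for fid, row in shard.items() if fid in wanted)
    # Sort so the merged result is stable regardless of shard order.
    return sorted(rows, key=lambda row: row["fileid"])


shards = [
    {1: {"fileid": 1, "storage": 4, "path": "files/a.txt"}},
    {2: {"fileid": 2, "storage": 5, "path": "files/b.txt"},
     7: {"fileid": 7, "storage": 5, "path": "files/c.txt"}},
]
# Step 1 (not shown): read the file ids from the joined table, e.g. shares.
shared_fileids = [7, 1]
# Step 2: fetch the matching filecache rows from the shards and merge.
result = load_by_fileids(shards, shared_fileids)
```

The cost of this approach is that every shard is queried when only the file id is known, which is part of why queries that filter only indirectly on the storage are the harder cases.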
From a search through the code I have checked out locally (core and a good number of apps, but not all) I've found the following queries that would be problematic to various degrees when trying to shard the filecache: https://gist.github.com/icewind1991/f9583eab9cf80455743812b22664ea0e
(Note that these are not all the queries that need adjustments, just the queries where adjusting them could cause issues.)
> I'm assuming that we want to do the sharding logic in-app without relying on any special sharding "magic" that individual databases might provide?
This is my assumption as well, since NC is database-agnostic and the expected behavior should be the same regardless of the underlying database. What do you think, @nickvergessen, @juliushaertl?
> Given the scope and difficulty of having to come up with a system for sharded access to the filecache that still allows "join like querying" and moving over all apps I'm not sure if the ~3 months is doable, especially since we probably don't want such a fundamental low-level change to land late in the dev cycle.
Let's break it down and start with the most obvious ones first, as I imagine that would bring in some additional insights that will help us further. Maybe with these: https://gist.github.com/icewind1991/f9583eab9cf80455743812b22664ea0e#queries-that-join-on-the-filecache-by-id-and-filter-indirectly-on-the-storage
Super helpful gist btw! :)
As bigger and bigger Nextcloud installations are being set up, we need to make efforts and think about how to make Nextcloud more scalable and performant. We should look into database sharding to allow the database to handle larger amounts of data, higher number of transactions, etc.
As a first step it would be good to identify which tables would be best suited for sharding.
EDIT: Linking @icewind1991's gist with the filecache query analysis: https://gist.github.com/icewind1991/f9583eab9cf80455743812b22664ea0e
Subtasks
oc_filecache access