powerhouse-inc / document-drive

GNU Affero General Public License v3.0

Queues rework #236

Open CallmeT-ty opened 2 months ago

CallmeT-ty commented 2 months ago

The current implementation of the queues has some issues:

Horizontal scaling

Parallelisation and offloading CPU-heavy work to background threads/processes will allow the Document Drive to scale. The main thread shouldn't block because 10k operations are being serialized and hashed, and the queues seem like the best place to add this. Due to the nature of the sync mechanism, even if there are thousands of operations waiting to be added to one document, operations on a different document can be performed at the same time. Synchronization units are therefore a prime target for work distribution, but they have dependencies between each other: for example, CREATE_DOCUMENT needs to wait for ADD_FILE on the Document Drive queue.
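As a minimal in-memory sketch of this idea (the `DocumentQueue` class and `Job` shape below are illustrative, not the actual Document Drive API): each document gets its own queue whose jobs run sequentially, different queues run in parallel, and a job can declare a dependency on a job from another queue, e.g. CREATE_DOCUMENT waiting on the drive queue's ADD_FILE.

```typescript
// Hypothetical sketch: one queue per document, with cross-queue dependencies.
type Job = {
  kind: string; // e.g. "ADD_FILE", "CREATE_DOCUMENT"
  run: () => Promise<void>;
  dependsOn?: Promise<void>; // a job from another document's queue
};

class DocumentQueue {
  private tail: Promise<void> = Promise.resolve();

  enqueue(job: Job): Promise<void> {
    // Jobs on the same document run one after another; jobs on
    // different DocumentQueue instances can run concurrently.
    const result = this.tail.then(async () => {
      if (job.dependsOn) await job.dependsOn; // wait for the cross-queue dependency
      await job.run();
    });
    this.tail = result.catch(() => {}); // keep the queue alive after a failed job
    return result;
  }
}
```

With two queues, enqueueing ADD_FILE on the drive queue and passing its promise as `dependsOn` to CREATE_DOCUMENT on the document queue guarantees the ordering without blocking unrelated documents.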

Multiple instances of the host app (Switchboard) should only be needed when code external to the Document Drive, like GraphQL request handling, becomes the bottleneck.

Optimistic concurrency control


Prisma has a good write-up on this: https://www.prisma.io/docs/orm/prisma-client/queries/transactions#read-modify-write Currently we use pessimistic concurrency control: a worker waits for a lock on the DB and performs all the work inside a DB transaction to make sure it is accessing the most up-to-date data. This creates long-lived locks and adds load to the DB.

Taking an optimistic approach, a worker reads the latest state (without locking), applies the new operations, and tries to write them to the DB. If other operations were added in the meantime, the write fails and the worker simply redoes the process on the new state. This way the CPU-heavy tasks, like serializing and hashing the state, reshuffling operations, and running the reducers, are performed on background workers.

We control what operations are applied using the queue manager, so conflicts should be rare.
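The read-modify-write loop can be sketched like this (a hedged, in-memory stand-in: `compareAndSwap` plays the role of Prisma's conditional update on a version/revision column, and all names are illustrative):

```typescript
// Stand-in for the Postgres table: each document has a revision counter.
interface DocRecord { revision: number; operations: string[] }

const db = new Map<string, DocRecord>();

// Succeeds only if nobody else wrote since we read `expected`.
// In Prisma this would be an updateMany with a `where: { revision: expected }` guard.
function compareAndSwap(id: string, expected: number, next: DocRecord): boolean {
  const current = db.get(id);
  if (!current || current.revision !== expected) return false; // conflict
  db.set(id, next);
  return true;
}

async function applyOperations(id: string, ops: string[], maxRetries = 5): Promise<DocRecord> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const snapshot = db.get(id)!;                    // 1. read latest state, no lock
    const merged = [...snapshot.operations, ...ops]; // 2. CPU-heavy work (reducers, hashing)
    const next = { revision: snapshot.revision + 1, operations: merged };
    if (compareAndSwap(id, snapshot.revision, next)) return next; // 3. conditional write
    // 4. conflict detected: loop and redo the work on the new state
  }
  throw new Error(`gave up after ${maxRetries} conflicts`);
}
```

Because the queue manager serializes writes per document, the retry path should almost never trigger in practice.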

Technical issues

If we use separate processes then all communication between the queue manager and the workers has to happen through serialized objects. Currently we pass a callback to a worker so it can write to the DB; this is no longer possible.

One simpler alternative is for the worker to return the operations it wants to store and have the QueueManager perform the call to the DB. This is less scalable, since everything has to go through the QueueManager (running on the main thread).
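A sketch of that simpler alternative (interfaces are hypothetical, not the real Document Drive types): the worker only computes and returns plain serializable data, and the QueueManager on the main thread is the single writer.

```typescript
// Serializable result a worker sends back over the process/thread boundary.
interface OperationResult { documentId: string; operations: string[] }

// Runs in a background worker: does the CPU-heavy work, returns plain data.
function processJob(job: { documentId: string; input: string[] }): OperationResult {
  // ...serializing, hashing, running reducers would happen here...
  return {
    documentId: job.documentId,
    operations: job.input.map((op) => `applied:${op}`),
  };
}

// Runs on the main thread: the only component that touches storage.
class QueueManager {
  private store = new Map<string, string[]>(); // stand-in for the DB

  handleResult(result: OperationResult): void {
    const existing = this.store.get(result.documentId) ?? [];
    this.store.set(result.documentId, [...existing, ...result.operations]);
  }

  read(documentId: string): string[] {
    return this.store.get(documentId) ?? [];
  }
}
```

The trade-off is visible in the shape of the code: `handleResult` is a serialization point that every job must pass through.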

Each worker could have its own connection to the cache and storage layers, allowing it to bypass the QueueManager to complete a job. For example, the Postgres instance we currently use in production supports 20 connections. This would move the parallel load to the services we use (Redis/Postgres), outside of Switchboard. We could even run these workers in the cloud with serverless computing.

Node supports multiple processes with the cluster and child_process modules, and multiple threads with worker_threads. In the browser, parallelisation can be added by using Web Workers.
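For illustration, here is a minimal worker_threads example offloading a hash to a background thread (the `hashInWorker` helper is hypothetical; the worker body is inlined with `eval: true` for brevity, whereas a real setup would load a worker file):

```typescript
import { Worker } from "node:worker_threads";

// Offload a SHA-256 computation to a worker thread so the main thread
// is free while the CPU-heavy work runs.
function hashInWorker(payload: string): Promise<string> {
  const workerCode = `
    const { parentPort, workerData } = require("node:worker_threads");
    const { createHash } = require("node:crypto");
    parentPort.postMessage(createHash("sha256").update(workerData).digest("hex"));
  `;
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerCode, { eval: true, workerData: payload });
    worker.once("message", resolve); // result arrives as a serialized message
    worker.once("error", reject);
  });
}
```

Note that `workerData` and the posted message are structured-cloned across the thread boundary, which is exactly the serialized-objects constraint described above.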

Tasks

- Research the available APIs
- Decide between one queue vs multiple queues