treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0
4.46k stars 359 forks source link

Experiment: Reduce contention on branch HEADs to improve merge performance #8288

Closed arielshaqed closed 4 weeks ago

arielshaqed commented 1 month ago

What

Under heavy load, lakeFS merges to the same branch slow each other down and can cause failure. One manifestation of such failures is in merges and other commit operations failing as "Locked".

Experiment with reducing contention on branch HEADs to improve performance and reduce error rates.

Branch updates

Any operation that changes the branch head is an update. These include merges, commits, reverts, and cherrypicks. The KV implementation of safe branch updates is actually simple:

Doing this ensures that the update is correct: it explicitly serializes all branches to the HEAD. However it also means that if more than one thread try to update the same branch concurrently, only one will succeed. So the work of all other threads is wasted. This includes the resources that these operations take, which include access to the KV and to the backing store, local disk IOPs, and network.

Reducing contention

Large-scale concurrent updates are mostly created by automated proceses (humans cannot create a large number of concurrent operations), and therefore it is reasonable to expect most of them to succeed. This experiment is to reduce contention and measure by how much it improves merge performance.

arielshaqed commented 1 month ago

Alternatives

We can do other things! Reducing contention is the cheapest experiment. If we do it we can still add better alternatives later.

Here are some alternatives in order of increasing difficulty of implementation.

Reduce work of retrying merges

Specific to merges. Every time a merge needs to be retried, its source will typically be further away from the destination. After all, some other concurrent merge affected the source, and if the user is not wasting too much effort then different merges do not perform the same changes.

So writing the merged metarange requires more effort on every retry. This is the opposite of fair! Whenever a merge retries it takes longer to write its new metarange. Meanwhile other newer merges can start from a closer source - so they have less work to do.

Rather than abandon the metarange from the failed attempt, we could:

Why not now?

Reduce contention, but fairly

We can additionally require some fairness. For instance, we might roughly queue branch updates. This will reduce the time spent by updates that take longer. I would expect it to affect p90 or p95 performance, but affect less the average time to wait.

Why not now?

This includes the proposed experiment and more. So it is more work.

Clearly doing this is harder than merely reducing contention. As an example, see the difference in implementation complexity between a mutex and a fair mutex.

Reduce contention with cooperating processes

Fairness implies some collection of pending branch update operations, typically a queue or something with similar characteristics. Rather than merely blocking one another, operations can advance the queue. In this version, branch updates are queued up. Workers scan these queues and perform the operations queued-up on each branch, in order. Contention is reduced because a worker can hold onto a branch for a long time and perform multiple operations without blocking anything. The web service threads only enqueue work and wait for it to be done.

This version has very convenient operational characteristics. Only branch update workers race to handles a branch, and it does not matter which one wins. Once a worker starts handling a branch it can keep going until it runs out of work or decides to handle another branch. So fairness is simple. Adding branch update workers causes some contention on the queues, but not on the operations.

Why not now

This includes the proposed experiment and more. So it is more work.

We need an additional queuing system, and possibly a mailbox so the operation can block until a worker handles its request.