arielshaqed commented 1 month ago

What

When enabled, this feature improves performance of concurrent merges.

No seriously, what?

lakeFS retains branch heads on KV. An operation that updates the branch head - typically a commit or a merge - performs a 3-phase commit protocol:

Get the branch head from KV
Create the metarange on block storage based on that branch head
Create a new commit with that metarange whose parent is that branch head, IF the branch head has not moved

When the third phase fails, the operation fails. We do retry automatically on the server, but obviously under load it will still fail. Indeed we see this happen to some users who have heavy merge loads. In this case each retry is actually more expensive, because the gap between the source commit and the destination branch tends to grow with every successive commit to the destination. Failure is pretty much guaranteed once sustained load crosses some critical threshold.

How

This PR is an experiment to use "weak ownership" to avoid having to retry. Broadly speaking, every branch update operation takes ownership of its branch. This excludes all other operations from updating that branch. This exclusion does not reduce performance - only one concurrent branch operation can ever succeed, and almost all concurrent branch operations can succeed. In fact, it increases performance because it prevents wasting resources on all the concurrent failing operations.

What about CAP?

Distributed locking is impossible by the CAP theorem. That's why we use weak ownership. This gives up on C (consistency) and some P (partition resistance) in favour of keeping A (availability). Specifically, a thread takes ownership of a branch by asserting ownership as a record on KV. This ownership has an expiry time; the thread refreshes its ownership until it is done, at which point it deletes the record.

This ownership is weak: usually a single thread owns the branch, but sometimes 2 threads can end up owning the same branch! If the thread fails to refresh on time (due to a partition or slowness or poor clock sync, for instance) then another thread will also acquire ownership! When that happens we have concurrent branch updates. Because these are correct in lakeFS, we lose performance and may need to retry, but we never give incorrect answers.

Results

This PR also adds a lakectl abuse merge command to allow us to measure. b0cfca30f718 has some numbers when running locally: with sustained concurrency 50, we run faster and get 0 failures instead of 6.6% failures. More details there why this is even better than it sounds. Graph available here, here's a spoiler: graph showing speedup when enabling weak ownership

github-actions[bot] commented 1 month ago

:recycle: PR Preview d3520713d637ac29709f27dbe25acc053c8a5ed3 has been successfully destroyed since this PR has been closed.

_{🤖 By surge-preview}

github-actions[bot] commented 1 month ago

E2E Test Results - DynamoDB Local - Local Block Adapter

github-actions[bot] commented 1 month ago

E2E Test Results - Quickstart

yonipeleg33 commented 1 month ago

@arielshaqed looks like amazing work, thank you! I haven't reviewed it yet, but a few general notes:

Do we have an experiment plan for this feature? Which customers/users, for how long, etc.
The term "weak" is a bit inaccurate here IMO: The term "weak" as defined in places like CPP's std::weak_ptr or Rust's std::sync::Weak means a "non-owning reference", which is different than how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership. The term "lease-based ownership" suites better here IMO.

arielshaqed commented 1 month ago

I should probably mention that if we go in this direction, we can deploy this and then improve it. The next step is to add fairness: earlier merge attempts should have a better chance of being next. Otherwise when there are multiple concurrent merges some of the earlier merge will repeatedly lose and time out. If everything is fair then we expect fewer timeouts, and in general the tail latency will shorten. One way to make things fairer is to manage an actual queue of tasks waiting to own the branch. Then either they're thread gets priority to pick up the task (it's allowed to pick up tasks sooner after their expiry, for instance task n in the queue waits until $t{expiry} + n\cdot dt{check}$) or the thread that owns the branch just processes all merges in the queue. Anyway, we don't need that to run this experiment.

arielshaqed commented 1 month ago

@arielshaqed looks like amazing work, thank you! I haven't reviewed it yet, but a few general notes:

Thanks!

Do we have an experiment plan for this feature? Which customers/users, for how long, etc.

Yes. Firstly, this PR adds a lakectl abuse merge command to measure. After we pull I will deploy to Cloud staging, switch it on for a single private installation, and measure. Also as you probably know, we have a user for whom I plan this. I will communicate with them to measure on their system; obviously timing for that depends on the customer timing and how comfortable they feel switching it on.

The term "weak" is a bit inaccurate here IMO: The term "weak" as defined in places like CPP's std::weak_ptr or Rust's std::sync::Weak means a "non-owning reference", which is different than how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership. The term "lease-based ownership" suites better here IMO.

In the C++ and Rust standard libraries, "weak" in this case refers to the pointer. (In other cases in the C++ standard library, "weak" can refer to any variant which is logically weaker, for instance we also have class weak_ordering. In this PR "weak" is supposed to weaken "ownership".

What we have in this PR indeed uses a lease-based implementation. But it's not ownership, it just behaves like ownership in almost all reasonable cases.

A better term might be "mostly correct ownership"[^1]. In fact, I believe that such an unwieldy name would be a good choice! If you agree I will change to use it.

[^1]: The technical term "mostly" here is intended in same sense that Miracle Max uses it.

itaiad200 commented 1 month ago

@arielshaqed could you open an issue and link to it? This is hardly a minor-change :)

arielshaqed commented 1 month ago

@arielshaqed could you open an issue and link to it? This is hardly a minor-change :)

8288 :-)

arielshaqed commented 1 month ago

I really hope kv contract tests cover this

That line no longer occurs directly. We do still have it implied. I believe that it is tested by kvtest here and here.

itaiad200 commented 1 month ago

Sorry for the late feedback, I am reviewing it very cautiously.

I wonder if wrapping BranchUpdate is the appropriate location to "Own" a branch. Every major branch changing operation, like commit & merge has 2 BranchUpdate calls. One for changing the staging token and one for the actual operation. If these 2 operations don't happen one after the other. The current ownership model only wraps a single BranchUpdate call, thus releasing the lock prematurely.

arielshaqed commented 1 month ago

Sorry for the late feedback, I am reviewing it very cautiously.

I wonder if wrapping BranchUpdate is the appropriate location to "Own" a branch. Every major branch changing operation, like commit & merge has 2 BranchUpdate calls. One for changing the staging token and one for the actual operation. If these 2 operations don't happen one after the other. The current ownership model only wraps a single BranchUpdate call, thus releasing the lock prematurely.

I'm not sure this matters. There is no connection between the two update operations: my commit can succeed even if you were the last to seal staffing tokens. And of course sealing is cheap. So I would not expect this to affect performance.

Of course the easy way to be sure will be to perform the experiment.

arielshaqed commented 1 month ago

Testing on lakeFS Cloud (staging)

Setup

Cloud installation on staging (codename popular-mastodon 🐘).
- Boosted it to 500 RCUs + 50 WCUs.
Replaced image with an experimental image built from the branch, tagged v1.38.0-15-g0ec60--hinted-merger.1.
Run the lakectl abuse merge command that this PR provides.
Run from remote (home); this does not matter much because we perform 1.5-2 merges/s.
In one case (* below) I noticed a gateway error so I reran to get better numbers. This was a test with the feature turned off.

Results

Weak ownership	Parallelism	Results
❌	3*	First error: merge from 92: [502 Bad Gateway]: request failed 200 total with 15 failures in 2m0.220660011s 1.663608 successes/s 0.124771 failures/s Gateway failure and 11 branches actually not created. Re-ran.
	3	First error: merge from 88: [423 Locked]: Too many attempts, try again later request failed 200 total with 1 failures in 2m11.462751261s 1.521343 successes/s 0.007607 failures/s
	6	First error: merge from 2: [423 Locked]: Too many attempts, try again later request failed 200 total with 19 failures in 1m49.366288753s 1.828717 successes/s 0.173728 failures/s
	20 (fastest! worst!)	First error: merge from 22: [423 Locked]: Too many attempts, try again later request failed 200 total with 52 failures in 1m36.082933636s 2.081535 successes/s 0.541199 failures/s
✅	3	200 total with 0 failures in 2m1.140085198s 1.650981 successes/s 0.000000 failures/s
	6	200 total with 0 failures in 1m46.569144906s 1.876716 successes/s 0.000000 failures/s
	20	200 total with 0 failures in 1m49.595498987s 1.824892 successes/s 0.000000 failures/s

Analysis

Without weak ownership, contention always caused failures. With concurrency 20, > 25% of merges failed. Each failure actually helps concurrent merges, so failures speed things up.

With weak ownership, we saw no errors. The merge rate improved slightly from 3 to 6, and then stayed flat.

arielshaqed commented 4 weeks ago

Thanks! Pulling... time flies by so quickly when you're having fun...

arielshaqed commented 4 weeks ago

" Run latest lakeFS app on AWS S3 Expected — Waiting for status to be reported "

this is literally finished, today I am not a GitHub fan.

treeverse / lakeFS

Experiment: Improve concurrent merge performance by weakly owning branch updates #8268

What

No seriously, what?

How

What about CAP?

Results

E2E Test Results - DynamoDB Local - Local Block Adapter

E2E Test Results - Quickstart

8288 :-)

Testing on lakeFS Cloud (staging)

Setup

Results

Analysis