Closed: arielshaqed closed this pull request 4 weeks ago.
> @arielshaqed looks like amazing work, thank you! I haven't reviewed it yet, but a few general notes:
I should probably mention that if we go in this direction, we can deploy this and then improve it. The next step is to add fairness: earlier merge attempts should have a better chance of being next. Otherwise, when there are multiple concurrent merges, some of the earlier merges will repeatedly lose and time out. If everything is fair then we expect fewer timeouts, and in general the tail latency will shorten. One way to make things fairer is to manage an actual queue of tasks waiting to own the branch. Then either their thread gets priority to pick up the task (it is allowed to pick up tasks sooner after their expiry; for instance, task $n$ in the queue waits until $t_{\text{expiry}} + n \cdot dt_{\text{check}}$), or the thread that owns the branch just processes all merges in the queue. Anyway, we don't need that to run this experiment.
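A minimal Go sketch of that linear back-off, purely for illustration (the names, the helper, and the exact formula are assumptions, not part of this PR): the task at position $n$ waits until $t_{\text{expiry}} + n \cdot dt_{\text{check}}$ before trying to take over the branch.

```go
// Hypothetical sketch of the fairness rule described above; not part of this PR.
package fairness

import "time"

// nextAttemptAt returns when the task at position n in the wait queue may try
// to take ownership: the current owner's expiry plus n check intervals.
func nextAttemptAt(expiry time.Time, n int, dtCheck time.Duration) time.Time {
	return expiry.Add(time.Duration(n) * dtCheck)
}

// waitForTurn sleeps until this task's slot: the oldest waiter (n == 0) may
// retry right at expiry, later waiters back off linearly behind it.
func waitForTurn(expiry time.Time, n int, dtCheck time.Duration) {
	time.Sleep(time.Until(nextAttemptAt(expiry, n, dtCheck)))
}
```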
> @arielshaqed looks like amazing work, thank you! I haven't reviewed it yet, but a few general notes:
Thanks!
> - Do we have an experiment plan for this feature? Which customers/users, for how long, etc.
Yes. Firstly, this PR adds a `lakectl abuse merge` command to measure. After we pull I will deploy to Cloud staging, switch it on for a single private installation, and measure. Also, as you probably know, we have a user for whom I plan this. I will communicate with them to measure on their system; obviously the timing for that depends on the customer's timing and on how comfortable they feel switching it on.
> - The term "weak" is a bit inaccurate here IMO: "weak" as defined in places like C++'s `std::weak_ptr` or Rust's `std::sync::Weak` means a "non-owning reference", which is different from how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership. The term "lease-based ownership" suits better here IMO.
In the C++ and Rust standard libraries, "weak" in this case refers to the pointer. (In other cases in the C++ standard library, "weak" can refer to any variant which is logically weaker; for instance, we also have class `std::weak_ordering`.) In this PR, "weak" is supposed to weaken "ownership".
What we have in this PR indeed uses a lease-based implementation. But it's not ownership; it just behaves like ownership in almost all reasonable cases.
A better term might be "mostly correct ownership"[^1]. In fact, I believe that such an unwieldy name would be a good choice! If you agree I will change to use it.
[^1]: The technical term "mostly" here is intended in the same sense that Miracle Max uses it.
@arielshaqed could you open an issue and link to it? This is hardly a minor change :)
Sorry for the late feedback, I am reviewing it very cautiously.

I wonder if wrapping `BranchUpdate` is the appropriate location to "own" a branch. Every major branch-changing operation, like commit & merge, makes 2 `BranchUpdate` calls: one for changing the staging token and one for the actual operation, and these 2 calls don't happen one right after the other. The current ownership model only wraps a single `BranchUpdate` call, thus releasing the lock prematurely.
I'm not sure this matters. There is no connection between the two update operations: my commit can succeed even if you were the last to seal staging tokens. And of course sealing is cheap. So I would not expect this to affect performance.
Of course the easy way to be sure will be to perform the experiment.
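For concreteness, a rough Go sketch of the shape under discussion, with entirely hypothetical names and signatures (the real `BranchUpdate` API differs): each call takes weak ownership only for its own duration, so ownership is released between sealing the staging token and updating the head.

```go
// Hypothetical sketch of a commit as two separately-owned BranchUpdate calls.
package sketch

import "context"

// branchUpdate stands in for the real BranchUpdate call: it takes weak
// ownership of the branch only for the duration of update.
func branchUpdate(ctx context.Context, branchID string, update func() error) error {
	release := acquireWeakOwnership(ctx, branchID) // hypothetical helper
	defer release()
	return update()
}

func commit(ctx context.Context, branchID string) error {
	// First BranchUpdate: seal the current staging token.
	if err := branchUpdate(ctx, branchID, func() error { return sealStagingToken(ctx, branchID) }); err != nil {
		return err
	}
	// Ownership is released here, so another writer may seal again before we
	// finish. As argued above, sealing is cheap and the commit can still succeed.
	// Second BranchUpdate: point the branch head at the new commit.
	return branchUpdate(ctx, branchID, func() error { return setNewHead(ctx, branchID) })
}

// Hypothetical helpers, declared only so the sketch is self-contained.
func acquireWeakOwnership(ctx context.Context, branchID string) (release func()) {
	return func() {}
}
func sealStagingToken(ctx context.Context, branchID string) error { return nil }
func setNewHead(ctx context.Context, branchID string) error       { return nil }
```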
Ran on Cloud staging (popular-mastodon 🐘), version `v1.38.0-15-g0ec60--hinted-merger.1`, using the `lakectl abuse merge` command that this PR provides.

| Weak ownership | Parallelism | Results |
|---|---|---|
| ❌ | 3* | First error: merge from 92: [502 Bad Gateway]: request failed. 200 total with 15 failures in 2m0.220660011s; 1.663608 successes/s, 0.124771 failures/s. Gateway failure and 11 branches actually not created. Re-ran. |
| ❌ | 3 | First error: merge from 88: [423 Locked]: Too many attempts, try again later. request failed: 200 total with 1 failures in 2m11.462751261s; 1.521343 successes/s, 0.007607 failures/s |
| ❌ | 6 | First error: merge from 2: [423 Locked]: Too many attempts, try again later. request failed: 200 total with 19 failures in 1m49.366288753s; 1.828717 successes/s, 0.173728 failures/s |
| ❌ | 20 (fastest! worst!) | First error: merge from 22: [423 Locked]: Too many attempts, try again later. request failed: 200 total with 52 failures in 1m36.082933636s; 2.081535 successes/s, 0.541199 failures/s |
| ✅ | 3 | 200 total with 0 failures in 2m1.140085198s; 1.650981 successes/s, 0.000000 failures/s |
| ✅ | 6 | 200 total with 0 failures in 1m46.569144906s; 1.876716 successes/s, 0.000000 failures/s |
| ✅ | 20 | 200 total with 0 failures in 1m49.595498987s; 1.824892 successes/s, 0.000000 failures/s |
Without weak ownership, contention always caused failures. With concurrency 20, > 25% of merges failed. Each failure actually helps concurrent merges, so failures speed things up.
With weak ownership, we saw no errors. The merge rate improved slightly from parallelism 3 to 6, and then stayed flat.
Thanks! Pulling... time flies by so quickly when you're having fun...
" Run latest lakeFS app on AWS S3 Expected — Waiting for status to be reported "
What
When enabled, this feature improves performance of concurrent merges.
No seriously, what?
lakeFS retains branch heads on KV. An operation that updates the branch head - typically a commit or a merge - performs a 3-phase commit protocol:

1. Seal the branch's current staging token (a `BranchUpdate`).
2. Perform the actual operation and write the new commit.
3. Update the branch head to point at the new commit, only if it has not changed in the meantime (another `BranchUpdate`).
When the third phase fails, the operation fails. We do retry automatically on the server, but obviously under load it will still fail. Indeed we see this happen to some users who have heavy merge loads. In this case each retry is actually more expensive, because the gap between the source commit and the destination branch tends to grow with every successive commit to the destination. Failure is pretty much guaranteed once sustained load crosses some critical threshold.
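Roughly, and with made-up helper names rather than the real lakeFS code, the retry loop behaves like the sketch below: every failed conditional update of the branch head forces the merge to be recomputed against the newer head before the next attempt, which is why retries get more expensive under load.

```go
// Hypothetical sketch of the retry-on-contention behaviour described above.
package sketch

import (
	"context"
	"errors"
)

var errPreconditionFailed = errors.New("branch head changed concurrently")

func mergeWithRetries(ctx context.Context, branchID string, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		head := readBranchHead(ctx, branchID)                   // current destination head
		newCommit := computeMerge(ctx, branchID, head)          // expensive, and the gap grows on every retry
		err := setBranchHeadIf(ctx, branchID, newCommit, head)  // conditional update of the head
		if err == nil {
			return nil
		}
		if !errors.Is(err, errPreconditionFailed) {
			return err
		}
		// Another operation won the race; loop and redo the merge from the new head.
	}
	return errors.New("too many attempts, try again later")
}

// Hypothetical helpers, declared only so the sketch is self-contained.
func readBranchHead(ctx context.Context, branchID string) string                 { return "" }
func computeMerge(ctx context.Context, branchID, head string) string             { return "" }
func setBranchHeadIf(ctx context.Context, branchID, commit, ifHead string) error { return nil }
```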
How
This PR is an experiment to use "weak ownership" to avoid having to retry. Broadly speaking, every branch update operation takes ownership of its branch. This excludes all other operations from updating that branch. This exclusion does not reduce performance: without it only one of the concurrent branch operations can ever succeed anyway, while with it almost all concurrent branch operations can succeed, just one after the other. In fact, it increases performance, because it prevents wasting resources on all the concurrent failing operations.
What about CAP?
Distributed locking is impossible by the CAP theorem. That's why we use weak ownership: it gives up on C (consistency) and some P (partition tolerance) in favour of keeping A (availability). Specifically, a thread takes ownership of a branch by asserting ownership as a record on KV. This ownership has an expiry time; the thread refreshes its ownership until it is done, at which point it deletes the record.
This ownership is weak: usually a single thread owns the branch, but sometimes 2 threads can end up owning the same branch! If a thread fails to refresh on time (due to a partition, slowness, or poor clock sync, for instance) then another thread will also acquire ownership! When that happens we have concurrent branch updates. Because these are still correct in lakeFS, we only lose performance and may need to retry; we never give incorrect answers.
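Here is a minimal Go sketch of this lease-style weak ownership, under the assumptions described above (an ownership record on KV with an expiry, periodic refresh, delete on release). The `KV` interface and every name in it are made up for illustration; this is not the lakeFS `kv` package API.

```go
// Hypothetical sketch of lease-based "weak ownership" of a branch over a KV store.
package sketch

import (
	"context"
	"time"
)

// KV is a made-up interface; UpsertOwner succeeds if the key is unowned,
// expired, or already owned by this owner (so it also serves as a refresh).
type KV interface {
	UpsertOwner(ctx context.Context, key, owner string, expiry time.Time) (bool, error)
	DeleteOwner(ctx context.Context, key, owner string) error
}

// own blocks until it acquires weak ownership of branchKey, keeps refreshing
// the record until ctx is cancelled (the branch update is done), then deletes it.
func own(ctx context.Context, kv KV, branchKey, owner string, ttl, refresh time.Duration) error {
	// Acquire: poll until no live owner remains (or its record expires).
	for {
		ok, err := kv.UpsertOwner(ctx, branchKey, owner, time.Now().Add(ttl))
		if err != nil {
			return err
		}
		if ok {
			break
		}
		time.Sleep(refresh)
	}
	defer kv.DeleteOwner(context.Background(), branchKey, owner) // release on exit

	ticker := time.NewTicker(refresh)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil // the branch update finished; release ownership
		case <-ticker.C:
			// Extend the lease. If we are too late (partition, pause, clock skew),
			// the record may have expired and another thread may now also "own"
			// the branch; that is the "weak" part.
			if _, err := kv.UpsertOwner(ctx, branchKey, owner, time.Now().Add(ttl)); err != nil {
				return err
			}
		}
	}
}
```

The design choice worth noting in this sketch is the failure mode: a missed refresh degrades to the old racing behaviour instead of blocking updates, so availability is kept at the cost of occasionally duplicated work.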
Results
This PR also adds a `lakectl abuse merge` command to allow us to measure. b0cfca30f718 has some numbers when running locally: with sustained concurrency 50, we run faster and get 0 failures instead of 6.6% failures. More details there on why this is even better than it sounds. Graph available here; here's a spoiler: