treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Compaction: Common prefix listing on a branch is linear in # of uncommitted entries #2092

Open ozkatz opened 3 years ago

ozkatz commented 3 years ago

Common prefix listing that includes uncommitted changes on a branch (i.e. the first screen seen in the UI when clicking on a repo) is actually O(n), where n is the number of uncommitted changes.

The problem here is recognizing common prefixes that were deleted. Given an underlying commit and a list of uncommitted changes, the way to tell whether a common prefix still exists is to make sure that not all entries within it are tombstoned in the list of uncommitted changes.

See this: https://github.com/treeverse/lakeFS/blob/v0.42.0/pkg/graveler/combined_iterator.go#L87 This is the iterator that combines a set of changes (iterA) with an underlying commit (iterB). Assuming prefix "a/" contains 1M entries in the underlying commit and 999,999 tombstones in the diff, this iterator will scan all of them when calling Next().
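For illustration, here is a much-simplified sketch of that merge (names and types are invented for the example; this is not the real combined iterator): when nearly every staged entry under a prefix is a tombstone, advancing past the prefix still costs one loop iteration per tombstone.

```go
package main

import "fmt"

// record is a key with an optional tombstone marker (a staged delete).
type record struct {
	key       string
	tombstone bool
}

// listCombined merges staged changes over committed entries, key by key.
// Both slices are assumed sorted by key.
func listCombined(staged, committed []record) []string {
	var out []string
	i, j := 0, 0
	for i < len(staged) || j < len(committed) {
		switch {
		case j == len(committed) || (i < len(staged) && staged[i].key <= committed[j].key):
			if j < len(committed) && staged[i].key == committed[j].key {
				j++ // the staged entry shadows the committed one
			}
			if !staged[i].tombstone {
				out = append(out, staged[i].key)
			}
			i++ // every tombstone still costs one loop iteration here
		default:
			out = append(out, committed[j].key)
			j++
		}
	}
	return out
}

func main() {
	// "a/" has 3 committed entries; 2 of them are tombstoned on staging.
	// With 999,999 tombstones the loop above would spin 999,999 times
	// before producing (or ruling out) the common prefix.
	committed := []record{{"a/1", false}, {"a/2", false}, {"a/3", false}}
	staged := []record{{"a/1", true}, {"a/2", true}}
	fmt.Println(listCombined(staged, committed)) // [a/3]
}
```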

Not sure if the current data model allows doing anything better, but it is probably worth exploring and understanding the tradeoff of doing something "better".

arielshaqed commented 2 years ago

> Not sure if the current data model allows doing anything better, but it is probably worth exploring and understanding the tradeoff of doing something "better".

Introducing... range deletes

tl;dr: Allow tombstones to span an entire range. A range tombstone marks that all entries in the underlying commit from a to b are deleted.

This is actually an old idea. I cannot find it explicitly in the BigTable paper, but it appears at least in Cassandra, and every SSTable-based DB seems to have one, e.g. RocksDB range tombstones.

Once introduced, it is easy to come up with rules for managing delete ranges; adding them is cheap enough that we can do everything online. Call every tombstone a range delete, just of size 1 (i.e. a tombstone for key K is a tombstone for the range [K, K || '\0'), where || '\0' means appending a NUL byte to K).
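A minimal sketch of that convention, with illustrative types rather than the actual graveler data model:

```go
// Illustrative types only -- not the actual graveler data model.
package tombstone

// RangeTombstone marks every key in [Start, End) of the underlying commit as deleted.
type RangeTombstone struct {
	Start []byte // inclusive
	End   []byte // exclusive
}

// PointTombstone expresses a single-key delete as a range delete of size 1:
// the range [K, K || '\0'), i.e. from the key itself up to (but excluding)
// the key with a NUL byte appended.
func PointTombstone(key []byte) RangeTombstone {
	end := append(append([]byte{}, key...), 0) // K || '\0'
	return RangeTombstone{Start: key, End: end}
}
```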

We can also directly add range deletes, probably as a really fast prefix delete (rm --recursive, as performed on the client side in #2529) -- this can happen in constant time!
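Reusing the illustrative RangeTombstone type from the sketch above, a prefix delete could then be a single constant-time write. The upper-bound computation here is an assumption of the sketch, not lakeFS code:

```go
// PrefixTombstone deletes every key starting with prefix by writing one
// range tombstone [prefix, prefixUpperBound(prefix)). Constant time,
// regardless of how many committed keys the prefix contains.
func PrefixTombstone(prefix []byte) RangeTombstone {
	return RangeTombstone{Start: prefix, End: prefixUpperBound(prefix)}
}

// prefixUpperBound returns the smallest key greater than every key that
// starts with prefix: increment the last byte that is not 0xFF and drop the
// rest. An all-0xFF prefix has no finite bound; nil means "end of keyspace".
func prefixUpperBound(prefix []byte) []byte {
	end := append([]byte{}, prefix...)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xFF {
			end[i]++
			return end[:i+1]
		}
	}
	return nil
}
```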

Note that support for partial commits (#2512) inevitably interacts with range deletions. Nothing is impossible here and the rules will make sense, but we shall have to be aware of both of these solutions.

Range deletes are good for iteration

Object listings are a merge between iteration on the committed HEAD and on staging (both with an O(log n) SeekGE operation). So whenever we see a range tombstone we skip the committed iterator past its end.
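A hedged sketch of that skip (the iterator interface and its SeekGE semantics are assumptions for the example, not the real graveler API):

```go
// Illustrative merge loop only.
package listing

import "bytes"

type RangeTombstone struct {
	Start []byte // inclusive
	End   []byte // exclusive
}

// Iterator is a sorted iterator: SeekGE positions it so that the following
// Next() returns the first key >= the given key, in O(log N).
type Iterator interface {
	Next() bool
	Key() []byte
	SeekGE(key []byte)
}

// listWithRangeDeletes emits committed keys that are not covered by any
// staged range tombstone. deletes must be sorted by Start and non-overlapping.
func listWithRangeDeletes(committed Iterator, deletes []RangeTombstone, emit func(key []byte)) {
	d := 0
	for committed.Next() {
		key := committed.Key()
		// Drop tombstones that end at or before the current key.
		for d < len(deletes) && bytes.Compare(deletes[d].End, key) <= 0 {
			d++
		}
		// Key falls inside the current tombstone: one O(log N) seek skips the
		// whole deleted range instead of stepping over each shadowed entry.
		if d < len(deletes) && bytes.Compare(deletes[d].Start, key) <= 0 {
			committed.SeekGE(deletes[d].End)
			d++
			continue
		}
		emit(key)
	}
}
```

In the 999,999-tombstone example from the original report, the whole deleted prefix would then cost a single seek instead of a million Next() calls.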

This is still linear in the # of uncommitted entries, but that was never really the point. We are now almost linear in output size. Specifically, time is O(n + d log N), where n is the output size, N is the size of the entire commit, and d is the number of range deletions on staging. (We can make it O(n + d log n) if the SSTable library can use previous iterator hints correctly, but that probably barely matters...)

arielshaqed commented 2 years ago

Also relevant: #2466. But ranges are quite easy to support on a sorted-key KV: use either the start or the end of the range as the key, store the other inside the value, and because there is never any overlap between ranges or keys in the staging area, the application logic can do the rest.
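A rough sketch of that layout, with a made-up store interface: the range's start is the staging key and the exclusive end travels in the value; since staged ranges never overlap, the greatest stored key at or below a lookup key is the only candidate that can cover it.

```go
// Illustrative only: laying range tombstones out on a plain sorted-key KV.
package kvranges

import (
	"bytes"
	"errors"
)

var ErrNotFound = errors.New("not found")

// Store is a minimal sorted-key KV abstraction (assumed, not a lakeFS API).
type Store interface {
	Set(key, value []byte) error
	// FloorEntry returns the greatest stored key <= key, with its value,
	// or ErrNotFound.
	FloorEntry(key []byte) (k, v []byte, err error)
}

// PutRangeTombstone stores the range [start, end) under its start key;
// the exclusive end travels in the value.
func PutRangeTombstone(s Store, start, end []byte) error {
	return s.Set(start, end)
}

// Covered reports whether key falls inside a stored range tombstone.
func Covered(s Store, key []byte) (bool, error) {
	_, end, err := s.FloorEntry(key)
	if errors.Is(err, ErrNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return bytes.Compare(key, end) < 0, nil
}
```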

ozkatz commented 2 years ago

@arielshaqed This is a great idea and I think it deserves a proposal in its own right!

Just clarifying: this doesn't solve the issue (or at least not all occurrences of it), at least not without some client integration (i.e. if s3a still deletes partitions object-wise, or if the AWS SDKs do the same). It does, however, open the door to implementing those.

arielshaqed commented 2 years ago

> @arielshaqed This is a great idea and I think it deserves a proposal in its own right!

Given that the range deletes proposal is intended solely to resolve the problem of many tombstones, I think I like it here :-)

> Just clarifying: this doesn't solve the issue (or at least not all occurrences of it), at least not without some client integration (i.e. if s3a still deletes partitions object-wise, or if the AWS SDKs do the same). It does, however, open the door to implementing those.

Not sure that this is true. This proposal does NOT add a "range delete" API. Instead it adds a new(-ish) datatype to store in the Graveler staging area. Specifically, this part about how to delete:

  • "When adding an object at key K, check the previous staging element Sp" -- which proceeds to explain how to extend any adjacent delete ranges.

So any S3 client that deletes all elements of a range (in any order) will end up generating the exact same range delete tombstone. (And as noted in the proposal, this is ~a stolen~ not a particularly novel idea.)
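A speculative sketch of that convergence (the adjacency rule and helpers here are one possible reading of the proposal, not its exact wording): a single-key delete extends an existing tombstone whenever no live committed key separates them, so deleting every key under a prefix, in any order, collapses into one tombstone.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type RangeTombstone struct {
	Start []byte // inclusive
	End   []byte // exclusive
}

// succ returns the smallest key greater than k: k || '\0'.
func succ(k []byte) []byte { return append(append([]byte{}, k...), 0) }

// deleteKey stages a delete of key, extending an adjacent tombstone when the
// committed keys between them are already covered; committed must be sorted.
// (A fuller version would also merge two ranges that become adjacent.)
func deleteKey(ts []RangeTombstone, committed [][]byte, key []byte) []RangeTombstone {
	// ceil(k) = smallest committed key >= k, or nil.
	ceil := func(k []byte) []byte {
		i := sort.Search(len(committed), func(i int) bool { return bytes.Compare(committed[i], k) >= 0 })
		if i == len(committed) {
			return nil
		}
		return committed[i]
	}
	for i := range ts {
		t := &ts[i]
		if bytes.Equal(ceil(t.End), key) { // key is the next live key after t
			t.End = succ(key)
			return ts
		}
		if c := ceil(succ(key)); c != nil && bytes.Compare(c, t.Start) >= 0 &&
			bytes.Compare(key, t.Start) < 0 { // t starts right after key
			t.Start = key
			return ts
		}
	}
	return append(ts, RangeTombstone{Start: key, End: succ(key)})
}

func main() {
	committed := [][]byte{[]byte("a/1"), []byte("a/2"), []byte("a/3")}
	var ts []RangeTombstone
	// Delete every key under "a/" in an arbitrary order.
	for _, k := range [][]byte{[]byte("a/2"), []byte("a/3"), []byte("a/1")} {
		ts = deleteKey(ts, committed, k)
	}
	fmt.Printf("%d tombstone(s): start=%q end=%q\n", len(ts), ts[0].Start, ts[0].End)
}
```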

github-actions[bot] commented 11 months ago

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

github-actions[bot] commented 11 months ago

Closing this issue because it has been stale for 7 days with no activity.

arielshaqed commented 11 months ago

Still linear; we just have better batching. This will become an issue, so I would rather keep it open.

arielshaqed commented 9 months ago

One issue with the current KV implementation is that we have concurrent adds... and there is no easy way to lock a region of the key space. If one thread merges 2 tombstones while another is adding an item between them, we lose. So tombstones can only be compacted on a sealed token. (No problem with concurrent commits, we just might have multiple tombstones for the same item during a compaction.) When do we seal and compact? Any time is good, but not too often. I would probably count the number of tombstones created adjacent to one another, or at all, and use that as a hint.