teh-cmc opened 1 month ago
I'm realizing that what constitutes "metadata" in the context of Rerun might not even be that obvious. So here's my take on it: metadata is anything that can be inspected at the Chunk layer (I guess the correct terminology would be "Chunk-level metadata").
Some thoughts on `Clear`s as I'm pulling my hair out implementing support for them in the dataframe APIs.

## Context

`Clear`s today are implemented with two components:

- A `ClearIndicator` marker. It's a ZST, it's cheap, and it'll be replaced by tagged components anyhow. It's fine.
- A `ClearIsRecursive` component that represents the actual tombstone, backed by a `BooleanArray`, which indicates whether the tombstone should apply recursively or not. In practice, that `BooleanArray` always has unit length: clears are always applied with clamping semantics. Also in practice: that `BooleanArray` is always wrapped in a `ListArray` during the chunking process, since this is just a component like any other.

Moral of the story: these tombstones can actually take a non-negligible amount of physical space if you start having a lot of them, since you're paying both for the boolean bitmaps and the list-array offsets. Sadly enough, this is the least of our worries.

The potentially recursive nature of `Clear`s makes them both very complex and very costly to resolve. The reason for this is that they require data-driven queries, as opposed to metadata-driven queries. That's not good. To resolve a `Clear`, you need to find all `Chunk`s that could potentially (depending on the contents of the cell) contain tombstones relevant for the entity at hand.

Say you have an entity `/a/b/c` and you want to know whether it has been cleared at some time `t`:

- Any `Chunk` stored on `/a/b/c` itself that contains a `ClearIsRecursive` column will unconditionally affect `/a/b/c`, regardless of whether the tombstone is recursive or not. I.e. you do not need to look at the data itself. All you need is to run a `LatestAt(/a/b/c, ClearIsRecursive, t)` query and check the data-time of the result, and you're done. Great, we know how to do that efficiently.
- But `/a/b/c` is also affected by recursive `Clear`s for both `/a` and `/a/b`, and that's where your problems start: you effectively need to compute `LatestAt(/a/b/c, ClearIsRecursive(true), t)` ( :warning: note the `(true)` ), which is a data-driven query, which is something we cannot do (certainly not efficiently, at least). The only way to do that is to start an unbounded backwards walk through the chunks, deserializing and inspecting the data, and only stopping once a recursive clear has actually been found.

That's pretty bad already, but it gets even worse in a dataframe context, where all of the above has to happen as part of a larger, complicated streaming join that involves an arbitrary number of entities. It's terrible. It is also probably fair to assume that it gets worse still in a disk/network-based storage environment.
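To make the metadata-driven vs. data-driven distinction concrete, here is a deliberately simplified Python model of the two query shapes described above. All names (`Chunk`, `latest_at_metadata`, `latest_recursive_clear`, the field layout) are hypothetical illustrations, not Rerun's actual APIs:

```python
# Hypothetical, simplified model -- NOT Rerun's actual Chunk/query APIs.
from dataclasses import dataclass

@dataclass
class Chunk:
    entity_path: str
    times: list[int]          # time column: metadata, no deserialization needed
    is_recursive: list[bool]  # ClearIsRecursive payload: data, must be inspected

def latest_at_metadata(chunks, entity, t):
    """Metadata-driven: only looks at entity paths and time columns."""
    best = None
    for chunk in chunks:
        if chunk.entity_path != entity:
            continue
        for time in chunk.times:
            if time <= t and (best is None or time > best):
                best = time
    return best

def latest_recursive_clear(chunks, ancestor, t):
    """Data-driven: must walk backwards through the actual cell contents
    until a recursive clear is found -- potentially unbounded."""
    hits = []
    for chunk in chunks:
        if chunk.entity_path != ancestor:
            continue
        for time, recursive in zip(chunk.times, chunk.is_recursive):
            if time <= t:
                hits.append((time, recursive))
    for time, recursive in sorted(hits, reverse=True):  # backwards walk
        if recursive:  # had to deserialize and inspect the data!
            return time
    return None

chunks = [
    Chunk("/a", times=[1, 5], is_recursive=[False, True]),
    Chunk("/a/b/c", times=[3], is_recursive=[False]),
]
print(latest_at_metadata(chunks, "/a/b/c", 10))  # -> 3
print(latest_recursive_clear(chunks, "/a", 10))  # -> 5
```

The first function never touches `is_recursive`; the second cannot answer without it, which is exactly why the store cannot resolve it from chunk-level metadata alone.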
In practice, the only sane way to do all of the above (especially in a dataframe context) is to fetch all the potentially relevant `Chunk`s into RAM, and then do all the processing there. But as mentioned at the start, `Clear`s actually take a non-negligible amount of space, so fetching them all into RAM can become a serious issue with real-world datasets. I.e. it is complex, slow and memory-intensive all at once. :+1:

In fact, it's so complex that I've just realized the existing code in `re_query` is wrong, and nobody's ever noticed: #7631
Proposal
Split
ClearIsRecursive
intoClearFlat
andClearRecursive
.ClearFlat
andClearRecursive
areNullArray
s (ListArray<NullArray>
once chunkified).This gets rid of the data-driven queries, and demotes them back down to good old metadata-driven queries. In the future, we could probably just use a tag instead of two separate components, since tags are part of the column metadata.
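Under this proposal, whether a chunk contains a `ClearFlat` or `ClearRecursive` column is part of its schema, i.e. chunk-level metadata. A rough Python sketch of what resolution then looks like (again, all names and the field layout are illustrative, not the real implementation):

```python
# Hypothetical sketch of the proposed scheme -- component names per the
# proposal, everything else illustrative.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    entity_path: str
    times: list[int]
    columns: set[str] = field(default_factory=set)  # e.g. {"ClearFlat"}

def latest_clear(chunks, entity, t):
    """Latest time <= t at which `entity` is cleared: its own ClearFlat or
    ClearRecursive columns, plus ClearRecursive on any ancestor."""
    parts = entity.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    best = None
    for chunk in chunks:
        if chunk.entity_path == entity:
            relevant = {"ClearFlat", "ClearRecursive"} & chunk.columns
        elif chunk.entity_path in ancestors:
            relevant = {"ClearRecursive"} & chunk.columns
        else:
            continue
        if not relevant:
            continue
        for time in chunk.times:  # only schema + time column are inspected
            if time <= t and (best is None or time > best):
                best = time
    return best

chunks = [
    Chunk("/a", [5], {"ClearRecursive"}),
    Chunk("/a/b", [7], {"ClearFlat"}),    # flat: does NOT affect /a/b/c
    Chunk("/a/b/c", [3], {"ClearFlat"}),
]
print(latest_clear(chunks, "/a/b/c", 10))  # -> 5
```

Note that no cell contents are ever inspected: the recursive/flat distinction moved from the data into the column set, which is exactly the demotion to metadata-driven queries.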
The `/a/b/c` case study from above just becomes a matter of A) fetching all chunks that contain either a `ClearFlat` or a `ClearRecursive` column for `/a/b/c`, and then B) fetching all the chunks that contain a `ClearRecursive` column for `/a` and `/a/b`. Once these chunks are densified (which the query APIs do on your behalf), it just becomes a matter of looking at the time column. That fixes all the complexity issues.

That doesn't address the size issues though (it removes the bitmap, which is something, I guess, but we still need the outer `ListArray`).

## Impact on public APIs
Very likely none at all? The `ClearIsRecursive` object goes away, which is a breaking change of course, but in practice we've never really advertised it anywhere: we expose nice helpers instead. We should be able to keep those helpers working as-is.
## Impact on public ABI

This obviously breaks it. It is possible to write an automatic migration tool if we deem it worth the effort.
## Should this be part of 0.19?

~~TBD~~ Probably not
## FAQ

### Can we just support data-driven queries?

Supporting data-driven queries means supporting data-driven indices. Even if we ignore the giant can of worms in the room, the `ChunkStore` is fast because it doesn't even look at the chunks' data, let alone index it.

Of course we will support data-driven indices in the data platform, but that's a completely different matter AFAIC.
### Should `ClearFlat` / `ClearRecursive` basically become `ControlColumn`s so they don't even need the `ListArray` wrappers?

The issue is that we need some kind of validity bitmap to know which rows have a `Clear` and which don't. So either it's a `ListArray<NullArray>` (because `NullArray` doesn't have a bitmap), or it could be a `BooleanArray`, but then you have to explain somewhere that the `values()` are irrelevant and only the bitmap matters, or something. You could also say "clears always go in their own chunk", but once again that's introducing weird arbitrary rules.

Really what you'd want is a `UnitArray`, I guess, which is effectively just a validity bitmap, but there is no such thing.
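To illustrate why the outer `ListArray` is still needed: in Arrow's columnar format, a `NullArray` is just a length with no validity bitmap and no values, so per-row presence has to come from somewhere else, namely the list offsets of the wrapping `ListArray`. A plain-Python model of that encoding (not Arrow itself):

```python
# Illustrative model of a ListArray<NullArray>'s offsets buffer (plain
# Python, not actual Arrow). Row i spans offsets[i]..offsets[i+1] in the
# child NullArray; a non-empty slice means "this row carries a clear".
def rows_with_clears(offsets):
    return [i for i in range(len(offsets) - 1) if offsets[i + 1] > offsets[i]]

# 4 rows; rows 0 and 2 carry a clear, rows 1 and 3 don't.
offsets = [0, 1, 1, 2, 2]
print(rows_with_clears(offsets))  # -> [0, 2]
```

The offsets buffer is effectively playing the role of the missing validity bitmap, which is the hypothetical `UnitArray` above would make explicit, at a fraction of the size.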