rerun-io / rerun


Proposal: split `ClearIsRecursive` into `ClearFlat` and `ClearRecursive` #7630

Open teh-cmc opened 1 month ago

teh-cmc commented 1 month ago

Some thoughts on Clears as I'm pulling my hair implementing support for them in the dataframe APIs.

Context

Clears today are implemented with two components:

The potentially recursive nature of Clears makes them both very complex and very costly to resolve. The reason is that they require data-driven queries, as opposed to metadata-driven queries. That's not good. To resolve a Clear, you need to find all Chunks that could potentially (depending on the contents of the cell) contain tombstones relevant for the entity at hand.

Say you have an entity /a/b/c and you want to know whether it has been cleared at some time t:

That's pretty bad already, but it gets even worse in a dataframe context, where all of the above has to happen as part of a larger, complicated streaming join that involves an arbitrary number of entities. It's terrible. It is also probably fair to assume that it gets even worse-er in a disk/network-based storage environment.
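To make the cost concrete, here is a rough sketch of what data-driven resolution looks like. It is not the actual re_query code: the `Chunk` class, the per-row `clear_is_recursive` cells, and the helper names are all simplifications invented for illustration. The point is only that chunk-level metadata can say "this chunk might matter", while the decision itself requires reading every cell.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical, simplified chunk (not the real Rerun type)."""
    entity_path: str
    times: list               # time column, one entry per row
    clear_is_recursive: list  # one cell per row: True, False, or None (no clear on that row)


def ancestors(entity_path):
    """'/a/b/c' -> ['/a', '/a/b'] (strict ancestors; the root is left out for simplicity)."""
    parts = entity_path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def latest_clear_at(chunks, entity_path, t):
    """Latest clear time <= t that affects the entity, or None."""
    parents = set(ancestors(entity_path))
    latest = None
    for chunk in chunks:
        on_self = chunk.entity_path == entity_path
        if not (on_self or chunk.entity_path in parents):
            continue
        # Data-driven part: whether a row matters depends on the cell's value.
        for time, recursive in zip(chunk.times, chunk.clear_is_recursive):
            if recursive is None or time > t:
                continue
            # A clear on an ancestor only applies if that cell says "recursive".
            if on_self or recursive:
                latest = time if latest is None else max(latest, time)
    return latest
```

Whether the entity actually counts as cleared at `t` then depends on comparing that time with the latest data logged for it, which is left out here.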

In practice, the only sane way to do all of the above (especially in a dataframe context) is to fetch all the potentially relevant Chunks into RAM, and then do all the processing on that. But as mentioned at the start, Clears actually take a non-negligible amount of space, so fetching them all into RAM can become a serious issue with real-world datasets. I.e. it is complex, slow and memory-intensive all at once. :+1:

In fact, it's so complex that I've just realized the existing code in re_query is wrong, and nobody's ever noticed.

Proposal

Split ClearIsRecursive into ClearFlat and ClearRecursive. ClearFlat and ClearRecursive are NullArrays (ListArray<NullArray> once chunkified).

This gets rid of the data-driven queries, and demotes them back down to good old metadata-driven queries. In the future, we could probably just use a tag instead of two separate components, since tags are part of the column metadata.

The /a/b/c case study from above just becomes a matter of A) fetching all chunks that contain either a ClearFlat or a ClearRecursive column for /a/b/c, and B) fetching all the chunks that contain a ClearRecursive column for /a and /a/b. Once these chunks are densified (which the query APIs do on your behalf), it just becomes a matter of looking at the time column. That fixes all the complexity issues.
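A minimal sketch of steps A) and B), assuming a made-up `Chunk` model in which the set of column names stands in for chunk-level metadata. None of these names are Rerun APIs; they only illustrate that chunk selection never reads a cell, and that what remains is a scan over the time column.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical, simplified stand-in for a densified chunk (not the real Rerun type)."""
    entity_path: str
    components: frozenset  # column names, i.e. chunk-level metadata
    times: list            # time column, one entry per row


def ancestors(entity_path):
    """'/a/b/c' -> ['/a', '/a/b'] (strict ancestors, as in the case study)."""
    parts = entity_path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def relevant_clear_chunks(chunks, entity_path):
    """Select chunks by column name only; no cell is ever read."""
    parents = set(ancestors(entity_path))
    selected = []
    for chunk in chunks:
        if chunk.entity_path == entity_path:
            # A) flat or recursive clears logged on the entity itself
            if chunk.components & {"ClearFlat", "ClearRecursive"}:
                selected.append(chunk)
        elif chunk.entity_path in parents:
            # B) only recursive clears logged on an ancestor
            if "ClearRecursive" in chunk.components:
                selected.append(chunk)
    return selected


def latest_clear_at(chunks, entity_path, t):
    """Latest clear time <= t affecting the entity, or None."""
    times = [
        time
        for chunk in relevant_clear_chunks(chunks, entity_path)
        for time in chunk.times
        if time <= t
    ]
    return max(times, default=None)
```

Note that a ClearFlat column on /a/b is skipped purely from its name, which is exactly the decision that needs a cell read with a single ClearIsRecursive component.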

That doesn't address the size issues though (it removes the bitmap, which is something I guess, but we still need the outer ListArray).

Impact on public APIs

Very likely none at all? The ClearIsRecursive object goes away, which is a breaking change of course, but in practice we've never really advertised it anywhere: we expose nice helpers instead.

rr.Clear(recursive=False)
rerun::Clear::flat()
rerun::Clear::FLAT

We should be able to keep these helpers working as-is.

Impact on public ABI

This obviously breaks it. It is possible to write an automatic migration tool if we deem it worthy enough.

Should this be part of 0.19?

~TBD~ Probably not

FAQ

Can we just support data-driven queries?

Supporting data-driven queries means supporting data-driven indices. Even if we ignore the giant can of worms in the room, the ChunkStore is fast because it doesn't even look at the chunks' data, let alone index it.

Of course we will support data-driven indices in the data platform, but that's a completely different matter AFAIC.

Should ClearFlat / ClearRecursive basically become ControlColumns so they don't even need the ListArray wrappers?

The issue is that we need some kind of validity bitmap to know which rows have a Clear and which don't. So either it's a ListArray<NullArray> (because a NullArray doesn't have a bitmap), or it could be a BooleanArray, but then you have to explain somewhere that the values() are irrelevant and only the bitmap matters, or something.

You could also say "clears always go in their own chunk", but once again that's introducing weird arbitrary rules.

Really what you'd want is a UnitArray I guess, which is effectively just a validity bitmap, but there is no such thing.
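For illustration only, a toy version of such a UnitArray: a length plus a validity bitmap and nothing else. The class and its methods are hypothetical (no such type exists in Arrow); the only detail borrowed from Arrow is the bit order, one bit per row with the least-significant bit first.

```python
class UnitArray:
    """Hypothetical array type that stores nothing but a validity bitmap."""

    def __init__(self, validity):
        self.length = len(validity)
        # Pack one bit per row, least-significant bit first (the bit order Arrow uses).
        self.bitmap = bytearray((self.length + 7) // 8)
        for row, valid in enumerate(validity):
            if valid:
                self.bitmap[row // 8] |= 1 << (row % 8)

    def is_valid(self, row):
        """True if the row carries a value, i.e. a Clear was logged on that row."""
        if not 0 <= row < self.length:
            raise IndexError(row)
        return bool(self.bitmap[row // 8] & (1 << (row % 8)))


# Nine rows, three of which carry a Clear: the whole column fits in two bytes.
clears = UnitArray([True, False, False, True, False, False, False, False, True])
cleared_rows = [row for row in range(clears.length) if clears.is_valid(row)]
```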

teh-cmc commented 1 month ago

I'm realizing that what constitutes "metadata" in the context of Rerun might not even be that obvious.

So here's my take on it: metadata is anything that can be inspected at the Chunk layer (I guess the correct terminology would be "Chunk-level metadata"):