rerun-io / rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
https://rerun.io/
Apache License 2.0

GC livelocks when temporal data is exhausted #7369

Closed: teh-cmc closed this issue 1 month ago

teh-cmc commented 2 months ago

We're in possession of a recording that triggers this behavior: [screenshot]

The time cursor is paused. Data is not being ingested. This is therefore a bug: the GC is constantly trying to clean up to free space, but somehow it's not succeeding.

I haven't been able to reproduce this with a simplified script yet.


How to reproduce with the recording in question:

https://github.com/user-attachments/assets/807fdf3a-1a56-4c43-a53f-17f301a3c33c

teh-cmc commented 2 months ago

I'm 99% sure this is just an edge case I missed in: [linked PR]

I'll be back.

teh-cmc commented 2 months ago

I think I know what's going on. I was looking in the wrong place: this is unrelated to https://github.com/rerun-io/rerun/pull/7182.

The problem stems from a nasty interaction between static data, latest-at query-time caching, and garbage collection.

Look at the following store stats after the recording has just loaded:

Without --memory-limit: [screenshot]

With --memory-limit 300MiB: [screenshot]

By the time the GC enters the livelock, the stats will look like this: [screenshot]

What's happening here is that all the temporal data that could legally be removed has already been removed. The 39 temporal chunks left in the stats can never go away, because the GC runs with protect_latest=1: we're already at the point where all that's left is one temporal chunk per (entity, timeline, component). The livelock has started: we can run the GC all we want, but from this point onwards it will never be able to remove anything.
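To make the livelock concrete, here is a minimal sketch of that protect_latest behaviour (all types and names below are made up for illustration, not the actual ChunkStore code): once only one chunk per (entity, timeline, component) remains, the GC structurally cannot free another byte, no matter how many times it runs.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the store's indexing; not Rerun's real types.
type Key = (String /* entity */, String /* timeline */, String /* component */);

struct Chunk {
    num_bytes: u64,
}

/// Drop the oldest chunks until `bytes_to_free` has been reclaimed, but always
/// keep the most recent chunk per key (i.e. `protect_latest = 1`).
fn gc_protect_latest(store: &mut BTreeMap<Key, Vec<Chunk>>, mut bytes_to_free: i64) -> u64 {
    let mut freed = 0;
    for chunks in store.values_mut() {
        // Remove from the front (oldest first), but never touch the last chunk.
        while chunks.len() > 1 && bytes_to_free > 0 {
            let chunk = chunks.remove(0);
            freed += chunk.num_bytes;
            bytes_to_free -= chunk.num_bytes as i64;
        }
    }
    freed
}

fn main() {
    let key = ("camera".to_owned(), "log_time".to_owned(), "Image".to_owned());
    let mut store = BTreeMap::from([(key, vec![Chunk { num_bytes: 1024 }])]);

    // Only one chunk left per key: the GC cannot free anything, however often it
    // runs and however much we ask it to reclaim -- the livelock described above.
    assert_eq!(gc_protect_latest(&mut store, 300 * 1024 * 1024), 0);
}
```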

But if we only have ~40MiB of data in the store, why are we hitting the 300MiB limit and running the GC in the first place?

The problem is that the latest-at cache is organized by query time, as opposed to data time. That is why this cache grows so much when you scrub the time cursor around: even though we are only caching references to chunks, with this many query timestamps to keep track of, all these btrees still take a lot of space on their own. The reason we do this is performance: the viewer runs a number of latest-at queries per frame that grows linearly with the number of entities in the recording, so every microsecond we can shave off the latest-at path matters; it all compounds very quickly by the end of the frame.
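To show what "organized by query time" means in practice, here is a made-up sketch (not the real cache types): every distinct cursor position gets its own cache entry, so scrubbing the timeline grows the btree with the number of query timestamps, even though every entry references the same chunk.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Hypothetical stand-in; not Rerun's real chunk type.
struct Chunk {}

/// A latest-at cache keyed by *query* time rather than data time.
#[derive(Default)]
struct LatestAtCache {
    per_query_time: BTreeMap<i64, Arc<Chunk>>,
}

impl LatestAtCache {
    /// Return the cached result for this query time, resolving and caching it on a miss.
    fn latest_at(&mut self, query_time: i64, resolve: impl FnOnce() -> Arc<Chunk>) -> Arc<Chunk> {
        self.per_query_time
            .entry(query_time)
            .or_insert_with(resolve)
            .clone()
    }
}

fn main() {
    let chunk = Arc::new(Chunk {});
    let mut cache = LatestAtCache::default();

    // Scrubbing the time cursor issues latest-at queries at many distinct timestamps...
    for query_time in 0..1_000_000 {
        cache.latest_at(query_time, || chunk.clone());
    }

    // ...and the cache grows with the number of *query* timestamps, even though
    // every entry is just a reference to the same underlying chunk.
    assert_eq!(cache.per_query_time.len(), 1_000_000);
}
```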

Now here's where this gets ugly: if there's nothing to GC, then no ChunkStoreEvents are sent, and thus no cache invalidation happens. The cache therefore grows indefinitely while the GC desperately attempts to keep things under control: livelock.

The fix is to garbage collect the query cache directly when such a situation happens.
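Roughly the shape that fix takes, as a hedged sketch (all names hypothetical, not the actual Rerun API): if the chunk-store GC couldn't reclaim anything, no ChunkStoreEvents will ever invalidate the caches, so fall back to collecting the query caches directly.

```rust
// Hypothetical stand-ins; not Rerun's real types or API.
struct Store {}
struct QueryCaches {}

impl Store {
    /// Returns how many bytes the chunk-store GC actually managed to reclaim.
    /// With `protect_latest = 1`, this can legitimately return 0 forever.
    fn gc(&mut self, _target_bytes: u64) -> u64 {
        0 // placeholder for the real garbage collection
    }
}

impl QueryCaches {
    /// Drop the per-query-time cache entries wholesale.
    fn purge(&mut self) {}
}

fn enforce_memory_limit(store: &mut Store, caches: &mut QueryCaches, target_bytes: u64) {
    let freed = store.gc(target_bytes);
    // No bytes freed means no ChunkStoreEvents were emitted, hence no cache
    // invalidation will ever happen: collect the query caches directly instead.
    if freed == 0 {
        caches.purge();
    }
}

fn main() {
    let mut store = Store {};
    let mut caches = QueryCaches {};
    enforce_memory_limit(&mut store, &mut caches, 300 * 1024 * 1024);
}
```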

ricpruss commented 1 month ago

Can we reopen? I am still seeing this behaviour: rerun does not maintain a memory limit with the latest development build from last night. I am happy to provide a recording for this one if the old recording does not show it.

[video attachment: GCLock]

teh-cmc commented 1 month ago

I haven't been able to repro this using either main or the latest development build, @ricpruss (that is, pip install --pre --no-index -f https://build.rerun.io/commit/de49fce/wheels --upgrade rerun-sdk at the time of writing).

I'm using the second RRD file you've provided (memory-too-much.rrd) with a 300MiB memory limit, and putting infinite pressure on the query cache by repeatedly looping the time cursor over the entire timeline: [screenshot]

teh-cmc commented 1 month ago

Just to make sure, I tried the above on 0.18.2 one more time, and sure enough it very rapidly descends into madness: [screenshot]

teh-cmc commented 1 month ago

Confirmed fixed, see https://github.com/rerun-io/rerun/pull/7370#issuecomment-2372629935