rerun-io / rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
https://rerun.io/
Apache License 2.0

GC livelocks when temporal data is exhausted #7369

Closed: teh-cmc closed this issue 1 month ago

teh-cmc commented 2 months ago

We're in possession of a recording that triggers this behavior: [screenshot]

The time cursor is paused. Data is not being ingested. This is therefore a bug: the GC is constantly trying to clean up to free space, but somehow it's not succeeding.

I haven't been able to reproduce this with a simplified script yet.


How to reproduce with the recording in question:

https://github.com/user-attachments/assets/807fdf3a-1a56-4c43-a53f-17f301a3c33c

teh-cmc commented 2 months ago

I'm 99% sure this is just an edge case I missed in: [linked PR]

I'll be back.

teh-cmc commented 2 months ago

I think I know what's going on. I was looking in the wrong place: this is unrelated to https://github.com/rerun-io/rerun/pull/7182.

The problem stems from a nasty interaction between static data, latest-at query-time caching, and garbage collection.

Look at the following store stats after the recording has just loaded:

Without --memory-limit: [screenshot]

With --memory-limit 300MiB: [screenshot]

By the time the GC enters the livelock, the stats will look like this: [screenshot]

What's happening here is that all the temporal data that could legally be removed has already been removed. The 39 temporal chunks left in the stats can never go away, because the GC runs with protect_latest=1: we're already at the point where all that's left is one temporal chunk per (entity, timeline, component). The livelock has started: we can run the GC all we want, but from this point onwards it will never be able to remove anything.
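To make the livelock concrete, here is a minimal sketch of that protect_latest behaviour (all types and names below are made up for illustration, not the actual ChunkStore code): once only one chunk per (entity, timeline, component) remains, the GC structurally cannot free another byte, no matter how many times it runs.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the store's indexing; not Rerun's real types.
type Key = (String /* entity */, String /* timeline */, String /* component */);

struct Chunk {
    num_bytes: u64,
}

/// Drop the oldest chunks until `bytes_to_free` has been reclaimed, but always
/// keep the most recent chunk per key (i.e. `protect_latest = 1`).
fn gc_protect_latest(store: &mut BTreeMap<Key, Vec<Chunk>>, mut bytes_to_free: i64) -> u64 {
    let mut freed = 0;
    for chunks in store.values_mut() {
        // Remove from the front (oldest first), but never touch the last chunk.
        while chunks.len() > 1 && bytes_to_free > 0 {
            let chunk = chunks.remove(0);
            freed += chunk.num_bytes;
            bytes_to_free -= chunk.num_bytes as i64;
        }
    }
    freed
}

fn main() {
    let key = ("camera".to_owned(), "log_time".to_owned(), "Image".to_owned());
    let mut store = BTreeMap::from([(key, vec![Chunk { num_bytes: 1024 }])]);

    // Only one chunk left per key: the GC cannot free anything, however often it
    // runs and however much we ask it to reclaim -- the livelock described above.
    assert_eq!(gc_protect_latest(&mut store, 300 * 1024 * 1024), 0);
}
```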

But if we only have ~40MiB of data in the store, why are we hitting the 300MiB limit and running the GC in the first place?

The problem is that the latest-at cache is organized by query time, as opposed to data time. That is why this cache grows so much when you scrub the time cursor around: even though we are only caching references to chunks, with this many query timestamps to keep track of, all these btrees still take a lot of space on their own. The reason we do this is performance: the viewer runs a number of latest-at queries per frame that grows linearly with the number of entities in the recording, so every microsecond we can shave off the latest-at path matters; it all compounds very quickly by the end of the frame.
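To show what "organized by query time" means in practice, here is a made-up sketch (not the real cache types): every distinct cursor position gets its own cache entry, so scrubbing the timeline grows the btree with the number of query timestamps, even though every entry references the same chunk.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

// Hypothetical stand-in; not Rerun's real chunk type.
struct Chunk {}

/// A latest-at cache keyed by *query* time rather than data time.
#[derive(Default)]
struct LatestAtCache {
    per_query_time: BTreeMap<i64, Arc<Chunk>>,
}

impl LatestAtCache {
    /// Return the cached result for this query time, resolving and caching it on a miss.
    fn latest_at(&mut self, query_time: i64, resolve: impl FnOnce() -> Arc<Chunk>) -> Arc<Chunk> {
        self.per_query_time
            .entry(query_time)
            .or_insert_with(resolve)
            .clone()
    }
}

fn main() {
    let chunk = Arc::new(Chunk {});
    let mut cache = LatestAtCache::default();

    // Scrubbing the time cursor issues latest-at queries at many distinct timestamps...
    for query_time in 0..1_000_000 {
        cache.latest_at(query_time, || chunk.clone());
    }

    // ...and the cache grows with the number of *query* timestamps, even though
    // every entry is just a reference to the same underlying chunk.
    assert_eq!(cache.per_query_time.len(), 1_000_000);
}
```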

Now here's where this gets ugly: if there's nothing to GC, then no ChunkStoreEvents are sent, and thus no cache invalidation happens. The cache therefore grows indefinitely while the GC desperately attempts to keep things under control: livelock.

The fix is to garbage collect the query cache directly when such a situation happens.
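Roughly the shape that fix takes, as a hedged sketch (all names hypothetical, not the actual Rerun API): if the chunk-store GC couldn't reclaim anything, no ChunkStoreEvents will ever invalidate the caches, so fall back to collecting the query caches directly.

```rust
// Hypothetical stand-ins; not Rerun's real types or API.
struct Store {}
struct QueryCaches {}

impl Store {
    /// Returns how many bytes the chunk-store GC actually managed to reclaim.
    /// With `protect_latest = 1`, this can legitimately return 0 forever.
    fn gc(&mut self, _target_bytes: u64) -> u64 {
        0 // placeholder for the real garbage collection
    }
}

impl QueryCaches {
    /// Drop the per-query-time cache entries wholesale.
    fn purge(&mut self) {}
}

fn enforce_memory_limit(store: &mut Store, caches: &mut QueryCaches, target_bytes: u64) {
    let freed = store.gc(target_bytes);
    // No bytes freed means no ChunkStoreEvents were emitted, hence no cache
    // invalidation will ever happen: collect the query caches directly instead.
    if freed == 0 {
        caches.purge();
    }
}

fn main() {
    let mut store = Store {};
    let mut caches = QueryCaches {};
    enforce_memory_limit(&mut store, &mut caches, 300 * 1024 * 1024);
}
```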

ricpruss commented 1 month ago

Can we reopen? I am still seeing this behaviour: rerun does not maintain a memory limit with the latest development build from last night. I am happy to provide a recording for this one if the old recording does not show it.

[video attachment: GCLock]

teh-cmc commented 1 month ago

I haven't been able to repro this using either main or the latest development build, @ricpruss (that is, pip install --pre --no-index -f https://build.rerun.io/commit/de49fce/wheels --upgrade rerun-sdk at the time of writing).

I'm using the second RRD file you've provided (memory-too-much.rrd) with a 300MiB memory limit, and putting infinite pressure on the query cache by repeatedly looping the time cursor over the entire timeline: [screenshot]

teh-cmc commented 1 month ago

Just to make sure, I tried the above on 0.18.2 one more time, and sure enough it very rapidly descends into madness: [screenshot]

teh-cmc commented 1 month ago

Confirmed fixed, see https://github.com/rerun-io/rerun/pull/7370#issuecomment-2372629935