Closed by teh-cmc 1 month ago
I'm 99% sure this is just an edge case I missed in:
I'll be back.
I think I know what's going on. I was looking in the wrong place: this is unrelated to https://github.com/rerun-io/rerun/pull/7182.
The problem stems from a nasty interaction between static data, latest-at query-time caching, and garbage collection.
Look at the following store stats after the recording has just loaded:
Without `--memory-limit`:
With `--memory-limit 300MiB`:
By the time the GC enters the livelock, the stats look like this:
What's happening here is that all the temporal data that could legally be removed has already been removed.
The 39 temporal chunks left in the stats can never go away, since the GC runs with `protect_latest=1`: we're already at the point where all that's left is one temporal chunk per (entity, timeline, component).
The livelock has started: from this point onwards we can run the GC all we want, it will never be able to remove anything.
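To make that concrete, here's a minimal sketch (purely illustrative types, not Rerun's actual GC code) of why protecting the latest chunk per key eventually leaves nothing to collect:

```rust
use std::collections::HashMap;

type Key = (String, String, String); // (entity, timeline, component)

struct Chunk {
    key: Key,
    max_data_time: i64,
}

/// Returns the chunks the GC is still allowed to drop when the latest chunk
/// per key is protected (the `protect_latest=1` behavior described above).
fn removable_chunks(chunks: &[Chunk]) -> Vec<&Chunk> {
    // Find the latest data time per key: those chunks are protected.
    let mut latest: HashMap<&Key, i64> = HashMap::new();
    for chunk in chunks {
        let entry = latest.entry(&chunk.key).or_insert(i64::MIN);
        *entry = (*entry).max(chunk.max_data_time);
    }

    chunks
        .iter()
        .filter(|chunk| chunk.max_data_time < latest[&chunk.key])
        .collect()
}

fn main() {
    // Once only one chunk per (entity, timeline, component) remains, nothing
    // is removable anymore: the GC can run forever without freeing a byte.
    let chunks = vec![
        Chunk { key: ("points".into(), "frame".into(), "Position3D".into()), max_data_time: 42 },
        Chunk { key: ("points".into(), "frame".into(), "Color".into()), max_data_time: 42 },
    ];
    assert!(removable_chunks(&chunks).is_empty());
}
```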
But if we only have ~40MiB of data in the store, why are we hitting the 300MiB limit and running the GC in the first place?
The problem is that the latest-at cache is organized by query time, as opposed to data time. That is why the cache grows so much when you scrub the time cursor around: even though we are only caching references to chunks, the btrees themselves still take a lot of space once there are this many query timestamps to keep track of. The reason we organize the cache this way is performance: the viewer runs a number of latest-at queries per frame that grows linearly with the number of entities in the recording, so every microsecond we can shave off the latest-at path matters, and it all compounds very quickly by the end of the frame.
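As a rough illustration (hypothetical types and names, not the real cache implementation), a query-time-keyed latest-at cache ends up with one entry per distinct query timestamp, even when every entry points at the same chunk:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

struct Chunk {
    // ... actual column data would live here ...
}

#[derive(Default)]
struct LatestAtCache {
    /// query time -> cached result (a cheap reference to a chunk).
    per_query_time: BTreeMap<i64, Arc<Chunk>>,
}

impl LatestAtCache {
    fn latest_at(
        &mut self,
        query_time: i64,
        store_lookup: impl Fn(i64) -> Arc<Chunk>,
    ) -> Arc<Chunk> {
        self.per_query_time
            .entry(query_time)
            .or_insert_with(|| store_lookup(query_time))
            .clone()
    }
}

fn main() {
    let chunk = Arc::new(Chunk {});
    let mut cache = LatestAtCache::default();

    // Scrubbing the cursor over a million distinct timestamps: the chunk is
    // shared, but the btree itself now holds a million entries that only a
    // store removal event would ever invalidate.
    for t in 0..1_000_000 {
        let _ = cache.latest_at(t, |_| chunk.clone());
    }
    assert_eq!(cache.per_query_time.len(), 1_000_000);
}
```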
Now here's where this gets ugly: if there's nothing to GC, then no ChunkStoreEvents are sent, and thus no cache invalidation happens.
The cache therefore grows indefinitely, while the GC desperately attempts to keep things under control: livelock.
The fix is to garbage collect the query cache directly when such a situation happens.
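Sketched out under the same assumptions as above (illustrative names, not the actual Rerun APIs), the fix amounts to an explicit fallback when the store GC comes back empty-handed:

```rust
struct ChunkStoreEvent; // placeholder for a store removal event

trait QueryCache {
    fn approx_size_bytes(&self) -> u64;
    /// Drop cached entries directly, without waiting for store events.
    fn purge(&mut self);
    /// Normal path: invalidate entries touched by store removals.
    fn on_store_events(&mut self, events: &[ChunkStoreEvent]);
}

fn enforce_memory_limit(
    limit_bytes: u64,
    store_size_bytes: u64,
    store_gc: impl FnOnce() -> Vec<ChunkStoreEvent>,
    query_cache: &mut impl QueryCache,
) {
    if store_size_bytes + query_cache.approx_size_bytes() <= limit_bytes {
        return; // under budget, nothing to do
    }

    let events = store_gc();
    if events.is_empty() {
        // The store has nothing left to give (only protected latest chunks
        // remain). Without this branch we'd livelock: no events means no
        // invalidation, and the query cache keeps growing.
        query_cache.purge();
    } else {
        query_cache.on_store_events(&events);
    }
}

// --- tiny demo ---

struct DummyCache {
    size: u64,
}

impl QueryCache for DummyCache {
    fn approx_size_bytes(&self) -> u64 {
        self.size
    }
    fn purge(&mut self) {
        self.size = 0;
    }
    fn on_store_events(&mut self, _events: &[ChunkStoreEvent]) {}
}

fn main() {
    let mut cache = DummyCache { size: 400 << 20 }; // ~400MiB of cached query results
    // The store GC finds nothing to drop (only protected chunks are left),
    // so the query cache gets purged directly instead.
    enforce_memory_limit(300 << 20, 40 << 20, || Vec::new(), &mut cache);
    assert_eq!(cache.approx_size_bytes(), 0);
}
```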
Can we reopen? I am still seeing this behaviour where rerun does not maintain a memory limit with the latest development build from last night. I am happy to provide a recording for this one if the old recording does not show it.
I haven't been able to repro using either `main` or the latest development build @ricpruss (that is, `pip install --pre --no-index -f https://build.rerun.io/commit/de49fce/wheels --upgrade rerun-sdk` at the time I'm writing this).
I'm using the second RRD file you've provided (`memory-too-much.rrd`) with a 300MiB memory limit, and putting infinite pressure on the query cache by repeatedly looping the time cursor over the entire timeline:
`rerun --version`? Just to make sure, I tried the above on 0.18.2 one more time, and sure enough it very rapidly descends into madness:
Confirmed fixed, see https://github.com/rerun-io/rerun/pull/7370#issuecomment-2372629935
We're in possession of a recording that triggers this behavior:
The time cursor is paused. Data is not being ingested. This is therefore a bug: the GC is constantly trying to clean up to free space, but somehow it's not succeeding.
I haven't been able to reproduce this with a simplified script yet.
How to reproduce with the recording in question:
```
rerun --memory-limit 300MiB rec.rrd
```
https://github.com/user-attachments/assets/807fdf3a-1a56-4c43-a53f-17f301a3c33c