rerun-io / rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
https://rerun.io/

Project: Performant visualization of scenes with large number of entities #8233

Open teh-cmc opened 6 days ago

teh-cmc commented 6 days ago

Context

We want the viewer to be able to scale to scenes with large numbers of entities. This of course means visualizing these scenes, but also ingesting them in the first place.

This is blocked on a number of specific implementation issues, but broadly put: the work the viewer has to do to lay out a scene more often than not grows linearly with the number of entities present in the entire dataset.

There are only two ways to combat this:

1. Reduce the number of entities the viewer has to consider in the first place (i.e. only visualize a subset of the scene).
2. Make the work required per entity cheap enough (cached, amortized) that visualizing all of them stays fast.

Of course in many cases, option 1 isn't even an option: if the user wants to visualize all entities in the scene, then somehow we have to make that fast.

Incremental caching of aggregated data (which is what the visualizers work with) is very hard, but will be a must in order to reach our performance goals.
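To make the idea concrete, here is a minimal, hypothetical sketch (made-up types and keys, not the actual rerun APIs) of caching aggregated, visualizer-ready data per query, with incremental invalidation when an entity receives new data:

```rust
use std::collections::HashMap;

/// Hypothetical cache key: which entity is being queried, on which timeline, at which time.
#[derive(Clone, PartialEq, Eq, Hash)]
struct AggregateKey {
    entity_path: String,
    timeline: String,
    time: i64,
}

/// Hypothetical aggregated result, i.e. the zipped/splatted data a visualizer consumes directly.
struct AggregatedData {
    positions: Vec<[f32; 3]>,
    colors: Vec<[u8; 4]>,
}

#[derive(Default)]
struct AggregateCache {
    entries: HashMap<AggregateKey, AggregatedData>,
}

impl AggregateCache {
    /// Return the cached aggregate for this query, computing and retaining it if missing.
    fn get_or_compute(
        &mut self,
        key: AggregateKey,
        compute: impl FnOnce() -> AggregatedData,
    ) -> &AggregatedData {
        self.entries.entry(key).or_insert_with(compute)
    }

    /// Incremental invalidation: drop only the entries touched by new data for one entity,
    /// instead of throwing away the whole cache.
    fn invalidate_entity(&mut self, entity_path: &str) {
        self.entries.retain(|key, _| key.entity_path != entity_path);
    }
}

fn main() {
    let mut cache = AggregateCache::default();
    let key = AggregateKey {
        entity_path: "world/points".to_owned(),
        timeline: "frame".to_owned(),
        time: 42,
    };

    // First frame: pay the aggregation cost once.
    let data = cache.get_or_compute(key.clone(), || AggregatedData {
        positions: vec![[0.0, 0.0, 0.0]],
        colors: vec![[255, 0, 0, 255]],
    });
    assert_eq!(data.positions.len(), 1);
    assert_eq!(data.colors.len(), 1);

    // Subsequent static frames: the compute closure is never re-run.
    cache.get_or_compute(key, || unreachable!("should be served from the cache"));

    // New data arrives for this entity: invalidate just its entries.
    cache.invalidate_entity("world/points");
}
```

The hard part is everything this sketch glosses over: choosing keys that survive range queries, and invalidating precisely (rather than globally) when chunks are added, removed or compacted.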

This issue is not about:

Measurable end goals

Air Traffic example (2h dataset)

TODO(cmc): What should we do about plotting? Is plotting 10k lines on a single plot really an important use case? If so, do we need to bring egui issues into this?

Revy

Revy was infamously bottlenecked by the viewer's performance with many entities (game scenes have a lot of them). This is a good opportunity to revive that project, if we can make it happen.

Relevant material

Writings:

PRs:

Sub-issues

To be replaced with actual sub-issues once/if we have access to them. We can also remove the project-many-entities-perf label at that point.

Areas that need significant improvements

Wherever we aren't already doing something obviously silly, we should strive for a retained/cached approach, in order to be more scalable and more robust against per-frame regressions in trivial-looking (== static frame) scenarios. If this is structurally hard, revisit the structure!
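As a toy illustration of that retained approach (hypothetical types, not the actual viewer code), derived state can be tagged with a generation counter of its inputs, so a static frame serves the retained result instead of redoing O(entities) work every frame:

```rust
/// The expensive derived state we want to keep across frames, plus the generation
/// of the inputs it was computed from.
struct RetainedLayout {
    input_generation: u64,
    visible_entities: Vec<String>,
}

struct Viewer {
    /// Bumped whenever entities are added/removed or the blueprint changes.
    scene_generation: u64,
    all_entities: Vec<String>,
    retained: Option<RetainedLayout>,
}

impl Viewer {
    /// Called every frame. Only does O(entities) work when the scene actually changed.
    fn layout(&mut self) -> &[String] {
        let stale = self
            .retained
            .as_ref()
            .map_or(true, |r| r.input_generation != self.scene_generation);

        if stale {
            // Expensive path: walk every entity (this is what currently happens per frame).
            let visible_entities = self
                .all_entities
                .iter()
                .filter(|e| !e.starts_with("hidden/"))
                .cloned()
                .collect();
            self.retained = Some(RetainedLayout {
                input_generation: self.scene_generation,
                visible_entities,
            });
        }

        // Cheap path on static frames: just hand back the retained result.
        &self.retained.as_ref().expect("just populated").visible_entities
    }
}

fn main() {
    let mut viewer = Viewer {
        scene_generation: 0,
        all_entities: vec!["world/points".to_owned(), "hidden/debug".to_owned()],
        retained: None,
    };

    assert_eq!(viewer.layout().len(), 1); // first frame: computes
    assert_eq!(viewer.layout().len(), 1); // static frame: served from retained state

    viewer.all_entities.push("world/cam".to_owned());
    viewer.scene_generation += 1; // inputs changed -> next frame recomputes
    assert_eq!(viewer.layout().len(), 2);
}
```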

teh-cmc commented 1 day ago

What does performance look like when we only do the querying + chunk processing part? I.e. let's pretend we've managed to optimize out everything that visualizers do after receiving the chunks (range-zipping, color/radii/etc splatting, annotation context, ...).

We need to know this because our upcoming aggregated caching plan still involves running the queries every frame.

I've applied the following patch, which (very broadly) simulates that:

<details>
<summary>Click for diff</summary>

```diff
diff --git a/crates/viewer/re_space_view/src/results_ext.rs b/crates/viewer/re_space_view/src/results_ext.rs
index 55e52be3e00..01d5cae4e17 100644
--- a/crates/viewer/re_space_view/src/results_ext.rs
+++ b/crates/viewer/re_space_view/src/results_ext.rs
@@ -427,7 +427,7 @@ impl<'a> HybridResultsChunkIter<'a> {
     pub fn component(
         &'a self,
     ) -> impl Iterator)> + 'a {
-        self.chunks.iter().flat_map(move |chunk| {
+        self.chunks.iter().filter(|_| false).flat_map(move |chunk| {
             itertools::izip!(
                 chunk.iter_component_indices(&self.timeline, &self.component_name),
                 chunk.iter_component::(),
@@ -441,7 +441,7 @@ impl<'a> HybridResultsChunkIter<'a> {
     pub fn primitive(
         &'a self,
     ) -> impl Iterator + 'a {
-        self.chunks.iter().flat_map(move |chunk| {
+        self.chunks.iter().filter(|_| false).flat_map(move |chunk| {
             itertools::izip!(
                 chunk.iter_component_indices(&self.timeline, &self.component_name),
                 chunk.iter_primitive::(&self.component_name)
@@ -458,7 +458,7 @@ impl<'a> HybridResultsChunkIter<'a> {
     pub fn primitive_array(
         &'a self,
     ) -> impl Iterator + 'a
     where
         [T; N]: bytemuck::Pod,
     {
-        self.chunks.iter().flat_map(move |chunk| {
+        self.chunks.iter().filter(|_| false).flat_map(move |chunk| {
             itertools::izip!(
                 chunk.iter_component_indices(&self.timeline, &self.component_name),
                 chunk.iter_primitive_array::(&self.component_name)
@@ -475,7 +475,7 @@ impl<'a> HybridResultsChunkIter<'a> {
     pub fn primitive_array_list(
         &'a self,
     ) -> impl Iterator + 'a
     where
         [T; N]: bytemuck::Pod,
     {
-        self.chunks.iter().flat_map(move |chunk| {
+        self.chunks.iter().filter(|_| false).flat_map(move |chunk| {
             itertools::izip!(
                 chunk.iter_component_indices(&self.timeline, &self.component_name),
                 chunk.iter_primitive_array_list::(&self.component_name)
@@ -489,7 +489,7 @@ impl<'a> HybridResultsChunkIter<'a> {
     pub fn string(
         &'a self,
     ) -> impl Iterator)> + 'a {
-        self.chunks.iter().flat_map(|chunk| {
+        self.chunks.iter().filter(|_| false).flat_map(|chunk| {
             itertools::izip!(
                 chunk.iter_component_indices(&self.timeline, &self.component_name),
                 chunk.iter_string(&self.component_name)
@@ -503,7 +503,7 @@ impl<'a> HybridResultsChunkIter<'a> {
     pub fn buffer(
         &'a self,
     ) -> impl Iterator>)> + 'a {
-        self.chunks.iter().flat_map(|chunk| {
+        self.chunks.iter().filter(|_| false).flat_map(|chunk| {
             itertools::izip!(
                 chunk.iter_component_indices(&self.timeline, &self.component_name),
                 chunk.iter_buffer(&self.component_name)
diff --git a/crates/viewer/re_space_view_spatial/src/visualizers/utilities/entity_iterator.rs b/crates/viewer/re_space_view_spatial/src/visualizers/utilities/entity_iterator.rs
index 0e535138677..3ba2cd3a26f 100644
--- a/crates/viewer/re_space_view_spatial/src/visualizers/utilities/entity_iterator.rs
+++ b/crates/viewer/re_space_view_spatial/src/visualizers/utilities/entity_iterator.rs
@@ -141,7 +141,7 @@ pub fn iter_component<'a, C: re_types::Component>(
     timeline: Timeline,
     component_name: ComponentName,
 ) -> impl Iterator)> + 'a {
-    chunks.iter().flat_map(move |chunk| {
+    chunks.iter().filter(|_| false).flat_map(move |chunk| {
         itertools::izip!(
             chunk.iter_component_indices(&timeline, &component_name),
             chunk.iter_component::()
@@ -158,7 +158,7 @@ pub fn iter_primitive<'a, T: arrow2::types::NativeType>(
     timeline: Timeline,
     component_name: ComponentName,
 ) -> impl Iterator + 'a {
-    chunks.iter().flat_map(move |chunk| {
+    chunks.iter().filter(|_| false).flat_map(move |chunk| {
         itertools::izip!(
             chunk.iter_component_indices(&timeline, &component_name),
             chunk.iter_primitive::(&component_name)
@@ -178,7 +178,7 @@ pub fn iter_primitive_array<'a, const N: usize, T: arrow2::types::NativeType>(
 where
     [T; N]: bytemuck::Pod,
 {
-    chunks.iter().flat_map(move |chunk| {
+    chunks.iter().filter(|_| false).flat_map(move |chunk| {
         itertools::izip!(
             chunk.iter_component_indices(&timeline, &component_name),
             chunk.iter_primitive_array::(&component_name)
@@ -198,7 +198,7 @@ pub fn iter_primitive_array_list<'a, const N: usize, T: arrow2::types::NativeTyp
 where
     [T; N]: bytemuck::Pod,
 {
-    chunks.iter().flat_map(move |chunk| {
+    chunks.iter().filter(|_| false).flat_map(move |chunk| {
         itertools::izip!(
             chunk.iter_component_indices(&timeline, &component_name),
             chunk.iter_primitive_array_list::(&component_name)
@@ -215,7 +215,7 @@ pub fn iter_string<'a>(
     timeline: Timeline,
     component_name: ComponentName,
 ) -> impl Iterator)> + 'a {
-    chunks.iter().flat_map(move |chunk| {
+    chunks.iter().filter(|_| false).flat_map(move |chunk| {
         itertools::izip!(
             chunk.iter_component_indices(&timeline, &component_name),
             chunk.iter_string(&component_name)
@@ -232,7 +232,7 @@ pub fn iter_buffer<'a, T: arrow::datatypes::ArrowNativeType + arrow2::types::Nat
     timeline: Timeline,
     component_name: ComponentName,
 ) -> impl Iterator>)> + 'a {
-    chunks.iter().flat_map(move |chunk| {
+    chunks.iter().filter(|_| false).flat_map(move |chunk| {
         itertools::izip!(
             chunk.iter_component_indices(&timeline, &component_name),
             chunk.iter_buffer(&component_name)
```

</details>
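For intuition, here is a self-contained sketch (plain `Vec`s standing in for the real chunk types) of what that `.filter(|_| false)` does to the patched iterators: the chunks are still produced by the query and gathered, but the per-chunk `flat_map` body that iterates and zips component data never executes, which is what lets the measurements below isolate the querying + chunk processing part:

```rust
fn main() {
    // Stand-ins for the chunks returned by a query; only the counting matters here.
    let chunks = vec![vec![1u32, 2, 3], vec![4, 5]];

    let mut per_chunk_work = 0;

    // Same shape as the patched iterators: `filter(|_| false)` drops every chunk
    // before `flat_map`, so the per-chunk processing closure never runs, while the
    // `chunks` collection itself (the query result) was still produced.
    let processed: Vec<u32> = chunks
        .iter()
        .filter(|_| false)
        .flat_map(|chunk| {
            per_chunk_work += 1; // would be component iteration / zipping in the viewer
            chunk.iter().copied()
        })
        .collect();

    assert!(processed.is_empty());
    assert_eq!(per_chunk_work, 0);
}
```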

First, let's look at unmodified `main` (on my machine with a discrete GPU, i.e. hard mode):

- `main`, latest-at, without plot: (screenshot)

- `main`, latest-at, with plot: (screenshot)

- `main`, infinite range for 3D view / latest-at for the rest, without plot: (screenshot)

- `main`, infinite range for 3D view / latest-at for the rest, with plot: (screenshot)


Now here's where it gets interesting: consider what happens once the visualizers are stripped down to just the querying + chunk processing part:

- `main`, latest-at, without plot -- Chunk processing only: (screenshots) (NOTE: the specific values in the flamegraph are inflated due to probing overhead).

- `main`, infinite range for 3D view / latest-at for the rest, without plot -- Chunk processing only: (screenshot)


It looks like we can definitely afford to run the queries every frame, as long as we manage to make aggregated caching work.