Open teh-cmc opened 1 year ago
Really love this suggestion! It could really be a great way of enabling recordings that are much bigger than RAM and it also seems quite extensible. One thought / question:
Would it make sense to make the `Reference` optional rather than making the `Component` a union? That way, the user could choose to send the data and its URI (for recreating it) together, avoiding the roundtrip of SDK-ref->viewer-ref->URI-data->viewer. When we have the `Reference` for a component, we know we can GC the `Data` whenever it makes sense, because we can recreate it later.
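To make the idea concrete, here is a minimal sketch of a component that carries both inline data and an optional reference. All type and field names (`Component`, `Reference`, `can_gc`, `gc`) are made up for illustration and are not existing Rerun APIs; the only rule encoded is the one described above: inline data may be dropped only when a reference exists to recreate it.

```rust
// Hypothetical types for illustration; none of these are actual Rerun APIs.
#[derive(Debug, Clone, PartialEq)]
struct Reference {
    uri: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Component {
    /// Inline payload; `None` once it has been garbage-collected.
    data: Option<Vec<u8>>,
    /// Where the payload can be fetched again, if anywhere.
    reference: Option<Reference>,
}

impl Component {
    /// Inline data may be dropped only if a reference can recreate it.
    fn can_gc(&self) -> bool {
        self.data.is_some() && self.reference.is_some()
    }

    /// Drops the inline data when it is recreatable; returns the bytes freed.
    fn gc(&mut self) -> usize {
        if !self.can_gc() {
            return 0;
        }
        self.data.take().map_or(0, |d| d.len())
    }
}

fn main() {
    let mut both = Component {
        data: Some(vec![0u8; 1024]),
        reference: Some(Reference { uri: "file:///tmp/image.png".to_owned() }),
    };
    let mut inline_only = Component { data: Some(vec![0u8; 16]), reference: None };

    println!("freed {} bytes", both.gc()); // recreatable, so the data is dropped
    println!("freed {} bytes", inline_only.gc()); // no reference, so nothing is dropped
}
```

With this shape, sending data and URI together costs one extra string per component, and the GC never has to guess whether a drop is lossy.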
An example use case would be:
We've started calling these external data references promises. They contain some data (e.g. a URI) that lets some plugin find the data for the viewer. A simple promise could be a file name; a more complicated one could be an S3 bucket with login credentials.
Promises would be a great solution for the case of Big Data, Small Index. By "Index" we mean "What data was logged when". For instance, when logging a lot of big images, the data grows quickly, but the index stays small. In comparison, when logging millions of scalars per second, both the index and data volumes grow, and a promise would not help at all.
The `Promise` could either be a datatype or a component.
`Promise` as a datatype
If `Promise` is a datatype, then the high-level index looks the same as if you didn't use a `Promise`. This means the streams panel looks the same, for instance, and all heuristics would work as expected. The `Promise` would be resolved early, so that the visualizers would just see the resolved data and be unaware that it was backed by a `Promise` rather than inline Arrow data. This would work well for cases where a single component can be huge (e.g. a `TensorData` datatype).
There should also be some way to replace a whole component array with a single promise, e.g. replace all the positions in a point cloud with a single `Datatype` (not quite a splat, but similar!).
`Promise` as a component
If `Promise` is a component, it could represent a whole entity. It should probably contain the names of the components it will resolve to.
This will produce a very different index for the user. For instance, the `Promise` component would probably show up in the streams panel.
`Promise` as component?
Thinking about this a bit more, there are a lot of parallels with something else we've been discussing: entity links. If a reference is a component that contains a URI plus a list of expected components, it could work as an internal reference as well. If the URI is just an entity path, then get the listed components from that entity; if it's an https URL, then query that URL with the list of components as parameters; and so on.
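A rough sketch of that dispatch, under some loud assumptions: anything without a `scheme://` prefix is treated as an entity path, and the query parameter name `components` is invented for the example. The `Reference` and `Target` types are hypothetical, and no URL encoding or handling of pre-existing query strings is attempted.

```rust
// Hypothetical sketch: a reference component holding a URI plus expected components.
#[derive(Debug, PartialEq)]
enum Target<'a> {
    /// No scheme: treat the URI as an entity path inside the same recording.
    EntityPath(&'a str),
    /// `http://` or `https://`: fetch over the network.
    Http(&'a str),
    /// Any other scheme would be handed off to a plugin.
    Other { scheme: &'a str },
}

struct Reference {
    uri: String,
    components: Vec<String>,
}

impl Reference {
    /// Classifies the URI by its scheme (assumption: no scheme means entity path).
    fn target(&self) -> Target<'_> {
        match self.uri.split_once("://") {
            None => Target::EntityPath(&self.uri),
            Some(("http" | "https", _)) => Target::Http(&self.uri),
            Some((scheme, _)) => Target::Other { scheme },
        }
    }

    /// For HTTP targets, passes the component list as a query parameter.
    /// The `components` parameter name is made up for this example.
    fn request_url(&self) -> Option<String> {
        match self.target() {
            Target::Http(url) => Some(format!("{}?components={}", url, self.components.join(","))),
            _ => None,
        }
    }
}

fn main() {
    let internal = Reference {
        uri: "world/camera/image".to_owned(),
        components: vec!["Tensor".to_owned()],
    };
    let external = Reference {
        uri: "https://example.com/frames/42".to_owned(),
        components: vec!["Position3D".to_owned(), "Color".to_owned()],
    };
    println!("{:?}", internal.target());
    println!("{:?}", external.request_url());
}
```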
The operation of moving data out to a separate blob storage could then consist of adding a reference with a uri to the right place in the external blob store and a list of the components on the entity. We could then have a new GC step that starts by looking at all references, and drops data that can be recreated/fetched first. I'm not sure about the details but maybe this actually shifts it from a separate blob storage to a separate row storage?
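The GC step could look something like the sketch below. The `Row` model, the `recreatable` flag, and the "oldest first" tie-break are all assumptions made for illustration; the only idea taken from the comment above is that rows which can be refetched are dropped before rows that cannot.

```rust
// Hypothetical row model; sizes and field names are illustrative only.
#[derive(Debug, Clone)]
struct Row {
    id: u32,
    num_bytes: usize,
    /// True if a reference exists from which this row's data can be refetched.
    recreatable: bool,
}

/// Picks rows to drop until at least `target_bytes` would be freed,
/// preferring recreatable rows, then the lowest id (assumed oldest) first.
fn pick_victims(rows: &[Row], target_bytes: usize) -> Vec<u32> {
    let mut order: Vec<&Row> = rows.iter().collect();
    // `false < true`, so negating the flag sorts recreatable rows first.
    order.sort_by_key(|r| (!r.recreatable, r.id));

    let mut freed = 0;
    let mut victims = Vec::new();
    for row in order {
        if freed >= target_bytes {
            break;
        }
        freed += row.num_bytes;
        victims.push(row.id);
    }
    victims
}

fn main() {
    let rows = vec![
        Row { id: 1, num_bytes: 100, recreatable: false },
        Row { id: 2, num_bytes: 200, recreatable: true },
        Row { id: 3, num_bytes: 300, recreatable: true },
    ];
    // Rows 2 and 3 are enough to free 400 bytes, so row 1 survives.
    println!("{:?}", pick_victims(&rows, 400));
}
```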
Related to #5247
Rationale
There are many instances where logging data in our datastore is inefficient at best, or simply unmanageable at worst. (And sometimes it is just nonsensical: why relog something that is already stored somewhere else?)
A clear instance of this is with video data: logging every frame of a video separately is extremely inefficient memory-wise. This inefficiency stems from the loss of compression benefits between frames, such as those from run encodings. When users try to log multiple 4K streams to Rerun, the task quickly becomes overwhelming.
Instead of this cumbersome process, it would be far more practical for users to point to data stored elsewhere, be it on a local disk, in cloud storage, on a file server, or on any other storage medium.
Strategy
One approach is to make it so any given Component can serve as either actual data (as it functions now) or as a URI pointing to that data; roughly:
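As a hedged illustration of that union (the original snippet is not available here, so this is a stand-in and not the proposed definition), a component could be modeled as an enum with a data variant and a URI variant. `Component`, `ImageData`, and the example `file://` URI are all hypothetical.

```rust
// Hypothetical shape only; a real definition would depend on the component's datatype.
#[derive(Debug, Clone, PartialEq)]
enum Component<T> {
    /// The data itself, as logged today.
    Data(T),
    /// A URI from which the data can be fetched on demand.
    Uri(String),
}

impl<T> Component<T> {
    /// True if the component points at external data instead of holding it.
    fn is_reference(&self) -> bool {
        matches!(self, Component::Uri(_))
    }
}

/// Stand-in for decoded pixels.
#[derive(Debug, Clone, PartialEq)]
struct ImageData {
    width: u32,
    height: u32,
    rgb: Vec<u8>,
}

fn main() {
    let inline = Component::Data(ImageData { width: 2, height: 1, rgb: vec![0; 6] });
    // Assumed example URI; the scheme and path are placeholders.
    let by_ref: Component<ImageData> = Component::Uri("file:///data/myvideo.mp4".to_owned());

    println!("inline is reference: {}", inline.is_reference());
    println!("by_ref is reference: {}", by_ref.is_reference());
}
```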
To put this into perspective with a concrete example, consider video frames:
While this approach effectively addresses the size and inefficiency concerns, it does introduce a new challenge: the need for prefetching and buffering. Of course, scrubbing randomly will now incur delays (at least for this specific component of this specific entity), which isn't any different from any video player on the web!
Opportunities for extension hooks
Like other subsystems in Rerun, there are opportunities for plugins to customize the behavior of references:
Add support for new URI schemes (`git://`, `http://`, ...) or replace existing ones with a custom implementation (e.g. custom logic to fetch an exact frame).
Moreover, any data that is referred to by URIs needs to be a known Rerun datatype, i.e. data-format plugins are again very relevant here.
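One possible shape for such a hook is a registry keyed by URI scheme, where registering a resolver for an already-known scheme replaces the existing one. The `UriResolver` trait, `ResolverRegistry`, and the toy `Fixed` resolver below are invented for this sketch and do not correspond to any existing Rerun plugin API.

```rust
use std::collections::HashMap;

/// Hypothetical plugin interface: turn a URI into raw bytes.
trait UriResolver {
    fn resolve(&self, uri: &str) -> Result<Vec<u8>, String>;
}

#[derive(Default)]
struct ResolverRegistry {
    by_scheme: HashMap<String, Box<dyn UriResolver>>,
}

impl ResolverRegistry {
    /// Registers a resolver for `scheme`; returns true if it replaced an existing one.
    fn register(&mut self, scheme: &str, resolver: Box<dyn UriResolver>) -> bool {
        self.by_scheme.insert(scheme.to_owned(), resolver).is_some()
    }

    /// Looks up the resolver by the URI's scheme and delegates to it.
    fn resolve(&self, uri: &str) -> Result<Vec<u8>, String> {
        let (scheme, _) = uri
            .split_once("://")
            .ok_or_else(|| format!("no scheme in {:?}", uri))?;
        let resolver = self
            .by_scheme
            .get(scheme)
            .ok_or_else(|| format!("no resolver for {}://", scheme))?;
        resolver.resolve(uri)
    }
}

/// Toy resolver that returns a fixed payload instead of doing any I/O.
struct Fixed(&'static [u8]);

impl UriResolver for Fixed {
    fn resolve(&self, _uri: &str) -> Result<Vec<u8>, String> {
        Ok(self.0.to_vec())
    }
}

fn main() {
    let mut registry = ResolverRegistry::default();
    registry.register("http", Box::new(Fixed(b"default")));
    // A plugin overrides the built-in behavior for the same scheme.
    registry.register("http", Box::new(Fixed(b"custom")));
    println!("{:?}", registry.resolve("http://example.com/frame"));
    println!("{:?}", registry.resolve("git://example.com/repo"));
}
```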
Hotswapping data
A nice side-effect of all of this is that the data behind the URI reference can easily be swapped out for something else.
Imagine e.g. that you're working on some computer vision tool for football, and your blueprint always contains a stadium mesh: not only does using a URI prevent bloating all your blueprints with the data for that mesh, it also makes it possible to update the mesh remotely:
E.g. `Mesh::Uri(http://footballcvanalytics.com/assets/latest/stadium.glb)`, which redirects to `http://footballcvanalytics.com/assets/0.2.1/stadium.glb`... for now!
Packaging
On the surface, using URI references seems to disrupt the convenience of creating self-contained `rrd` archives.
However, it would be entirely feasible to embed external data directly within the `rrd` archive and then reference it using e.g. a new URI protocol like: `rrd://embedded/assets/myvideo.mp4?ts=1648832391`.
This would essentially give us the best of both worlds.
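For what it's worth, parsing such a URI is cheap. The sketch below assumes the `rrd://embedded/` prefix from the example above, treats everything up to `?` as the path inside the archive, and reads `ts` as an unsigned integer; what `ts` actually denotes is not specified here, and `EmbeddedRef` and `parse_embedded` are hypothetical names.

```rust
/// Hypothetical parsed form of an `rrd://embedded/...` URI.
#[derive(Debug, PartialEq)]
struct EmbeddedRef {
    /// Path of the asset inside the archive.
    path: String,
    /// Optional `ts` query parameter, if present and numeric.
    ts: Option<u64>,
}

/// Returns `None` for URIs that are not embedded references or have an empty path.
fn parse_embedded(uri: &str) -> Option<EmbeddedRef> {
    let rest = uri.strip_prefix("rrd://embedded/")?;
    let (path, query) = match rest.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (rest, None),
    };
    if path.is_empty() {
        return None;
    }
    // Look for a `ts=<integer>` pair among the `&`-separated query parameters.
    let ts = query
        .into_iter()
        .flat_map(|q| q.split('&'))
        .find_map(|kv| kv.strip_prefix("ts="))
        .and_then(|v| v.parse::<u64>().ok());
    Some(EmbeddedRef { path: path.to_owned(), ts })
}

fn main() {
    println!("{:?}", parse_embedded("rrd://embedded/assets/myvideo.mp4?ts=1648832391"));
    println!("{:?}", parse_embedded("http://example.com/not-embedded"));
}
```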