External data references, promises, larger-than-RAM blob store

teh-cmc commented 1 year ago

Rationale

There are many instances where logging data in our datastore is inefficient at best, or simply unmanageable at worst. (And sometimes just nonsensical: why relog something that is already stored somewhere else?)

A clear instance of this is with video data: logging every frame of a video separately is extremely inefficient memory-wise. This inefficiency stems from the loss of compression benefits between frames, such as those from inter-frame (delta) encoding. When users try to log multiple 4K streams to Rerun, the task quickly becomes overwhelming.

Instead of this cumbersome process, it would be far more practical for users to point to data stored elsewhere, be it on a local disk, in cloud storage, on a file server, or on any other storage medium.

Strategy

One approach is to make it so any given Component can serve as either actual data (as it functions now) or as a URI pointing to that data; roughly:

enum Component<T> {
    Data(T),
    Reference(Uri),
}

To put this into perspective with a concrete example, consider video frames:

/// An image is either the actual image data or a URI that indicates where the image can be found.
enum Image {
    Data(TensorData),

    /// Some examples for clarity:
    /// `http://cloud.storage.com/mybucket/myimage.png`
    /// `ftp://cloud.storage.com/mybucket/myvideo.mp4?frame_id=765398`
    /// `file:///home/Downloads/myvideo.mp4?ts=1648832391`
    External(Uri),
}

While this approach effectively addresses the size and inefficiency concerns, it does introduce a new challenge: the need for prefetching and buffering. Of course, scrubbing randomly will now incur delays (at least for this specific component of this specific entity), which isn't any different from any video player on the web!
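To make the prefetching idea a bit more concrete, here is a minimal sketch of how a viewer might decide which frames to fetch ahead of the playhead. Everything in it is an assumption for illustration: the function name `frames_to_prefetch`, the fixed lookahead window, and the use of plain frame indices are not part of any existing Rerun API.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: returns the frames that should be fetched next,
/// given the playhead position and the set of frames already buffered.
fn frames_to_prefetch(
    playhead: u64,
    lookahead: u64,
    frame_count: u64,
    buffered: &BTreeSet<u64>,
) -> Vec<u64> {
    if frame_count == 0 {
        return Vec::new();
    }
    // The window is [playhead, playhead + lookahead], clamped to the last frame.
    let end = playhead.saturating_add(lookahead).min(frame_count - 1);
    (playhead..=end)
        .filter(|frame| !buffered.contains(frame))
        .collect()
}
```

Random scrubbing simply moves `playhead`, so the first frame after a jump is a cache miss (hence the delay mentioned above) and the window refills from there.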

Opportunities for extension hooks

Like other subsystems in Rerun, there are opportunities for plugins to customize the behavior of references:

- URI Protocols: implement plugins to support new protocols (e.g. git://, http://, ...) or replace existing ones with a custom implementation (e.g. custom logic to fetch an exact frame).
- Prefetching and Buffering Logic: depending on the specific use case, different prefetching and buffering strategies can be developed and implemented.
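One way the protocol hook could look is a registry that maps a URI scheme to a plugin. This is only a sketch under assumptions: the `Resolver` trait, `ResolverRegistry`, and `StaticResolver` are hypothetical names invented here, and the toy resolver returns fixed bytes instead of doing real I/O.

```rust
use std::collections::HashMap;

/// Hypothetical plugin interface: fetch the bytes that a URI points to.
trait Resolver {
    fn resolve(&self, uri: &str) -> Result<Vec<u8>, String>;
}

/// Toy resolver that returns fixed bytes, standing in for real I/O.
struct StaticResolver(Vec<u8>);

impl Resolver for StaticResolver {
    fn resolve(&self, _uri: &str) -> Result<Vec<u8>, String> {
        Ok(self.0.clone())
    }
}

/// Maps a URI scheme (e.g. "http") to the plugin that handles it.
#[derive(Default)]
struct ResolverRegistry {
    by_scheme: HashMap<String, Box<dyn Resolver>>,
}

impl ResolverRegistry {
    /// Registers a plugin. A later registration for the same scheme
    /// replaces the earlier one, which is how a custom implementation
    /// would override a built-in one.
    fn register(&mut self, scheme: &str, resolver: Box<dyn Resolver>) {
        self.by_scheme.insert(scheme.to_ascii_lowercase(), resolver);
    }

    /// Picks the plugin by scheme and delegates the fetch to it.
    fn resolve(&self, uri: &str) -> Result<Vec<u8>, String> {
        let (scheme, _rest) = uri
            .split_once("://")
            .ok_or_else(|| format!("no scheme in `{}`", uri))?;
        let resolver = self
            .by_scheme
            .get(&scheme.to_ascii_lowercase())
            .ok_or_else(|| format!("no resolver registered for `{}://`", scheme))?;
        resolver.resolve(uri)
    }
}
```

A real version would presumably be asynchronous and return typed data rather than raw bytes; the sketch only shows the dispatch-by-scheme and override behavior.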

Moreover, any data that is referred to by URIs needs to be a known Rerun datatype, i.e. data-format plugins are again very relevant here.

Hotswapping data

A nice side-effect of all of this is that the data behind the URI reference can easily be swapped out for something else.

Imagine e.g. that you're working on some computer vision tool for football, and your blueprint always contains a stadium mesh: not only does using a URI prevent bloating all your blueprints with the data for that mesh, it also makes it possible to update the mesh remotely:

E.g. Mesh::Uri(http://footballcvanalytics.com/assets/latest/stadium.glb) which redirects to http://footballcvanalytics.com/assets/0.2.1/stadium.glb... for now!

Packaging

On the surface, using URI references seems to disrupt the convenience of creating self-contained rrd archives.

However, it would be entirely feasible to embed external data directly within the rrd archive and then reference it using e.g. a new URI protocol like: rrd://embedded/assets/myvideo.mp4?ts=1648832391.

This would essentially give us the best of both worlds.
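As a rough illustration of the embedded case, the sketch below splits such a URI into the path inside the archive and an optional `ts` parameter. The `rrd://embedded/` prefix and the `ts` query key come from the example above; the `EmbeddedRef` type and `parse_embedded` function are hypothetical, and no percent-decoding is attempted.

```rust
/// Hypothetical parsed form of `rrd://embedded/<path>?ts=<timestamp>`.
#[derive(Debug, PartialEq)]
struct EmbeddedRef {
    path: String,
    ts: Option<u64>,
}

/// Returns `None` if the URI does not point into the embedded archive.
fn parse_embedded(uri: &str) -> Option<EmbeddedRef> {
    let rest = uri.strip_prefix("rrd://embedded/")?;
    let (path, query) = match rest.split_once('?') {
        Some((path, query)) => (path, Some(query)),
        None => (rest, None),
    };
    if path.is_empty() {
        return None;
    }
    // Look for a `ts=<integer>` pair among the `&`-separated parameters.
    let ts = query.and_then(|q| {
        q.split('&')
            .filter_map(|pair| pair.split_once('='))
            .find(|(key, _)| *key == "ts")
            .and_then(|(_, value)| value.parse::<u64>().ok())
    });
    Some(EmbeddedRef {
        path: path.to_owned(),
        ts,
    })
}
```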

nikolausWest commented 11 months ago

Really love this suggestion! It could really be a great way of enabling recordings that are much bigger than RAM and it also seems quite extensible. One thought / question:

Would it make sense to make the Reference optional rather than making the Component a union? That way, the user could choose to send the data and its URI (for recreating it) together to avoid the roundtrip of SDK-ref->viewer-ref->URI-data->viewer. When we have the Reference for a component, we know we can GC the Data whenever it makes sense, because we can recreate it later.

An example use case would be:

- User has a fast image processing pipeline that processes images at 600Hz.
- They would like to have a live view of what's going on but also be able to scrub back in time and analyze more closely.
- They therefore log the output (for example object detections) of the pipeline at full frame rate.
- The images, by contrast, they write to a local video file (fast) and log:
  - Reference + Data at 30Hz
  - Reference only for all other frames
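A minimal sketch of what "optional Reference next to optional Data" could mean for the GC, assuming a hypothetical `LoggedCell` type (not an existing Rerun type): any cell that carries a reference can have its inline data dropped, since it can be fetched again later.

```rust
/// Hypothetical store cell: inline data, a reference to recreate it, or both.
struct LoggedCell {
    data: Option<Vec<u8>>,
    reference: Option<String>,
}

/// Drops the inline data of every cell that also carries a reference,
/// and returns the number of bytes freed. Cells without a reference are
/// left untouched because their data could not be recreated.
fn gc_recreatable(cells: &mut [LoggedCell]) -> usize {
    let mut freed = 0;
    for cell in cells.iter_mut() {
        if cell.reference.is_some() {
            if let Some(data) = cell.data.take() {
                freed += data.len();
            }
        }
    }
    freed
}
```

In the 600Hz example, the frames logged as Reference + Data at 30Hz would be the ones this pass can shrink; the Reference-only frames never held inline data in the first place.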

emilk commented 10 months ago

We've started calling these external data references promises. They contain some data (e.g. a URI) that lets some plugin find the data for the viewer. A simple promise could be a file name; a more complicated one could be an S3 bucket with login credentials.

Promises would be a great solution for the case of Big Data, Small Index. By "Index" we mean "What data was logged when". For instance, when logging a lot of big images, the data grows quickly, but the index stays small. In comparison, when logging millions of scalars per second, both the index and data volumes grow, and a promise would not help at all.

The Promise could either be a datatype or a component.

`Promise` as a datatype

If Promise is a datatype, then the high-level index looks the same as if you didn't use a Promise. This means the streams panel looks the same, for instance, and all heuristics would work as expected. The Promise would be resolved early, so that the visualizers would just see the resolved data, and be ignorant of the fact it was backed by a Promise rather than inline arrow data. This would work well for cases where a single component can be huge (e.g. a TensorData datatype).
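To illustrate "resolved early", here is a small sketch in which a value is either inline or promised, and gets turned into plain data before anything downstream sees it. The names `MaybePromise`, `fake_fetch`, and `pixel_count` are hypothetical, and the fetcher is a stand-in that recognizes a single made-up URI.

```rust
/// A value that is either stored inline or promised via a URI.
enum MaybePromise<T> {
    Inline(T),
    Promise(String),
}

impl<T> MaybePromise<T> {
    /// Resolves the promise (if any) so that downstream code only sees `T`.
    fn resolve<F>(self, fetch: F) -> Result<T, String>
    where
        F: FnOnce(&str) -> Result<T, String>,
    {
        match self {
            MaybePromise::Inline(value) => Ok(value),
            MaybePromise::Promise(uri) => fetch(&uri),
        }
    }
}

/// Toy fetcher standing in for a real resolver plugin.
fn fake_fetch(uri: &str) -> Result<Vec<u8>, String> {
    if uri == "file:///tmp/image.bin" {
        Ok(vec![7, 7, 7])
    } else {
        Err(format!("cannot fetch `{}`", uri))
    }
}

/// Stand-in for a visualizer: it only accepts resolved data and has no
/// way to tell whether that data was ever backed by a promise.
fn pixel_count(tensor: &[u8]) -> usize {
    tensor.len()
}
```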

There should also be some way to replace a whole component array with a single promise, e.g. replace all the positions in a point cloud with a single Datatype (not quite a splat, but similar!).

`Promise` as a component

If Promise is a component, it could represent a whole entity. It should probably contain the names of the components it will resolve to.

This will produce a very different index for the user. For instance, the Promise component would probably show up in the streams panel.

More here: https://www.notion.so/rerunio/Larger-than-RAM-Seeking-plugins-Promises-and-Resolvers-1dbc3e223d2947db8a8e49cf8773c068?pvs=4

nikolausWest commented 10 months ago

Entity links == `Promise` as component?

Thinking about this a bit more, there are a lot of parallels with something else we've been discussing: entity links. If a reference is a component that contains a URI + a list of expected components, that could work as an internal reference as well. If the URI is just an entity path, then get the listed components from that entity; if it's an https URL, then query that URL with the list of components as parameters; and so on.
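Roughly, the dispatch described above might look like the sketch below. The `Reference` and `Lookup` types, the `plan_lookup` function, and the `?components=a,b` query format are all assumptions made up for illustration, and anything that is not an https URL is treated as an entity path.

```rust
/// Hypothetical reference component: where to look, and which components to expect.
struct Reference {
    uri: String,
    components: Vec<String>,
}

/// What a resolver would have to do for a given reference.
#[derive(Debug, PartialEq)]
enum Lookup {
    /// Read the listed components from another entity in the same store.
    Entity { path: String, components: Vec<String> },
    /// Query a remote URL, passing the component names as parameters.
    Remote { url: String },
}

fn plan_lookup(reference: &Reference) -> Lookup {
    if reference.uri.starts_with("https://") {
        // Assumed query format: `components=a,b,c`, with no percent-encoding.
        let separator = if reference.uri.contains('?') { '&' } else { '?' };
        let url = format!(
            "{}{}components={}",
            reference.uri,
            separator,
            reference.components.join(",")
        );
        Lookup::Remote { url }
    } else {
        Lookup::Entity {
            path: reference.uri.clone(),
            components: reference.components.clone(),
        }
    }
}
```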

The operation of moving data out to a separate blob storage could then consist of adding a reference with a uri to the right place in the external blob store and a list of the components on the entity. We could then have a new GC step that starts by looking at all references, and drops data that can be recreated/fetched first. I'm not sure about the details but maybe this actually shifts it from a separate blob storage to a separate row storage?

bedilbek commented 7 months ago

Related to #5247

rerun-io / rerun

External data references, promises, larger-than-RAM blob store #3119
