Less arrow unions - Githubissues

emilk commented 1 month ago

Arrow unions has downsides:

slow serialization/deserialization (no zero-copy)
hard to codegen (especially for Python)
complex for users that want the raw arrow data

Transform

We can split the Transform union-component into an archetype:

archetype Transform {
    /// if set, all other fields are ignored
    mat4: Option<Mat4>,

    /// ignored if mat4 is set
    translation: Option<Translation3>,

    /// scale, rotation, shear. If set, all other fields are ignored.
    mat3: Option<Mat3>,

    /// ignored if mat4 or mat3 is set
    rotation: Option<Rotation3D>,

    /// if set, the uniform `Scale` is ignored
    scale3: Option<Scale3D>,

    /// uniform scale
    scale: Option<Scale>,
}

teh-cmc commented 1 month ago

Punting on datatype conversions by simplifying types

Image (https://github.com/rerun-io/rerun/issues/6386):


archetype Image {
pixel_buffer: PixelBuffer,
pixel_format: PixelFormat,
resolution: Resolution2D,
stride: Option<PixelStride>,
}

enum PixelFormat { / Image formats / RGBA8_SRGB_22, RG32F, NV12, // ...

/* Depth formats */
F16,
F32,
F32_LINEAR_XXX,
// ...

/* Segmentation formats */
U8,
U16,
// ...

}

archetype ImageEncoded { blob: ImageBlob, media_type: Option, }

archetype DepthImage { depth_buffer: PixelBuffer, depth_format: PixelFormat, depth_meter: DepthMeter, resolution: Resolution2D, stride: Option, }

archetype SegmentationImage { buffer: PixelBuffer, buffer_format: PixelFormat, resolution: Resolution2D, stride: Option, }

component PixelStride { bytes_per_row: u32, bytes_per_plane: Option, }


- Tensor:
We generate _archetypes_ and components for all tensor variants (TensorF32, TensorU8, etc) and make sure they share the same indicator.
```rust
archetype TensorU8 {
    buffer: BufferU8,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferU8 {
    data: [u8],
}

archetype TensorF32 {
    buffer: BufferF32,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferF32 {
    data: [f32],
}

Mesh3D (more specifically, the embedded albedo texture): We stay away from data-oriented entity path references because they are A) effectively promises and B) prevent us from knowing what we're querying ahead of time. Nothing changes about the archetype itself (we just remove the TensorData): logging an Image (or an ImageEncoded) at the same path is now the approved :tm: way of setting up an albedo texture. PRs are welcome for SDK-side helpers to do this.

Transform3D:

// Two possibilities:
// - Only legal to set one of them
// - Or apply them all in deterministic order
archetype Transform {
mat4: Option<Mat4>,
translation: Option<Translation3>,
mat3: Option<Mat3>,
rotation: Option<Rotation3D>,
scale3: Option<Scale3D>,
scale: Option<Scale>,
}

AnnotationContext:

// TODO: Separate the skeleton stuff in its own archetype -- figure it out.
archetype AnnotationContext {
class_ids: Vec<ClassId>,
colors: Vec<Color>,
labels: Vec<Text>,
}

Scalar of different sizes: We generate archetypes for all scalar variants (ScalarF32, ScalarU8, etc) and make sure they share the same indicator. At some point, we actually generate templated types in the target languages, if only for sanity.
```
archetype ScalarU8 {
value: ScalarU8,
}
```

component ScalarU8 { value: u8, }

archetype ScalarF32 { value: ScalarF32, }

component ScalarF32 { value: f32, }



### Conclusion

* No datatype conversions
* No heterogeneous cells and stuff
* Massively improves the raw arrow experience :tm:
* No more guessing pixel formats from tensor shapes (or only at a last resort, at the very least)

## Punting on field accessor DSL by simplifying types

- Killing field selection DSL:
Offer ways to ~"augment" chunks~ derive new chunks from existing chunks, adding arbitrary extra columns in the process.
This can happen at log time (SDK-side) or offline or server-side (ingestion time, fetching time), or whenever.
Doesn't matter: the user gets notified of the chunks, and is free to add any list-arrays of their own making.
Example:
* User logs some structural data `{ velocity: f32, konfidence: bool }`.
* User now wants to plot `velocity`, but isn't able to re-log the data for whatever reason.
* User augments the chunk and/or create a new one with a `velocity` column and just does the struct extraction and copy paste the data in a dedicated column.
Update: either augment a chunk or create a new one.
Update 2: always create new one.

### Conclusion

* More powerful than any DSL that is implementable in mid-term
* "Totally valid workaround for not having a DSL" -- Someone said that, names were redacted
* :+1:

## Other random killings

- data-oriented entity references:
We don't do those -- they are akin to promises and necessitate to inspect the data to know the query plan, which is a no-no at the moment.

- blueprint entity references:
Maybe at some point -- doesn't really matter, it's very orthogonal to everything else.
Far simpler than data-oriented references anyhow.

- Clear:
oh god

jleibs commented 1 month ago

Some additional notes on the above:

Why should Images use an untyped buffer + pixel format while tensors use a typed buffer?

While at first glance this proposal might seem to introduce an inconsistency, in practice it serves to highlight the fundamental differences between these two approaches to data representation.

Images are a way of describing a (possibly multi-channel) pixel value over a 2D image plane.

Images are almost always specifically grounded in data received from sensors or sent to displays. This usage, as it relates to special-built hardware, has given rise to pragmatic ways of describing these pixel values more efficiently for purposes of implementation. It is not uncommon for pixel encodings to pack data in ways that simply don't align with a uniform-shape tensor representation. See, chroma subsampling, bayer patterns, etc. It is also quite common to consider an approximate or interpolated pixel value as the data is inherently 2d-spatial.

As such a raw buffer + image encoding really is the most authentic representation we can achieve. For many low-level image libraries or sensor drivers we should be able to directly map this structure to an API that lets us access or load the raw image buffer + some metadata.

On the other hand, Tensors are much more generally mapped to multi-dimensional arrays. They are often used in pure data and computational contexts that have nothing to do with images. Due to the wildly varied applications, the patterns of tensor compression (beyond things like run-length-encoding, or sparse / dense representation) are much more varied and domain specific. This means there simply aren't equivalent forms of tensor-encoding that are as common/applicable as what you see in images. In this case, a strongly typed buffer of primitives dramatically simplifies questions of indexing and tensor-value access. This is the exact approach taken by the Arrow tensor spec (https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor). Again, most tensor libraries work under this assumption and so feeding a tensor library from a typed buffer + shape will be the most naturally way to work with this data.

What about "RGB" Tensors?

All that said, it's still a very common pattern for a user to decode an image into an HxWxC (or CxHxW) tensor. And this is, in fact, what many users will expect to provide as an input. A numpy ndarray is a tensor -- not an image buffer.

Even for users working with images, whether the user expects to provide an Image (buffer + encoding) or a Tensor (ndarray) will heavily depend on where the user sits in the software stack of their organization.

Rather than fight against this, we may also want to support an "ImageTensor" archetype, which would be a Tensor datatype which we know stores the pixels of an image in one of the common tensor arrangements. This would not support any pixel-encoded images. Only those that had already been decoded into multi-channel tensors.

jleibs commented 1 month ago

Most of the choices for working with tensors fall into one of 4 categories.

Typed buffer, multiple data-types (the proposal)

Pros:

When processing a chunk the raw arrow data is much easier to work with
Opportunity to align with the official arrow spec for tensor representation
Aligns with our long-term direction of wanting to have multiple types and datatype conversions

Cons:

Multi-datatype representation means we must either proliferate typed components or introduce datatype conversions.

The current hypothesis is that proliferating types is a known challenge and can be mostly automated with a mixture of code-gen and some helper code, whereas datatype conversions is an unknown challenge.

Still this puts us on a pathway where once we support multi-typed components, we mostly delete a bunch of code and everything gets simpler. Any type conversions move from visualizer-space to data-query-space, but the types and arrow representations we work with don't actually need to change.

Untyped buffer with type-id

Pros

Avoids arrow unions while maintaining a single datatype.

Cons

Forces arrow users to do annoying user-space datatype casting.
Doesn't align with our long-term goals

Typed buffer with union