rerun-io / rerun

Visualize streams of multimodal data. Free, fast, easy to use, and simple to integrate. Built in Rust.
https://rerun.io/
Apache License 2.0

Encode images as a byte blob + pixel format + resolution; not as a tensor #6386

Closed · teh-cmc closed this issue 3 months ago

teh-cmc commented 6 months ago

Examples of pixel formats: NV12, YUY2, RGB565, …

Also:

Update

Following a live discussion, we are introducing ImageEncoded, as follows:

archetype Image {
  /// The raw image data.
  buffer: ImageBuffer,
  /// Width and height in pixels.
  resolution: Resolution2D,
  /// How the buffer's bytes should be interpreted (e.g. NV12).
  pixel_format: PixelFormat,
  /// Row/plane layout; if unset, the buffer is assumed tightly packed.
  stride: Option<ImageStride>,
}

component ImageStride {
  bytes_per_row: u32,
  bytes_per_plane: Option<u32>,
}

/// e.g. JPEG, PNG, …
archetype ImageEncoded {
  blob: Blob,
  media_type: Option<MediaType>,
}
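
For illustration, logging an encoded image could then look something like this from Python. This is only a sketch: `rr.ImageEncoded` and its parameter names here are assumptions mirroring the archetype above, not a finalized SDK API.

import rerun as rr

rr.init("image_encoded_example", spawn=True)

with open("photo.jpg", "rb") as f:
    jpeg_bytes = f.read()

# Hypothetical binding for the `ImageEncoded` archetype sketched above:
# ship the raw encoded bytes and let the viewer decode them.
rr.log("camera/image", rr.ImageEncoded(blob=jpeg_bytes, media_type="image/jpeg"))
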
teh-cmc commented 6 months ago

TensorData stays typed -- but it is completely unrelated to rr.Image now

emilk commented 6 months ago

One user story to consider when designing this is logging a batch of images as a single tensor (very common in the deep learning world).
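
For concreteness, the kind of data meant here is a single array holding many images, e.g. in numpy:

import numpy as np

# A batch of 32 RGB images (480x640) as one 4-D tensor -- the layout
# deep-learning data loaders typically produce ([N, H, W, C]).
batch = np.random.randint(0, 255, size=(32, 480, 640, 3), dtype=np.uint8)

# With a unified Tensor archetype this could be logged as-is; with a
# split Image archetype, each batch[i] would be logged separately.
first_image = batch[0]  # shape (480, 640, 3)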

Wumpf commented 6 months ago

Related to:

emilk commented 6 months ago

Alternative version that still unifies tensors and images:

archetype Tensor {
    buffer: BufferU8,
    shape: TensorShape,
    element_type: TensorElementType,
}

archetype Image {
    buffer: BufferU8,
    shape: TensorShape,
    image_type: PixelFormat,
}

component BufferU8 {
    data: [u8],
}

enum TensorElementType {
    U8,
    U16,
    U32,
    U64,

    I8,
    I16,
    I32,
    I64,

    F16,
    F32,
    F64,
}

enum PixelFormat {
    /* Image formats */
    RGBA8_SRGB_22,
    RG32F,
    NV12,
    // ...

    /* Depth formats */
    F16,
    F32,
    F32_LINEAR_XXX,
    // ...

    /* Segmentation formats */
    U8,
    U16,
    // ...
}
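
On the consuming side, the shared BufferU8 implies type erasure: the raw bytes are reinterpreted according to the logged element type and shape. A sketch with numpy standing in for the viewer (names are illustrative only):

import numpy as np

# Under this proposal every buffer travels as raw bytes (`BufferU8`);
# the logged metadata says how to view them.
raw = np.arange(12, dtype=np.float32).tobytes()  # 48 bytes on the wire

# Viewer side: reinterpret according to TensorElementType + TensorShape.
element_type = "F32"
shape = (3, 4)
dtype = {"U8": np.uint8, "U16": np.uint16, "F32": np.float32}[element_type]
tensor = np.frombuffer(raw, dtype=dtype).reshape(shape)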
emilk commented 6 months ago

We want to get rid of Arrow unions.

Ultimately we want datatype conversions, so we can have a single TensorBuffer component be backed by multiple datatypes (BufferU8, BufferU16, …). However, datatype conversions are not here yet.

So what do we do in the meantime? I see two major suggestions:

A) We can split tensor into 11 different archetypes and components (TensorU8, TensorU16, etc) as suggested here: https://github.com/rerun-io/rerun/issues/6388

B) We can also use a single U8 buffer for all tensors and then an enum for its type as suggested in https://github.com/rerun-io/rerun/issues/6386#issuecomment-2132895552

A) has the advantage that it uses the correct Arrow datatype. However, it needs 11x the archetypes and 11x the components. B) has the advantage of type-erasure, i.e. "Get me a Tensor" instead of "Get me a Tensor, and it must be F32".
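
To make the trade-off concrete, here is a toy model of the two query styles in plain Python; the store layout and component names are invented for illustration, not real Rerun APIs:

from typing import Any

# Toy component store: entity path -> (component name -> value).
store: dict[str, dict[str, Any]] = {
    "my/tensor": {"TensorDataF32": [1.0, 2.0, 3.0], "TensorShape": [3]},
}

# Option A: typed components -- a consumer that just wants "the tensor"
# must probe every typed variant until one matches.
TYPED = ["TensorDataU8", "TensorDataU16", "TensorDataF32"]  # ...and 8 more
components = store["my/tensor"]
data = next(components[name] for name in TYPED if name in components)

# Option B: one type-erased component plus a tag -- a single lookup, at
# the cost of a byte-reinterpretation on the way out, e.g.:
#   {"TensorBuffer": b"...", "ElementType": "F32", "TensorShape": [3]}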


A second discussion: should tensors and images be completely different things, or unified?

jleibs commented 6 months ago

One of the motivators for the tensor refactor work is to make the raw Arrow APIs more approachable, especially for ML workflows that are tensor-heavy.

I don't think type erasure is a good thing in that case, as it adds more low-level complexity.

Another thing I remembered is that Arrow has an official variable-shape-tensor spec, which is very close to the new multi-archetype proposal: https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor. If we go down the multi-archetype path, I think we should consider aligning with the official spec, since it raises the probability that another library that produces or consumes tensors from Arrow will be able to return chunks already in this exact structure.
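
For reference, the storage layout that spec describes can be built with stock pyarrow; this sketch skips the extension-type registration and just shows the raw struct-of-data-and-shape layout (per my reading of the spec):

import pyarrow as pa

# One struct row per tensor: `data` holds the flattened elements,
# `shape` is a fixed-size list with one entry per dimension (ndim = 2).
data = pa.array([[1, 2, 3, 4, 5, 6], [7, 8]], type=pa.list_(pa.uint8()))
shape = pa.array([[2, 3], [1, 2]], type=pa.list_(pa.int32(), 2))
tensors = pa.StructArray.from_arrays([data, shape], names=["data", "shape"])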

jleibs commented 6 months ago

> Ultimately we want datatype conversions, so we can have a single TensorBuffer component be backed by multiple datatypes (BufferU8, BufferU16, …). However datatype conversions are not here yet.

This also aligns with the multi-archetype proposal. We can define the datatypes now; because we don't yet support datatype conversions, this adds more work on the visualizers to support one-of-many components. Later, we remap to a single component, but the Arrow-encoded datatypes don't have to change. This will be far easier to manage from both a migration and a backwards-compatibility perspective.

emilk commented 6 months ago

> https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor

That reminds me: we should have separate `shape: [u64]` and `dim_names: [str]` components.

emilk commented 4 months ago

A concrete design

Problem

We want to support many precise and common pixel formats, like RGB565, NV12, etc.

We also want to easily be able to convert tensors (e.g. numpy arrays) into an Image without having a combinatorial explosion of weird pixel formats that will dilute the PixelFormat enum (BGR_i16 etc).

Solution

The Image archetype must have EITHER a PixelFormat enum with the most common pixel formats, OR the two ColorModel + ElementType components. PixelFormat always wins if it exists.
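
In viewer terms, the precedence rule could look like the following sketch (illustrative names, not actual viewer code):

from typing import Optional

def resolve_format(pixel_format: Optional[str],
                   color_model: Optional[str],
                   element_type: Optional[str]) -> str:
    """Precedence rule from above: PixelFormat always wins if present."""
    if pixel_format is not None:
        return pixel_format                     # e.g. "NV12"
    if color_model is not None and element_type is not None:
        return f"{color_model}_{element_type}"  # e.g. "RGB_U8"
    raise ValueError("need PixelFormat, or ColorModel + ElementType")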

PixelFormat path

enum PixelFormat {
    RGB_565,
    RGBA_8888,
    NV12,
    YUY2,
    …
}

Fully specifies the ColorModel and Planarity of an image.

See https://facebookresearch.github.io/ocean/docs/images/pixel_formats_and_plane_layout/ for prior art on naming these.

ColorModel and ElementType

If there is no PixelFormat, the viewer will look for two other components: ColorModel and ElementType.

This spans a huge cartesian product space of possibilities, and makes it easy to convert a tensor or numpy array into an Image.

Having these as two separate enum components means each of them can be set, queried, and overridden independently.

In the future we can also add a component here to specify Planarity: Are the RGB components interleaved or in different planes?

Color spaces and gamma curves

More research needed, but at the start we only need to support two things:

enum ColorSpace { sRGB_Linear, sRGB_Encoded }

The linear-vs-encoded distinction only applies to the ColorModel + ElementType path.

Summary

component enum PixelFormat {
    RGB_565,
    RGBA_8888,
    NV12,
    YUY2,
    …
}

component enum ColorModel { RGB, RGBA, BGR, BGRA, L, LA, A }

component enum ElementType { U4, U8, U10, U16, U32, F16, F32, F64, … }

archetype Image {
  // Required:
  buffer: ImageBuffer,
  resolution: Resolution2D,

  // If not specified, then `ColorModel` AND `ElementType` MUST be specified
  pixel_format: Option<PixelFormat>,

  color_model: Option<ColorModel>,
  element_type: Option<ElementType>,

  // Optional (and not in first version):
  color_space: Option<ColorSpace>,
  // Largest value, e.g. `1.0` or `255.0`, useful for `ElementType::F32`
  max: Option<f64>,
  stride: Option<ImageStride>,
}

archetype DepthImage {
  // Required:
  buffer: ImageBuffer,
  resolution: Resolution2D,
  element_type: ElementType,

  // Optional (and not in first version):
  stride: Option<ImageStride>,
}

archetype SegmentationImage {
  // Required:
  buffer: ImageBuffer,
  resolution: Resolution2D,
  element_type: ElementType,

  // Optional (and not in first version):
  stride: Option<ImageStride>,
}

archetype Tensor {
    …

  // If set, interpret this tensor as an image.
  color_model: Option<ColorModel>,
}
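
A sketch of what logging might look like against this design; the Python parameter names here are assumptions mirroring the components above, not a finalized SDK API:

import numpy as np
import rerun as rr

rgb = np.zeros((480, 640, 3), dtype=np.uint8)

# ColorModel + ElementType path: the natural fit for numpy arrays.
rr.log("camera/rgb", rr.Image(buffer=rgb.tobytes(), resolution=(640, 480),
                              color_model="RGB", element_type="U8"))

# PixelFormat path: for packed/chroma-subsampled formats like NV12,
# which ColorModel + ElementType cannot describe.
nv12 = np.zeros((480 * 3 // 2, 640), dtype=np.uint8)  # Y plane + interleaved UV
rr.log("camera/nv12", rr.Image(buffer=nv12.tobytes(), resolution=(640, 480),
                               pixel_format="NV12"))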

Discussion

https://rerunio.slack.com/archives/C041NHU952S/p1721031794678379

jleibs commented 4 months ago

The proposed design above introduces quite a bit of complexity:

archetype Image {
  // Required:
  buffer: ImageBuffer,
  resolution: Resolution2D,

  // If not specified, then `ColorModel` AND `ElementType` MUST be specified
  pixel_format: Option<PixelFormat>,

  color_model: Option<ColorModel>,
  element_type: Option<ElementType>,
}

Whereas we previously had a single self-describing TensorData component, we now need to query 5 different components and join them to interpret the contents of the ImageBuffer. This opens up many more edge cases that allow for misuse, partial updates, and invalid representations.

There are a few practical benefits to splitting ImageBuffer from the metadata describing the image shape, for example re-using a buffer while overriding only how it is interpreted.

However, we can still get the vast majority of this utility by combining the assorted metadata into a single, required, ImageFormat component:

archetype Image {
  // Required:
  buffer: ImageBuffer,
  format: ImageFormat
}

// More noise from our datamodel.
component ImageFormat {
  datatype ImageFormat;
}

datatype ImageFormat {
  width: uint,
  height: uint,
  row_stride: Option<uint>,

  pixel_format: PixelFormat,

  color_model: Option<ColorModel>,   // used if pixel_format = ARRAY
  element_type: Option<ElementType>, // used if pixel_format = ARRAY
}

component enum PixelFormat {
    ARRAY,
    RGB_565,
    RGBA_8888,
    NV12,
    YUY2,
    …
}

This also simplifies the ImageFormat a bit by explicitly splitting out width and height, as well as making pixel_format always required, with a special ARRAY value indicating that the last two optional fields are in use.

The only practical downside of this change is you can't (yet) just override color_model. You have to override the entirety of the image format.
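
A toy model of the decode path this buys, in plain Python with illustrative names: interpreting the buffer now hinges on one required struct instead of a five-way component join.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageFormat:                      # mirrors the datatype above
    width: int
    height: int
    pixel_format: str                   # "ARRAY", "NV12", ...
    row_stride: Optional[int] = None
    color_model: Optional[str] = None   # used if pixel_format == "ARRAY"
    element_type: Optional[str] = None  # used if pixel_format == "ARRAY"

def interpret(buffer: bytes, fmt: ImageFormat) -> str:
    # One required component => one code path, no cross-component join.
    if fmt.pixel_format == "ARRAY":
        return f"{fmt.width}x{fmt.height} {fmt.color_model}/{fmt.element_type}"
    return f"{fmt.width}x{fmt.height} {fmt.pixel_format}"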

jleibs commented 3 months ago

Since the above apparently wasn't clear enough in terms of the problems it was trying to address:

If Resolution, Stride, PixelFormat, ColorModel, and Datatype are all separate components, this means they can be updated totally asynchronously and that they all need to be joined together by either a range or latest-at query in order to have all the information to interpret an image buffer.

In turn, every image query must bottom out in a 5-way component join, which, in a pathological case, spans 5 chunks and has potentially different join results depending on the timeline, some of which yield corrupt data interpretation.

Here’s an example: suppose someone first logs an image using the NV12 pixel format and then logs an image using the RGB color model. If they don’t realize they need to explicitly clear the pixel format (since it’s an optional component), it joins into their query results implicitly via latest-at, yielding corrupted data.

If everyone only uses our archetype APIs and those APIs are carefully crafted and tested we can generally work around this issue. But this is back to saying our low-level arrow APIs aren’t approachable to users, which is the very problem we’re trying to solve.
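
The failure mode is easy to reproduce with a toy latest-at store in plain Python (no real Rerun API involved):

# Toy latest-at semantics: each component keeps its last-logged value.
latest: dict[str, object] = {}

def log(**components):  # stand-in for logging components separately
    latest.update(components)

log(buffer=b"<nv12 bytes>", pixel_format="NV12")
log(buffer=b"<rgb bytes>", color_model="RGB", element_type="U8")

# The user forgot to clear `pixel_format`, so the stale value joins in
# via latest-at and wins over `color_model`: corrupt decode.
assert latest["pixel_format"] == "NV12"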