TensorData stays typed -- but it is completely unrelated to rr.Image now.
One user story to consider when designing this is logging a batch of images as a single tensor (very common in the deep learning world).
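For example, a sketch using the Python SDK (rr.Tensor and dim_names exist in the SDK, though exact signatures vary across versions):

import numpy as np
import rerun as rr

rr.init("batch_tensor_sketch", spawn=True)

# A batch of 32 RGB images straight out of a dataloader, logged as ONE tensor.
batch = np.random.randint(0, 256, size=(32, 64, 64, 3), dtype=np.uint8)
rr.log("training/batch", rr.Tensor(batch, dim_names=("batch", "height", "width", "channel")))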
Alternative version that still unifies tensors and images:
archetype Tensor {
    buffer: BufferU8,
    shape: TensorShape,
    element_type: TensorElementType,
}

archetype Image {
    buffer: BufferU8,
    shape: TensorShape,
    image_type: PixelFormat,
}

component BufferU8 {
    data: [u8],
}
enum TensorElementType {
    U8,
    U16,
    U32,
    U64,
    I8,
    I16,
    I32,
    I64,
    F16,
    F32,
    F64,
}

enum PixelFormat {
    /* Image formats */
    RGBA8_SRGB_22,
    RG32F,
    NV12,
    // ...

    /* Depth formats */
    F16,
    F32,
    F32_LINEAR_XXX,
    // ...

    /* Segmentation formats */
    U8,
    U16,
    // ...
}
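To make the type-erasure concrete, here is a small sketch of how a numpy array would decompose into these components (component names follow the proposal; endianness handling is glossed over):

import numpy as np

arr = np.arange(12, dtype=np.float32).reshape(3, 4)

# The three pieces the proposed Tensor archetype would carry:
buffer_u8 = arr.tobytes()  # BufferU8: raw, type-erased bytes
shape = list(arr.shape)    # TensorShape: [3, 4]
element_type = "F32"       # TensorElementType, recovered from arr.dtype

# Round trip: reinterpret the erased bytes using the recorded metadata.
restored = np.frombuffer(buffer_u8, dtype=np.float32).reshape(shape)
assert (restored == arr).all()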
We want to get rid of Arrow unions.

Ultimately we want datatype conversions, so we can have a single TensorBuffer component be backed by multiple datatypes (BufferU8, BufferU16, …). However, datatype conversions are not here yet.

So what do we do in the meantime? I see two major suggestions:

A) We can split tensor into 11 different archetypes and components (TensorU8, TensorU16, etc.) as suggested here: https://github.com/rerun-io/rerun/issues/6388

B) We can use a single U8 buffer for all tensors, plus an enum for its element type, as suggested in https://github.com/rerun-io/rerun/issues/6386#issuecomment-2132895552
A) has the advantage that it uses the correct Arrow datatype. However, it means 11x the archetypes and 11x the components. B) has the advantage of type erasure, i.e. "Get me a Tensor" instead of "Get me a Tensor, and it must be F32".
A second discussion: should tensors and images be completely different things, or unified?
One of the motivators for the tensor refactor work is to make the raw Arrow APIs more approachable, especially for ML workflows that are tensor-heavy.
I don't think type erasure is a good thing in that case, as it adds more low-level complexity.
Another thing I remembered is that Arrow has an official variable-shape-tensor spec, and it's very close to the new multi-archetype proposal: https://arrow.apache.org/docs/format/CanonicalExtensions.html#variable-shape-tensor. If we go down the multi-archetype path, I think we should consider aligning with the official spec, since it raises the probability that another library that outputs or consumes tensors from Arrow will be able to return chunks already in this exact structure.
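For reference, a rough pyarrow sketch of the storage layout that spec describes, per my reading of it: a struct of per-tensor data plus a per-tensor shape (the extension-type metadata and registration are omitted):

import pyarrow as pa

# Storage layout for a variable-shape uint8 tensor column with ndim=2:
# each row carries its own flat data buffer plus its own shape.
ndim = 2
storage_type = pa.struct([
    ("data", pa.list_(pa.uint8())),
    ("shape", pa.list_(pa.int32(), ndim)),  # FixedSizeList<int32>[ndim]
])

tensors = pa.array(
    [
        {"data": [0, 1, 2, 3, 4, 5], "shape": [2, 3]},  # a 2x3 tensor
        {"data": [7, 8, 9, 10], "shape": [2, 2]},       # a 2x2 tensor
    ],
    type=storage_type,
)
print(tensors)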
Ultimately we want datatype conversions, so we can have a single TensorBuffer component be backed by multiple datatypes (BufferU8, BufferU16, …). However, datatype conversions are not here yet.
This also aligns with the multi-archetype proposal. We can define the datatypes now; because we don't support datatype conversions yet, this adds more work on the visualizers to support one-of-many components. Later, we remap to a single component, but the Arrow-encoded datatypes don't have to change. This will be far easier to manage from both a migration and a backwards-compatibility perspective.
That reminds me: we should have separate shape: [u64] and dim_names: [str] components.
We want to support many precise and common pixel formats, like RGB565, NV12, etc.
We also want to easily be able to convert tensors (e.g. numpy arrays) into an Image without having a combinatorial explosion of weird pixel formats that would dilute the PixelFormat enum (BGR_i16, etc.).
The Image archetype must have EITHER a PixelFormat enum with the most common pixel formats, OR the two ColorModel + ElementType components. PixelFormat always wins if it exists.

PixelFormat path

enum PixelFormat {
    RGB_565,
    RGBA_8888,
    NV12,
    YUY2,
    …
}
Fully specifies the ColorModel and Planarity of an image.
See https://facebookresearch.github.io/ocean/docs/images/pixel_formats_and_plane_layout/ for prior art on naming these.
ColorModel and Datatype

If there is no PixelFormat, the viewer will look for two other components:

- ColorModel: RGB, RGBA, BGR, BGRA, L, LA, A, …
- ElementType: U4, U8, U10, U16, U32, F16, F32, F64, …

This spans a huge cartesian product space of possibilities, and makes it easy to convert a tensor or numpy array into an Image.
Having these as two enum components means:

- ColorModel needs specifying (e.g. with rr.Image(rgb=numpy_image) or rr.Image(np_img, "RGB")); see the sketch below
- In the future we can also add a component here to specify Planarity: are the RGB components interleaved or in different planes?
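A quick sketch of those proposed ergonomics (this uses the rr.Image signature proposed above, which is an assumption, not necessarily the shipped API):

import numpy as np
import rerun as rr

rr.init("color_model_sketch", spawn=True)

np_img = np.zeros((480, 640, 3), dtype=np.uint8)  # ElementType U8 falls out of the dtype
rr.log("camera/image", rr.Image(np_img, "RGB"))   # ColorModel is given explicitly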
More research is needed, but at the start we only need to support two things:

enum ColorSpace { sRGB_Linear, sRGB_Encoded }

The linear-vs-encoded distinction only applies to ColorModel+Datatype.
component enum PixelFormat {
    RGB_565,
    RGBA_8888,
    NV12,
    YUY2,
    …
}
component enum ColorModel { RGB, RGBA, BGR, BGRA, L, LA, A }
component enum ElementType { U4, U8, U10, U16, U32, F16, F32, F64, … }
archetype Image {
    // Required:
    buffer: ImageBuffer,
    resolution: Resolution2D,

    // If not specified, then `ColorModel` AND `ElementType` MUST be specified:
    pixel_format: Option<PixelFormat>,
    color_model: Option<ColorModel>,
    element_type: Option<ElementType>,

    // Optional (and not in first version):
    color_space: Option<ColorSpace>,
    // Largest value, e.g. `1.0` or `255.0`, useful for `ElementType::F32`:
    max: Option<f64>,
    stride: Option<ImageStride>,
}
archetype DepthImage {
    // Required:
    buffer: ImageBuffer,
    resolution: Resolution2D,
    element_type: ElementType,

    // Optional (and not in first version):
    stride: Option<ImageStride>,
}

archetype SegmentationImage {
    // Required:
    buffer: ImageBuffer,
    resolution: Resolution2D,
    element_type: ElementType,

    // Optional (and not in first version):
    stride: Option<ImageStride>,
}
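For instance, under this schema a depth image reduces to a single-channel buffer whose element type is inferred from the array dtype. A sketch with the Python SDK (rr.DepthImage exists; the inference behavior is assumed):

import numpy as np
import rerun as rr

rr.init("depth_sketch", spawn=True)

# Single-channel float32 depth map; ElementType F32 would be inferred from the dtype.
depth = np.random.rand(480, 640).astype(np.float32)
rr.log("camera/depth", rr.DepthImage(depth))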
archetype Tensor {
    …
    // If set, interpret this tensor as an image.
    color_model: Option<ColorModel>,
}
https://rerunio.slack.com/archives/C041NHU952S/p1721031794678379
The proposed design above introduces quite a bit of complexity:
archetype Image {
    // Required:
    buffer: ImageBuffer,
    resolution: Resolution2D,

    // If not specified, then `ColorModel` AND `ElementType` MUST be specified:
    pixel_format: Option<PixelFormat>,
    color_model: Option<ColorModel>,
    element_type: Option<ElementType>,
}
Whereas we previously had a single self-describing TensorData component, we now need to query 5 different components and join them to interpret the contents of the ImageBuffer. There are now many more edge cases, allowing for misuse, partial updates, and invalid representations.
There are a few practical benefits to splitting ImageBuffer from the metadata describing the image shape.
However, we can still get the vast majority of this utility by combining the assorted metadata into a single, required ImageFormat component:
archetype Image {
    // Required:
    buffer: ImageBuffer,
    format: ImageFormat,
}
// More noise from our datamodel.
component ImageFormat {
    datatype ImageFormat;
}

datatype ImageFormat {
    width: uint,
    height: uint,
    row_stride: Option<uint>,
    pixel_format: PixelFormat,
    color_model: Option<ColorModel>,   // used if pixel_format == ARRAY
    element_type: Option<ElementType>, // used if pixel_format == ARRAY
}
component enum PixelFormat {
    ARRAY,
    RGB_565,
    RGBA_8888,
    NV12,
    YUY2,
    …
}
This also simplifies the ImageFormat a bit by explicitly splitting out width and height, and by making pixel_format always required, with the special ARRAY value indicating that the last two optional fields are in use.
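For illustration, a minimal Python sketch of that validity rule, mirroring the proposed field names (this is not shipped API, just the shape of the logic):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageFormat:
    width: int
    height: int
    pixel_format: str = "ARRAY"         # "ARRAY" defers to the two fields below
    row_stride: Optional[int] = None
    color_model: Optional[str] = None   # e.g. "RGB"; only used when pixel_format == "ARRAY"
    element_type: Optional[str] = None  # e.g. "U8"; only used when pixel_format == "ARRAY"

    def is_valid(self) -> bool:
        # The ARRAY sentinel makes the two optional fields mandatory.
        if self.pixel_format == "ARRAY":
            return self.color_model is not None and self.element_type is not None
        return True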
The only practical downside of this change is that you can't (yet) override just color_model; you have to override the entirety of the image format.
Since the above apparently wasn't clear enough in terms of the problems it was trying to address:
If Resolution, Stride, PixelFormat, ColorModel, and Datatype are all separate components, this means they can be updated totally asynchronously and that they all need to be joined together by either a range or latest-at query in order to have all the information to interpret an image buffer.
In turn, every image query must bottom out in a 5-way component join, which, in a pathological case, spans 5 chunks and has potentially different join results depending on the timeline, some of which yield corrupt data interpretation.
Here’s an example: let’s suppose someone first logs an image using the NV12 pixel format and then logs an image using the RGB color model. If they don’t realize they needed to explicitly clear the pixel format (since it’s an optional component), it now joins into their query results implicitly via latest-at, yielding corrupted data.
If everyone only uses our archetype APIs and those APIs are carefully crafted and tested we can generally work around this issue. But this is back to saying our low-level arrow APIs aren’t approachable to users, which is the very problem we’re trying to solve.
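To make that failure mode concrete, here is a sketch in Python. rr.set_time_sequence is the SDK's sequence-timeline call; the rr.Image keyword arguments follow the proposed split-component schema above and are assumptions, not shipped API:

import rerun as rr

rr.init("join_hazard_sketch", spawn=True)

w, h = 640, 480
nv12_bytes = bytes(w * h * 3 // 2)  # NV12 is 1.5 bytes per pixel
rgb_bytes = bytes(w * h * 3)        # interleaved RGB, U8

rr.set_time_sequence("frame", 0)
# Frame 0: logged via the PixelFormat path.
rr.log("camera", rr.Image(buffer=nv12_bytes, resolution=[w, h], pixel_format="NV12"))

rr.set_time_sequence("frame", 1)
# Frame 1: switches to the ColorModel+ElementType path, but never clears PixelFormat.
rr.log("camera", rr.Image(buffer=rgb_bytes, resolution=[w, h],
                          color_model="RGB", element_type="U8"))

# A latest-at query at frame 1 joins the stale pixel_format="NV12" from frame 0
# with the new RGB buffer: the viewer decodes garbage.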
… a u8 blob, not a generic TensorData.

Examples of pixel formats:
- RGBA8
- RG32F
- NV12
- JPEG

Also: when logging an Image straight out of a numpy tensor, we still log the shape as a component, which the viewer will use to guess the stride and pixel format, unless they have been logged explicitly by the user (… JPEG).

Update: Following live discussion, we introduce ImageEncoded, as such: