spiraldb / vortex

A toolkit for working with compressed Arrow in-memory, on-disk, and over-the-wire. "The LLVM of file formats"
Apache License 2.0

Audit serde of array metadata #396

Open gatesn opened 3 months ago

gatesn commented 3 months ago

We currently implement naive serde using Rust serde + flexbuffers by default. Many arrays can pack their metadata much more tightly. This is an overview issue to track auditing each one:

danking commented 1 week ago

I'm gonna try making the Validity metadata for Structs much smaller.

danking commented 1 week ago

We might eventually want to squeeze all metadata into 32 bits. We can reserve 0xffffffff to indicate that the metadata has spilled into a buffer.
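A minimal sketch of the sentinel idea, assuming a single `u32` metadata slot per array. The names `SPILLED`, `MetadataSlot`, `encode_slot`, and `decode_slot` are hypothetical and do not come from the vortex codebase; the sketch only illustrates how reserving 0xffffffff separates inline metadata from metadata stored in a buffer.

```rust
/// Reserved value meaning "metadata did not fit inline; read it from a buffer".
const SPILLED: u32 = 0xffff_ffff;

/// Hypothetical decoded view of the 32-bit metadata slot.
#[derive(Debug, PartialEq, Eq)]
enum MetadataSlot {
    /// The metadata is fully described by these 32 bits.
    Inline(u32),
    /// The metadata lives in a separate buffer.
    Spilled,
}

/// Encode the slot. `None` means the caller could not pack the metadata
/// into 32 bits. Inline bits that collide with the sentinel must spill too,
/// otherwise a reader could not tell the two cases apart.
fn encode_slot(inline_bits: Option<u32>) -> u32 {
    match inline_bits {
        Some(bits) if bits != SPILLED => bits,
        _ => SPILLED,
    }
}

/// Decode the slot written by `encode_slot`.
fn decode_slot(word: u32) -> MetadataSlot {
    if word == SPILLED {
        MetadataSlot::Spilled
    } else {
        MetadataSlot::Inline(word)
    }
}
```

One consequence of this design is that an encoding whose packed layout leaves at least one high bit unused can never collide with the sentinel, so it never has to spill by accident.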

robert3005 commented 1 week ago

I think we can spare 64 bits per encoding.

gatesn commented 1 week ago

For most arrays, validity metadata is just a single bit for whether or not a validity child is defined.

danking commented 2 days ago

RunEnd

Remove length; DType => PType (the run ends have to be an integer type).

pub struct RunEndMetadata {
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}
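To illustrate why "DType => PType" and "remove length" shrink the metadata, here is a rough sketch under stated assumptions. `IntPType` and `SlimRunEndMetadata` are hypothetical stand-ins, not the vortex types, and the fixed 17-byte layout is only one possible encoding; validity is left out to keep the example short.

```rust
/// Hypothetical integer-only primitive type tag that fits in one byte.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum IntPType {
    U8 = 0,
    U16 = 1,
    U32 = 2,
    U64 = 3,
    I8 = 4,
    I16 = 5,
    I32 = 6,
    I64 = 7,
}

impl IntPType {
    /// Map a stored tag byte back to the enum, rejecting unknown tags.
    fn from_tag(tag: u8) -> Option<Self> {
        Some(match tag {
            0 => Self::U8,
            1 => Self::U16,
            2 => Self::U32,
            3 => Self::U64,
            4 => Self::I8,
            5 => Self::I16,
            6 => Self::I32,
            7 => Self::I64,
            _ => return None,
        })
    }
}

/// Hypothetical slimmed-down RunEnd metadata: `length` is dropped and
/// `ends_dtype: DType` becomes a one-byte `ends_ptype`.
#[derive(Debug, PartialEq, Eq)]
struct SlimRunEndMetadata {
    ends_ptype: IntPType,
    num_runs: usize,
    offset: usize,
}

/// Read a little-endian u64 from exactly eight bytes.
fn read_u64_le(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_le_bytes(buf)
}

impl SlimRunEndMetadata {
    /// Fixed layout: 1 tag byte, then `num_runs` and `offset` as LE u64.
    fn to_bytes(&self) -> [u8; 17] {
        let mut out = [0u8; 17];
        out[0] = self.ends_ptype as u8;
        out[1..9].copy_from_slice(&(self.num_runs as u64).to_le_bytes());
        out[9..17].copy_from_slice(&(self.offset as u64).to_le_bytes());
        out
    }

    /// Inverse of `to_bytes`; returns `None` for an unknown ptype tag.
    fn from_bytes(bytes: &[u8; 17]) -> Option<Self> {
        Some(Self {
            ends_ptype: IntPType::from_tag(bytes[0])?,
            num_runs: read_u64_le(&bytes[1..9]) as usize,
            offset: read_u64_le(&bytes[9..17]) as usize,
        })
    }
}
```

The same one-byte tag would apply wherever the notes in this comment say "DType => PType".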

ALP

pub struct ALPMetadata {
    exponents: Exponents,
    encoded_dtype: DType,
    patches_dtype: Option<DType>,
}

RunEndBool

Remove length; DType => PType.

pub struct RunEndBoolMetadata {
    start: bool,
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}

RoaringInt

pub struct RoaringIntMetadata {
    ptype: PType,
}

FoR

Scalar => ScalarValue; recover the dtype from self.dtype(). Buffer, BufferString, and List values should go into the Array buffer.

pub struct FoRMetadata {
    reference: Scalar,
    shift: u8,
}

Dict

DType => PType

pub struct DictMetadata {
    codes_dtype: DType,
    values_len: usize,
}

DateTimeParts

DType => PType.

pub struct DateTimePartsMetadata {
    days_dtype: DType,
    seconds_dtype: DType,
    subseconds_dtype: DType,
}

FSST

DType => PType.

pub struct FSSTMetadata {
    symbols_len: usize,
    codes_dtype: DType,
    uncompressed_lengths_dtype: DType,
}

Null

remove len.

pub struct NullMetadata {
    len: usize,
}

Primitive

pub struct PrimitiveMetadata {
    validity: ValidityMetadata,
}

VarBin

DType => PType

pub struct VarBinMetadata {
    validity: ValidityMetadata,
    offsets_dtype: DType,
    bytes_len: usize,
}

Delta

pub struct DeltaMetadata {
    validity: ValidityMetadata,
    deltas_len: usize,
    offset: usize, // must be <1024
}

RoaringBool

Remove length

pub struct RoaringBoolMetadata {
    length: usize,
}

BitPacked

Remove length.

pub struct BitPackedMetadata {
    validity: ValidityMetadata,
    bit_width: usize,
    offset: usize, // Known to be <1024
    length: usize, // Store end padding (<1024) instead
    has_patches: bool,
}
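As a rough illustration of how tight this could get, the sketch below packs the BitPacked fields into a single `u32`. Everything here is an assumption made for the example: the struct `PackedBitPacked`, the field widths (7 bits for a bit width of at most 64, 10 bits each for an offset and end padding below 1024), the bit positions, and the reduction of validity to one "has validity child" bit as suggested earlier in the thread.

```rust
/// Hypothetical compact form of the BitPacked metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedBitPacked {
    bit_width: u8,            // assumed to be 0..=64, needs 7 bits
    offset: u16,              // < 1024, needs 10 bits
    end_padding: u16,         // < 1024, needs 10 bits, replaces `length`
    has_patches: bool,        // 1 bit
    has_validity_child: bool, // 1 bit
}

impl PackedBitPacked {
    /// Pack into 29 bits of a u32; returns `None` if a field is out of range.
    fn pack(&self) -> Option<u32> {
        if self.bit_width > 64 || self.offset >= 1024 || self.end_padding >= 1024 {
            return None;
        }
        Some(
            (self.bit_width as u32)
                | ((self.offset as u32) << 7)
                | ((self.end_padding as u32) << 17)
                | ((self.has_patches as u32) << 27)
                | ((self.has_validity_child as u32) << 28),
        )
    }

    /// Inverse of `pack`. Bits 29..=31 are ignored.
    fn unpack(word: u32) -> Self {
        Self {
            bit_width: (word & 0x7f) as u8,
            offset: ((word >> 7) & 0x3ff) as u16,
            end_padding: ((word >> 17) & 0x3ff) as u16,
            has_patches: ((word >> 27) & 1) == 1,
            has_validity_child: ((word >> 28) & 1) == 1,
        }
    }
}
```

Because the top three bits are always zero in this layout, a packed word can never equal 0xffffffff, so it would stay compatible with reserving that value as a "spilled into a buffer" marker.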

ByteBool

pub struct ByteBoolMetadata {
    validity: ValidityMetadata,
}

ZigZag

pub struct ZigZagMetadata;

Extension

DType => PType

pub struct ExtensionMetadata {
    storage_dtype: DType,
}

Struct

Remove length.

pub struct StructMetadata {
    length: usize,
    validity: ValidityMetadata,
}

Chunked

pub struct ChunkedMetadata {
    num_chunks: usize,
}

Sparse

Remove len; DType => PType; Scalar => ScalarValue.

pub struct SparseMetadata {
    indices_dtype: DType,
    // Offset value for patch indices as a result of slicing
    indices_offset: usize,
    indices_len: usize,
    len: usize,
    fill_value: Scalar,
}

Constant

Scalar => ScalarValue, remove length.

pub struct ConstantMetadata {
    scalar: Scalar,
    length: usize,
}

Bool

Remove length.

pub struct BoolMetadata {
    validity: ValidityMetadata,
    length: usize,
    bit_offset: usize,
}

VarBinView

pub struct VarBinViewMetadata {
    validity: ValidityMetadata,
    data_lens: Vec<usize>,
}

danking commented 12 hours ago

Three relevant PRs:

  1. https://github.com/spiraldb/vortex/pull/956
  2. https://github.com/spiraldb/vortex/pull/955
  3. https://github.com/spiraldb/vortex/pull/951