Open gatesn opened 3 months ago
I'm gonna try making the Validity metadata for Structs much smaller.
We might eventually want to squeeze all metadata into 32 bits. We can reserve 0xffffffff to indicate that the metadata has spilled into a buffer.
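A minimal sketch of the sentinel idea, assuming a single `u32` slot per array. The names `SPILLED`, `MetadataSlot`, `encode_inline`, and `decode_slot` are hypothetical and only illustrate the shape; none of this is the existing API.

```rust
/// Reserved value meaning "the metadata did not fit inline; read it from a buffer".
const SPILLED: u32 = 0xffff_ffff;

/// Hypothetical decoded form of the 32-bit metadata slot.
#[derive(Debug, PartialEq, Eq)]
pub enum MetadataSlot {
    /// The metadata fits in the 32-bit word itself.
    Inline(u32),
    /// The metadata lives in a separate buffer.
    Spilled,
}

/// Accept any inline value except the one reserved as the sentinel.
pub fn encode_inline(bits: u32) -> Option<u32> {
    if bits == SPILLED {
        None
    } else {
        Some(bits)
    }
}

/// Interpret a raw 32-bit slot.
pub fn decode_slot(word: u32) -> MetadataSlot {
    if word == SPILLED {
        MetadataSlot::Spilled
    } else {
        MetadataSlot::Inline(word)
    }
}
```

Any encoding whose packed layout uses fewer than 32 bits can never collide with the sentinel, so the check in `encode_inline` only matters for layouts that use the full word.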
I think we can spare 64 bits per encoding.
For most arrays, validity metadata is just a single bit for whether or not a validity child is defined.
Remove length, dtype => ptype (it has to be an int).
pub struct RunEndMetadata {
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}
pub struct ALPMetadata {
    exponents: Exponents,
    encoded_dtype: DType,
    patches_dtype: Option<DType>,
}
Remove length, dtype => ptype.
pub struct RunEndBoolMetadata {
    start: bool,
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}
pub struct RoaringIntMetadata {
    ptype: PType,
}
Scalar => ScalarValue, use self.dtype(). Buffer, BufferString, List should go into the Array buffer.
pub struct FoRMetadata {
    reference: Scalar,
    shift: u8,
}
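A rough sketch of what "Scalar => ScalarValue, use self.dtype()" could look like for FoR, assuming the reference is always a primitive integer. `PType`, `Value`, and `CompactFoRMetadata` here are hypothetical stand-ins, not the real types: the metadata keeps only the raw bits, and the type comes from the array at read time.

```rust
/// Hypothetical stand-in for the primitive type already known from the array's dtype.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PType {
    U8,
    I32,
    I64,
}

/// Hypothetical typed value, reconstructed only when it is needed.
#[derive(Debug, PartialEq, Eq)]
pub enum Value {
    U8(u8),
    I32(i32),
    I64(i64),
}

/// Metadata holds just the raw bits of the reference and the shift; no dtype is serialized.
pub struct CompactFoRMetadata {
    pub reference_bits: u64,
    pub shift: u8,
}

impl CompactFoRMetadata {
    /// Reinterpret the stored bits using the type supplied by the array.
    pub fn reference(&self, ptype: PType) -> Value {
        match ptype {
            PType::U8 => Value::U8(self.reference_bits as u8),
            PType::I32 => Value::I32(self.reference_bits as u32 as i32),
            PType::I64 => Value::I64(self.reference_bits as i64),
        }
    }
}
```

The point of the design is that the dtype is stored once on the array rather than a second time inside the serialized scalar.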
DType => PType.
pub struct DictMetadata {
    codes_dtype: DType,
    values_len: usize,
}
DType => PType.
pub struct DateTimePartsMetadata {
    days_dtype: DType,
    seconds_dtype: DType,
    subseconds_dtype: DType,
}
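To illustrate why DType => PType helps here: if all three children are integer-typed, each type is a small tag, and three tags fit in 12 bits. The sketch below assumes a hypothetical `IntPType` enum with 4-bit tags; the real `PType` and its numbering may differ.

```rust
/// Hypothetical integer ptype with an explicit small tag per variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum IntPType {
    U8 = 0,
    U16 = 1,
    U32 = 2,
    U64 = 3,
    I8 = 4,
    I16 = 5,
    I32 = 6,
    I64 = 7,
}

impl IntPType {
    /// Map a tag back to a variant, rejecting unknown tags.
    fn from_tag(tag: u8) -> Option<Self> {
        Some(match tag {
            0 => Self::U8,
            1 => Self::U16,
            2 => Self::U32,
            3 => Self::U64,
            4 => Self::I8,
            5 => Self::I16,
            6 => Self::I32,
            7 => Self::I64,
            _ => return None,
        })
    }
}

/// Pack the three child ptypes into the low 12 bits: days | seconds << 4 | subseconds << 8.
pub fn pack_parts(days: IntPType, seconds: IntPType, subseconds: IntPType) -> u32 {
    (days as u32) | (seconds as u32) << 4 | (subseconds as u32) << 8
}

/// Recover the three ptypes, or None if any nibble holds an unknown tag.
pub fn unpack_parts(word: u32) -> Option<(IntPType, IntPType, IntPType)> {
    Some((
        IntPType::from_tag((word & 0xf) as u8)?,
        IntPType::from_tag(((word >> 4) & 0xf) as u8)?,
        IntPType::from_tag(((word >> 8) & 0xf) as u8)?,
    ))
}
```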
DType => PType.
pub struct FSSTMetadata {
    symbols_len: usize,
    codes_dtype: DType,
    uncompressed_lengths_dtype: DType,
}
Remove len.
pub struct NullMetadata {
    len: usize,
}
pub struct PrimitiveMetadata {
    validity: ValidityMetadata,
}
DType => PType.
pub struct VarBinMetadata {
    validity: ValidityMetadata,
    offsets_dtype: DType,
    bytes_len: usize,
}
pub struct DeltaMetadata {
    validity: ValidityMetadata,
    deltas_len: usize,
    offset: usize, // must be <1024
}
Remove length.
pub struct RoaringBoolMetadata {
    length: usize,
}
Remove length.
pub struct BitPackedMetadata {
    validity: ValidityMetadata,
    bit_width: usize,
    offset: usize, // Known to be <1024
    length: usize, // Store end padding instead, <1024
    has_patches: bool,
}
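Given the bounds in the comments above, BitPacked looks like it could fit in a single 32-bit word. A sketch under these assumptions: validity collapses to one "has a validity child" bit, `bit_width` is at most 64, and end padding is stored in place of `length`. The struct name and the bit layout are hypothetical.

```rust
/// Hypothetical compact form of the BitPacked metadata (not the current layout).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PackedBitPackedMetadata {
    pub has_validity: bool, // assumed: one bit for "a validity child is defined"
    pub bit_width: u8,      // assumed <= 64
    pub offset: u16,        // < 1024
    pub end_padding: u16,   // < 1024, stored instead of length
    pub has_patches: bool,
}

impl PackedBitPackedMetadata {
    /// Layout, low to high: validity:1 | patches:1 | bit_width:7 | offset:10 | end_padding:10.
    /// Returns None if any field is out of range.
    pub fn pack(&self) -> Option<u32> {
        if self.bit_width > 64 || self.offset >= 1024 || self.end_padding >= 1024 {
            return None;
        }
        Some(
            (self.has_validity as u32)
                | (self.has_patches as u32) << 1
                | (self.bit_width as u32) << 2
                | (self.offset as u32) << 9
                | (self.end_padding as u32) << 19,
        )
    }

    /// Inverse of `pack`.
    pub fn unpack(word: u32) -> Self {
        Self {
            has_validity: word & 1 != 0,
            has_patches: (word >> 1) & 1 != 0,
            bit_width: ((word >> 2) & 0x7f) as u8,
            offset: ((word >> 9) & 0x3ff) as u16,
            end_padding: ((word >> 19) & 0x3ff) as u16,
        }
    }
}
```

This layout uses 29 bits, so a packed word can never equal 0xffffffff and stays compatible with reserving that value as a spill marker.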
pub struct ByteBoolMetadata {
    validity: ValidityMetadata,
}
pub struct ZigZagMetadata;
DType => PType.
pub struct ExtensionMetadata {
    storage_dtype: DType,
}
Remove length.
pub struct StructMetadata {
    length: usize,
    validity: ValidityMetadata,
}
pub struct ChunkedMetadata {
    num_chunks: usize,
}
Remove len, DType => PType, Scalar => ScalarValue.
pub struct SparseMetadata {
    indices_dtype: DType,
    // Offset value for patch indices as a result of slicing
    indices_offset: usize,
    indices_len: usize,
    len: usize,
    fill_value: Scalar,
}
Scalar => ScalarValue, remove length.
pub struct ConstantMetadata {
    scalar: Scalar,
    length: usize,
}
Remove length.
pub struct BoolMetadata {
    validity: ValidityMetadata,
    length: usize,
    bit_offset: usize,
}
pub struct VarBinViewMetadata {
    validity: ValidityMetadata,
    data_lens: Vec<usize>,
}
We currently implement naive serde using Rust serde + flexbuffers by default. Many arrays can pack their metadata much more tightly. This is an overview issue to track auditing each one.