spiraldb / vortex

An extensible, state-of-the-art columnar file format
https://vortex.dev
Apache License 2.0

Narrow indices arrays during write / compression #1447

Open gatesn opened 3 days ago

gatesn commented 3 days ago

Indices arrays are typically u64, even though they rarely need to be that wide. That means that during decompression we unpack into far more memory than necessary.

We could narrow them at write time, perhaps in the same operation (which doesn't exist yet) that resets offsets to zero (i.e. shifts boolean arrays to be zero-aligned) and trims unused dictionary values.
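To make the write-time idea concrete, here is a minimal sketch in plain Rust. It does not use any Vortex types: `NarrowIndices` and `narrow_indices` are hypothetical names invented for illustration, and the real arrays would go through the library's own array and dtype machinery.

```rust
/// Hypothetical container for indices stored at the narrowest unsigned width.
/// This is an illustration only, not a Vortex type.
#[derive(Debug, PartialEq)]
pub enum NarrowIndices {
    U8(Vec<u8>),
    U16(Vec<u16>),
    U32(Vec<u32>),
    U64(Vec<u64>),
}

/// Narrow `indices` to the smallest unsigned width that can hold its maximum value.
/// An empty input has no maximum, so it is treated as fitting in `u8`.
pub fn narrow_indices(indices: &[u64]) -> NarrowIndices {
    let max = indices.iter().copied().max().unwrap_or(0);
    if max <= u8::MAX as u64 {
        // Every value is <= max, so each cast below is lossless.
        NarrowIndices::U8(indices.iter().map(|&i| i as u8).collect())
    } else if max <= u16::MAX as u64 {
        NarrowIndices::U16(indices.iter().map(|&i| i as u16).collect())
    } else if max <= u32::MAX as u64 {
        NarrowIndices::U32(indices.iter().map(|&i| i as u32).collect())
    } else {
        NarrowIndices::U64(indices.to_vec())
    }
}
```

With this sketch, `narrow_indices(&[0, 3, 255])` yields the `U8` variant, while a single value of 256 is enough to push the whole array to `U16`. The cost is one extra pass over the indices to find the maximum, which a writer may already have if it computes stats.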

Alternatively, we could use stats at read time and add some sort of "canonicalize_smallest / canonicalize_into(DType)" function that lets us decompress into the smallest viable width.
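A rough sketch of the read-time alternative, again in plain Rust rather than against the Vortex API. It assumes a max stat is available as a `u64`; `UnsignedWidth` and `smallest_width` are hypothetical, and `canonicalize_into` only borrows its name from the proposal above, with a made-up signature.

```rust
/// Hypothetical stand-in for the unsigned integer dtypes a reader could target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnsignedWidth {
    U8,
    U16,
    U32,
    U64,
}

/// Pick the smallest unsigned width that can represent `max`,
/// where `max` is assumed to come from the array's max stat.
pub fn smallest_width(max: u64) -> UnsignedWidth {
    if max <= u8::MAX as u64 {
        UnsignedWidth::U8
    } else if max <= u16::MAX as u64 {
        UnsignedWidth::U16
    } else if max <= u32::MAX as u64 {
        UnsignedWidth::U32
    } else {
        UnsignedWidth::U64
    }
}

/// Materialize decoded `u64` values directly as the target type `T`.
/// Returns `None` if any value does not fit, e.g. when the stat was stale.
pub fn canonicalize_into<T, I>(decoded: I) -> Option<Vec<T>>
where
    T: TryFrom<u64>,
    I: IntoIterator<Item = u64>,
{
    decoded.into_iter().map(|v| T::try_from(v).ok()).collect()
}
```

A reader would call `smallest_width` on the stat, then dispatch to `canonicalize_into::<u8, _>`, `canonicalize_into::<u16, _>`, and so on. The checked conversion is a deliberate choice in this sketch: it trades a per-value check for safety against an incorrect stat, and a real implementation that trusts its stats could skip it.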