spiraldb / vortex

An extensible, state-of-the-art columnar file format
https://vortex.dev
Apache License 2.0
902 stars 22 forks source link

Add a "bytes" DType #16

Open gatesn opened 8 months ago

gatesn commented 8 months ago

Since dtype is logical type, we should distinguish between uint8 and a byte (with underlying u8 ptype).

This will allow us to perform different compression strategies. e.g. not much point trying bitpicking for an arbitrary byte array.

robert3005 commented 8 months ago

Interesting. I thought the encoding provides differentiation. Arbitrary byte array will be some kind of varbin so varbin when recursing on its children will know what kind of compressions make sense

gatesn commented 8 months ago

Encoding == physical, dtype == logical.

By the same argument, you could say we should have Timestamp encoding and not a Timestamp dtype since it's just an int64 underneath. But I'm not sure that would be right?

Same for UTF8

robert3005 commented 8 months ago

I don’t think the same reasoning leads the result you arrived at. Arbitrary binary type has different logical type already, same for timestamp. If you’d like to apply the same logic you’d have to add seconds/minute/hours and so on dtypes which I’m not advocating for. Physical encoding shouldn’t leak into logical layer imho

gatesn commented 8 months ago

Hmmm, yeah there's a difference between an array of binary values VarBin(Bytes), and a binary array.

Ok, yep, let's instead have the encodings configure the compression ctx based on what they know. e.g. the GCD of 3600 we mentioned.