Open gatesn opened 8 months ago
Interesting. I thought the encoding provides differentiation. Arbitrary byte array will be some kind of varbin so varbin when recursing on its children will know what kind of compressions make sense
Encoding == physical, dtype == logical.
By the same argument, you could say we should have Timestamp encoding and not a Timestamp dtype since it's just an int64 underneath. But I'm not sure that would be right?
Same for UTF8
I don’t think the same reasoning leads the result you arrived at. Arbitrary binary type has different logical type already, same for timestamp. If you’d like to apply the same logic you’d have to add seconds/minute/hours and so on dtypes which I’m not advocating for. Physical encoding shouldn’t leak into logical layer imho
Hmmm, yeah there's a difference between an array of binary values VarBin(Bytes)
, and a binary array.
Ok, yep, let's instead have the encodings configure the compression ctx based on what they know. e.g. the GCD of 3600 we mentioned.
Since dtype is logical type, we should distinguish between uint8 and a byte (with underlying u8 ptype).
This will allow us to perform different compression strategies. e.g. not much point trying bitpicking for an arbitrary byte array.