Closed realratchet closed 3 months ago
If we're going to keep the pickled arrays instead of the new format spec, we can do something like this to compress our strings:
```python
import numpy as np
import pickle as pkl
import lz4.frame as lz4


class CompressedArray(np.ndarray):
    @staticmethod
    def from_compressed(shape, dtype, algo, buffer):
        # Registry of supported codecs; only lz4 for now.
        algos = {
            "lz4": lz4,
        }
        return CompressedArray(shape, dtype, algos[algo].decompress(buffer))

    def __reduce__(self):
        # Pickle a compressed copy of the raw buffer; `from_compressed`
        # decompresses it again on unpickling.
        return self.from_compressed, (
            tuple(self.shape), str(self.dtype), "lz4", lz4.compress(self.tobytes())
        )


abc = np.array(["a", "bc", "defg"]).tobytes()
arr = CompressedArray((3,), "<U4", abc)
pkl_arr = pkl.dumps(arr)
upkl_arr = pkl.loads(pkl_arr)
```
Intro
Here is a quick sketch of the format I'm thinking of, with some inspiration taken from glTF. Here's a quick overview.
Preamble
The first 12 bytes are the preamble, which begins with the magic bytes `TBLT` (which is `0x54424C54`).
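Only the four magic bytes are pinned down above; as a sketch, the remaining 8 bytes of the 12-byte preamble could hold e.g. a format version and a header length. Both of those fields are assumptions for illustration, not part of the proposal:

```python
import struct

MAGIC = b"TBLT"  # 0x54 0x42 0x4C 0x54


def write_preamble(version: int, header_len: int) -> bytes:
    # 12-byte preamble: 4-byte magic + two little-endian u32 fields
    # (version and header length are hypothetical field choices).
    return struct.pack("<4sII", MAGIC, version, header_len)


def read_preamble(buf: bytes):
    magic, version, header_len = struct.unpack_from("<4sII", buf, 0)
    assert magic == MAGIC, "not a TBLT file"
    return version, header_len


pre = write_preamble(1, 128)
assert len(pre) == 12
assert pre[:4] == b"TBLT"
assert read_preamble(pre) == (1, 128)
```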
Header

The header is JSON, to make the format fully extensible in the future. Currently I have these in mind that we should have:
Obviously this format cannot deal with mixed types, although we talked about getting rid of mixed types and replacing them with the datatype that encompasses all of the elements in the column. Obviously this also means that we cannot represent `None`, and have to fill in the default value instead, e.g. an empty string for string types, zero for integer types. If we ever want to go back to mixed types, we can instead have an array of fragments.
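For illustration, a header might look like the sketch below. The key names `dtype` and `shape` are placeholders of my own; only the optional `compression` key comes from this proposal (see the Compression section):

```python
import json

# Hypothetical header; the exact key set is not finalized.
header = {
    "dtype": "<U64",        # numpy-style type string (assumed key name)
    "shape": [1000],        # page shape (assumed key name)
    "compression": "lz4",   # optional; absent/null means uncompressed
}

# The header being plain JSON is what keeps the format extensible.
encoded = json.dumps(header).encode("utf-8")
decoded = json.loads(encoded)
assert decoded.get("compression") == "lz4"
```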
Types

The types that we support:

We keep the numpy modifiers, meaning we can still set `<` and `>` for endianness, and a size where valid; i.e., we can still use `<U64` for a page whose longest string has length 64 and is little-endian.
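Concretely, numpy already exposes these modifiers through `np.dtype`, so the type strings in the header can be fed to it directly:

```python
import numpy as np

# "<U64": little-endian unicode with room for 64 code points per element.
dt = np.dtype("<U64")
assert dt.kind == "U"
assert dt.itemsize == 64 * 4   # numpy stores unicode as UCS-4: 4 bytes/char

# The endianness prefix applies to numeric types as usual
# (numpy reports the native order as "=").
assert np.dtype("<i4").byteorder in ("<", "=")
assert np.dtype(">i4").byteorder in (">", "=")
```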
Compression

One of the current issues we have with the numpy format is the enormous size of string pages due to padding. We still want to keep the padding, because we want good interoperability with Python, but we also want a reduced page size with as minimal an impact on performance as possible. Therefore we introduce a `compression` key in the header. It's an optional key that may or may not exist in the JSON; if not provided, it is treated as `null`. It should only really be used for strings, but could technically be paired with other types too.

Of the multiple decompression algorithms I tried, `lz4` added the least processing time when reading the pages, but there was still some overhead, so maybe the table-producing function should try and see whether it's even necessary to use compression. But that can be parked for now.
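To sketch why compression pays off on padded string pages (zlib stands in for lz4 here just to keep the example dependency-free; the effect is the same):

```python
import zlib
import numpy as np

# A padded string page: mostly zero bytes, so it compresses very well.
page = np.array(["a", "bc", "defg"] * 1000, dtype="<U64")
raw = page.tobytes()   # 3000 * 64 * 4 bytes, mostly zero padding

comp = zlib.compress(raw)
assert len(comp) < len(raw) // 10   # the padding compresses away almost entirely

# Round-trip: decompress and rebuild the array from the contiguous buffer.
back = np.frombuffer(zlib.decompress(comp), dtype="<U64")
assert np.array_equal(back, page)
```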
Extended interface

The current numpy format has no concept of a `date` type which, while it is just a subset of `datetime` and can be fully expressed by it, has a different `__repr__`; we're likely going to get a lot of complaints if all of the `date` values are turned into `datetime`, because the table will be filled with `YYYY-MM-DD 00:00:00` strings.
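To make the complaint concrete: Python's `date` and a midnight `datetime` already stringify differently, and the `datetime` form is exactly what users would see in the table:

```python
from datetime import date, datetime

d = date(2024, 5, 17)
dt = datetime(2024, 5, 17)   # same day, expressed as a datetime

assert str(d) == "2024-05-17"
assert str(dt) == "2024-05-17 00:00:00"   # the dreaded trailing zeros
```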
Handling strings

Even though the compressed strings on disk don't need to be a contiguous block of memory, I think it's still best to store them that way, simply because `np.frombuffer` works much nicer with it. Therefore, we replicate the way numpy stores strings and zero-pad at the end when a string is shorter than the longest string.
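A small sketch of the zero-padded layout and why `np.frombuffer` likes it:

```python
import numpy as np

# Strings shorter than the longest are zero-padded to a fixed width,
# so the whole page is one contiguous buffer...
strings = np.array(["a", "bc", "defg"], dtype="<U4")
buf = strings.tobytes()
assert len(buf) == 3 * 4 * 4   # 3 elements * 4 chars * 4 bytes per char

# ...which means np.frombuffer can reinterpret it without copying.
restored = np.frombuffer(buf, dtype="<U4")
assert restored[0] == "a"      # trailing zero padding is stripped on access
assert list(restored) == ["a", "bc", "defg"]
```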
Page

This is the page data, stored as either a compressed memory block or a contiguous data block. Just a raw binary dump, nothing special.
The downsides
Because we said we're getting rid of mixed types, I accounted for that when designing the data format, although this means we cannot have `None`, which is easy to handle for string pages but personally seems iffy when it comes to other datatypes. The concepting engine has special flags to check for `None` in some widgets; would this make those flags obsolete? Also, treating `None` as the default value, e.g. `0.0` for a `float`, could potentially have undesired side effects. But I may be overthinking it, and this isn't really an issue I should be worried about.
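The side effect is easy to demonstrate; a small sketch with a hypothetical float column:

```python
import numpy as np

# A float column with one missing value, stored with 0.0 as the default:
with_none = [1.5, None, 2.5]
as_page = np.array([0.0 if v is None else v for v in with_none])

# The sentinel silently changes aggregate results:
true_mean = (1.5 + 2.5) / 2           # ignoring the missing value -> 2.0
assert as_page.mean() != true_mean    # 4.0 / 3, because 0.0 is counted
```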
Other possible additions

We could also pull extra information about the page into the JSON, e.g. statistics, if we think it's beneficial; I tried to keep this extensible.