root-11 / tablite

multiprocessing enabled out-of-memory data analysis library for tabular data.

Proposed format specification #116

Closed by realratchet 3 months ago

realratchet commented 9 months ago

Intro

Here is a quick sketch of the format I have in mind, drawing some inspiration from GLTF. Here's an overview.

[image: overview sketch of the proposed format]

Preamble

The first 12 bytes are the preamble.
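As a rough sketch, assuming the preamble copies GLTF's layout (a 4-byte magic, a uint32 version and a uint32 header length), reading it could look like this; the exact field layout is still up for discussion:

import struct

def read_preamble(fp):
    # Assumed GLTF-style layout: 4-byte magic, uint32 version, uint32 header length.
    # None of these fields are final.
    magic, version, header_length = struct.unpack("<4sII", fp.read(12))
    return magic, version, header_length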

Header

The header is JSON, to keep the format fully extensible in the future. Currently these are the keys I have in mind:

{
    "page": {
        "type": <type>,
        "compression"?: "lz4"|null
    }
}

Obviously this format cannot deal with mixed types, although we talked about getting rid of mixed types and replacing them with the datatype that encompasses all of the elements in the column. This also means that we cannot represent None and have to fill with the default value, e.g., empty string for string types, zero for integer types. If we ever want to go back to mixed types, we can instead have an array of fragments.
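As a rough sketch of that promotion (the helper name and the exact promotion rules here are just illustrative):

import numpy as np

def promote_column(values):
    # Find a dtype that encompasses every non-None element in the column.
    present = [v for v in values if v is not None]
    dtype = np.result_type(*(np.asarray(v).dtype for v in present)) if present else np.dtype("<U1")

    # None cannot be represented, so fill with the type's default value:
    # empty string for string pages, zero for numeric pages.
    default = "" if dtype.kind in "US" else dtype.type(0)
    return np.array([default if v is None else v for v in values], dtype=dtype)

promote_column([1, 2.5, None])        # -> array([1. , 2.5, 0. ])
promote_column(["a", None, "defg"])   # -> array(['a', '', 'defg'], dtype='<U4')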

Types

The types that we support:

We keep the numpy dtype modifiers, meaning we can still set < and > for endianness and size where valid, i.e., we can still use <U64 for a page whose longest string is 64 characters and which is little-endian.
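For example, these dtype strings parse directly with numpy:

import numpy as np

dt = np.dtype("<U64")     # little-endian unicode page, longest string 64 characters
dt.kind, dt.itemsize      # ('U', 256): 64 characters * 4 bytes per UCS-4 code point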

Compression

One of the current issues with the numpy format is the enormous size of string pages due to padding. We want to keep the padding for good interoperability with Python, but we also want to reduce page size with as little impact on performance as possible. Therefore we introduce the "compression"? key in the header. It's an optional key that may or may not exist in the JSON; if not provided, it is treated as null. It should really only be used for strings, but could technically be paired with other types too.

I tried multiple compression algorithms; lz4 added the least processing time when reading pages back. There is still some overhead, though, so maybe the table-producing function should check whether compression is worth using at all. That can be parked for now.
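For illustration, the round trip with lz4.frame looks roughly like this (the exact sizes depend on the data):

import numpy as np
import lz4.frame

page = np.array(["a", "bc", "defg"] * 10_000)   # <U4, mostly zero padding
raw = page.tobytes()
compressed = lz4.frame.compress(raw)
len(raw), len(compressed)                       # the padding compresses away

restored = np.frombuffer(lz4.frame.decompress(compressed), dtype=page.dtype)
assert np.array_equal(restored, page)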

Extended interface

The current numpy format has no concept of a date type. While date is just a subset of datetime and can be fully expressed by it, it has a different __repr__, and we're likely to get a lot of complaints if every date is turned into a datetime, because the table would be filled with YYYY-MM-DD 00:00:00 strings.
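One way around it, assuming we add a hypothetical "subtype" hint to the header instead of a new binary type, would be to store dates as datetime64 and restore datetime.date on read:

import numpy as np

# Hypothetical header: {"page": {"type": "<M8[s]", "subtype": "date"}}
page = np.array(["2024-01-02", "2024-03-04"], dtype="<M8[s]")

def restore(page, subtype=None):
    # If the header marks the page as dates, drop the time component on read.
    if subtype == "date":
        return [dt.date() for dt in page.astype("O")]
    return page.tolist()

restore(page, subtype="date")   # [datetime.date(2024, 1, 2), datetime.date(2024, 3, 4)]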

Handling strings

Even though compressed strings on disk don't need to be a contiguous block of memory, I think it's still best to store them that way, simply because np.frombuffer works much more nicely with it. Therefore we replicate the way numpy stores strings and zero-pad the end of any string shorter than the longest one.
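Concretely, the page keeps numpy's fixed-width layout, so after decompression it maps straight back:

import numpy as np

strings = np.array(["a", "bc", "defg"])        # dtype <U4: every element padded to 4 characters
strings.tobytes()[:16]                         # b'a\x00\x00\x00...' -- "a" zero-padded to 16 bytes

np.frombuffer(strings.tobytes(), dtype="<U4")  # maps straight back, no per-element parsing needed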

Page

This is the page data, stored either as a compressed block or as a contiguous data block. It is just a raw binary dump, nothing special.

The downsides

Because we said we're getting rid of mixed types, I accounted for that when designing the format. This means we cannot have None, which is easy to handle for string pages but personally seems iffy when it comes to other datatypes. The Concepting engine has special flags to check for None in some widgets; would this make those flags obsolete?

Also, treating None as a default value, e.g., 0.0 for floats, could potentially have undesired side effects. But I may be overthinking it, and this isn't really an issue I should be worried about.

Other possible additions

We could also pull extra information about the page into the JSON, e.g., statistics, if we think it's beneficial; I tried to keep this extensible.
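Purely as an illustration (none of these keys are agreed on), the header could grow to something like:

{
    "page": {
        "type": "<f8",
        "compression"?: "lz4"|null,
        "statistics"?: {
            "min": 0.0,
            "max": 1.0,
            "length": 100
        }
    }
}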

realratchet commented 9 months ago

If we're going to keep the pickled arrays instead of the new format spec, we can do something like this to compress our strings:

import numpy as np
import pickle as pkl
import lz4.frame as lz4

class CompressedArray(np.ndarray):
    @staticmethod
    def from_compressed(shape, dtype, algo, buffer):
        # Rebuild the array from the compressed buffer when unpickling.
        algos = {
            "lz4": lz4
        }

        return CompressedArray(shape, dtype, algos[algo].decompress(buffer))

    def __reduce__(self):
        # Pickle the array as (shape, dtype, algorithm, compressed bytes).
        return self.from_compressed, (tuple(self.shape), str(self.dtype), "lz4", lz4.compress(self.tobytes()))

abc = np.array(["a", "bc", "defg"]).tobytes()   # <U4 buffer, zero-padded to the longest string
arr = CompressedArray((3, ), "<U4", abc)

pkl_arr = pkl.dumps(arr)        # compressed on the way out
upkl_arr = pkl.loads(pkl_arr)   # decompressed on the way back in