mtiller / recon

Web and network friendly simulation data formats
MIT License

Support for compression #7

Closed: xogeny closed this issue 10 years ago

xogeny commented 10 years ago

A key goal of this format is to minimize reads. If compression were supported, it would have to be fairly localized (e.g. compressing individual columns) so that it does not increase the number of reads.

Header compression is possible, but it would be a bit problematic. The ID would have to reflect the fact that the header is compressed, and the length information that precedes each document couldn't be included in the compression (again, because of the impact on reads).
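A minimal sketch of that constraint, under stated assumptions: the two IDs, the JSON header encoding, and the 4-byte little-endian length prefix are all hypothetical stand-ins, not the actual recon layout. The point is only that the ID and the length stay uncompressed so a reader can size its next read before decompressing anything.

```python
import io
import json
import struct
import zlib

# Hypothetical format IDs (same length on purpose); not the real recon IDs.
ID_PLAIN = b"demo:v1:raw"
ID_ZLIB = b"demo:v1:zlb"


def write_header(stream, header, compress=False):
    """Write ID, uncompressed length prefix, then the header payload."""
    payload = json.dumps(header).encode("utf-8")
    if compress:
        payload = zlib.compress(payload)
    stream.write(ID_ZLIB if compress else ID_PLAIN)
    # The length is written outside the compressed payload so the reader
    # knows exactly how many bytes to fetch in a single read.
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)


def read_header(stream):
    """Read the ID and length, then fetch and decode the header."""
    ident = stream.read(len(ID_PLAIN))
    (length,) = struct.unpack("<I", stream.read(4))
    payload = stream.read(length)
    if ident == ID_ZLIB:
        payload = zlib.decompress(payload)
    elif ident != ID_PLAIN:
        raise ValueError("unknown format ID")
    return json.loads(payload.decode("utf-8"))
```

With this framing, a compressed header costs the same number of reads as an uncompressed one; only the ID tells the reader which decoding path to take.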

Compressing columns is more likely to have a significant impact on storage space than compressing the header (which probably won't include much repetitive data).
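As a rough illustration of localized, per-column compression: the sketch below compresses each column independently with zlib and records its offset and length in an index, so a single column can still be fetched with one read. The function names and the index layout are assumptions made for this example, not part of recon.

```python
import struct
import zlib


def pack_columns(columns):
    """Compress each column on its own and return (index, blob).

    index maps a column name to (offset, compressed_length, value_count).
    """
    index = {}
    chunks = []
    offset = 0
    for name, values in columns.items():
        # Encode the column as little-endian 64-bit floats before compressing.
        raw = struct.pack("<%dd" % len(values), *values)
        comp = zlib.compress(raw)
        index[name] = (offset, len(comp), len(values))
        chunks.append(comp)
        offset += len(comp)
    return index, b"".join(chunks)


def read_column(index, blob, name):
    """Decompress exactly one column using its index entry."""
    offset, length, count = index[name]
    raw = zlib.decompress(blob[offset:offset + length])
    return list(struct.unpack("<%dd" % count, raw))
```

Because every column is its own compressed unit, reading `x` never touches the bytes of any other column, which is what keeps the read count unchanged.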

An open question would be: what type of compression? We'd want to use something that is typically available as part of standard libraries. For Python, zlib and bz2 seem to be easily accessible. But what about the Java and C platforms?

xogeny commented 10 years ago

Wow, I created a best-case scenario involving vectors of length 100 padded with zeros. The compressed version was 1/4 the size of the uncompressed one. Not bad considering that it isn't doing global compression.

It will be interesting to try this with some real data to see if it performs well in real world scenarios.
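A best-case experiment of this kind could be set up roughly as follows. The details are assumptions: "padded with zeros" is read here as a few nonzero values followed by zeros up to length 100, values are stored as 64-bit floats, and the helper names are made up for the example. The resulting ratios depend on those choices and are not meant to reproduce the figure quoted above.

```python
import bz2
import random
import struct
import zlib


def make_padded_vector(n_values, length=100, seed=0):
    """Return n_values random floats followed by zeros up to length."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_values)]
    return values + [0.0] * (length - n_values)


def compression_ratio(values, compress):
    """Compressed size divided by raw size for one column of floats."""
    raw = struct.pack("<%dd" % len(values), *values)
    return len(compress(raw)) / len(raw)


vector = make_padded_vector(n_values=10)
for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    print(name, round(compression_ratio(vector, compress), 3))
```

Swapping `make_padded_vector` for columns taken from real simulation results would give the real-world comparison, and the same loop also shows how the two standard-library codecs differ on short columns.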