h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

Internal compression filter option using gzip per column #1748

Open · pseudotensor opened this issue 5 years ago

pseudotensor commented 5 years ago

Add a compression filter option using gzip per column in (say) a Jay file. This leads to 10X savings in disk storage for large data sets.

https://github.com/h2oai/h2oai/issues/6211#issuecomment-476905136

Related code:

https://github.com/h2oai/h2oai/blob/dev/h2oaicore/systemutils.py#L1331-L1384
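
For a rough sense of the payoff, here is a minimal sketch of the per-column idea using the standard `gzip` module on NumPy buffers; the column names and data are made up for the example, and this is not how the Jay writer actually stores columns.

```python
import gzip
import numpy as np

# Hypothetical columns standing in for a frame's column buffers.
columns = {
    "id":   np.arange(10_000_000, dtype=np.int64),
    "flag": np.zeros(10_000_000, dtype=np.int8),
}

# Compress each column independently and report the savings.
for name, col in columns.items():
    raw = col.tobytes()
    packed = gzip.compress(raw, compresslevel=6)
    print(f"{name}: {len(raw):,} -> {len(packed):,} bytes "
          f"({len(raw) / len(packed):.0f}x smaller)")
```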

pseudotensor commented 5 years ago

This could even allow us to pass an arbitrary compression filter, either for the whole file or per column.

pseudotensor commented 5 years ago

https://stackoverflow.com/questions/12660028/reading-memory-mapped-bzip2-compressed-file

With bzip2 one can seek to specific byte offsets as many times as needed, so AFAIK one can still effectively memory-map a compressed bzip2 file.
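
One way to get that seekability without any special bzip2 library is to write each column as its own bzip2 stream and record its byte offset and length; this is only a sketch of the idea using the standard `bz2` module, not anything datatable does today.

```python
import bz2
import numpy as np

# Hypothetical columns for illustration.
columns = {
    "x": np.arange(1_000_000, dtype=np.int32),
    "y": np.zeros(1_000_000, dtype=np.float64),
}

# Write every column as an independent bzip2 stream, remembering (offset, length).
offsets = {}
with open("columns.bz2", "wb") as out:
    pos = 0
    for name, col in columns.items():
        stream = bz2.compress(col.tobytes())
        out.write(stream)
        offsets[name] = (pos, len(stream))
        pos += len(stream)
```

With the offset table written somewhere in the file (e.g. in a footer), a reader can jump straight to the one stream it needs, which is what the next sketch does.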

pseudotensor commented 5 years ago

Memory mapping is not always a clear benefit, and taking up 10X the disk space can matter; that is a major negative of the current memory-mapped (uncompressed) format.

E.g. in DAI I think we primarily read columns into memory one at a time anyway, so per-column compression could make loading a file much faster (loading a 3GB file instead of a 30GB one).
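
Continuing the sketch above, the read side could memory-map the compressed file and decompress only the requested column, so only the (much smaller) compressed bytes are touched; again this is just an illustration, not DAI's or datatable's actual code.

```python
import bz2
import mmap
import numpy as np

def read_column(path, offsets, name, dtype):
    """Decompress a single column on demand, given the (offset, length) table
    produced by the writer sketch above."""
    start, length = offsets[name]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return np.frombuffer(bz2.decompress(mm[start:start + length]), dtype=dtype)

# y = read_column("columns.bz2", offsets, "y", np.float64)
```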

st-pasha commented 5 years ago

The SO question that you linked ultimately refers to this library for partial decompression of bzip2 files: https://github.com/bxlab/bx-python/blob/325f495f3d6273f225acd3097216bbbfe462facf/src/bunzip/micro-bunzip.c

Unfortunately, it is LGPL-licensed, which means it's not of much use to us... There may be other, more permissively licensed tools that do the same, but I'm not too hopeful. Still, we could use compression for storage and then decompress when the data is needed.
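
A coarse interim version of "compress for storage, decompress when needed" can be done entirely outside datatable by gzipping the whole Jay file at rest and inflating it before reading. This is only a sketch, assuming `dt.fread()` accepts a Jay file path; temp-file cleanup is omitted for brevity.

```python
import gzip
import shutil
import tempfile
import datatable as dt

def save_compressed(frame, path):
    # Write the Jay file normally, then gzip it for storage.
    tmp = tempfile.NamedTemporaryFile(suffix=".jay", delete=False)
    frame.to_jay(tmp.name)
    with open(tmp.name, "rb") as src, gzip.open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)

def load_compressed(path):
    # Inflate back to a temporary .jay file and read it.
    tmp = tempfile.NamedTemporaryFile(suffix=".jay", delete=False)
    with gzip.open(path, "rb") as src:
        shutil.copyfileobj(src, tmp)
    tmp.close()
    return dt.fread(tmp.name)
```

This loses the per-column selectivity the issue asks for and gives up memory-mapping during the decompression step; it only trades disk space for a one-time decompression cost.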

st-pasha commented 5 years ago

Prerequisite: #1396