Open pseudotensor opened 5 years ago
This could even allow us to pass any kind of filter, either for the whole file or per column.
https://stackoverflow.com/questions/12660028/reading-memory-mapped-bzip2-compressed-file
bzip2 is block-based, so with an index of block offsets one can seek to specific byte positions and decompress only the blocks that are needed; AFAIK that makes it possible to effectively memory-map a compressed bzip2 file.
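A minimal sketch of the seek-and-decompress idea, assuming a multi-stream bzip2 file (as produced by e.g. `pbzip2`): each stream compresses one chunk independently, so an index of stream start offsets lets a reader jump straight to the chunk it wants. (Inside a single stream, blocks actually start at *bit* offsets, which is why micro-bunzip has to build a bit-level index; separate streams keep this sketch simple.)

```python
import bz2

chunks = [b"column-0 data", b"column-1 data", b"column-2 data"]
offsets, blob = [], b""
for chunk in chunks:
    offsets.append(len(blob))           # byte offset where this stream starts
    blob += bz2.compress(chunk)         # one independent bzip2 stream per chunk

def read_chunk(blob, offset):
    """Decompress only the single bzip2 stream starting at `offset`."""
    d = bz2.BZ2Decompressor()
    return d.decompress(blob[offset:])  # stops at the end of the first stream

print(read_chunk(blob, offsets[1]))     # b'column-1 data'
```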
Memory mapping is not always a clear win: the uncompressed file can take up 10X the disk space, which is a major downside.
E.g. in DAI I think we primarily read columns into memory one at a time anyway, so per-column compression could make loading a file much faster (reading a 3GB file instead of a 30GB one).
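A hedged sketch of what per-column compression could look like (all names here are hypothetical, not the actual Jay layout): each column is gzipped independently and an index records `(offset, length)` pairs, so a reader can decompress a single column without touching the rest of the file.

```python
import gzip
import io

def write_columns(columns):
    """Gzip each column separately; return the blob and an (offset, length) index."""
    buf, index = io.BytesIO(), []
    for col in columns:
        data = gzip.compress(col)
        index.append((buf.tell(), len(data)))
        buf.write(data)
    return buf.getvalue(), index

def read_column(blob, index, i):
    """Decompress only column i, using the recorded offset and length."""
    off, length = index[i]
    return gzip.decompress(blob[off:off + length])

cols = [b"\x00" * 1000, b"\x01" * 1000]   # toy column payloads
blob, index = write_columns(cols)
print(read_column(blob, index, 1) == cols[1])   # True
```

In a real file format the index would be serialized into a footer rather than kept in memory, but the access pattern is the same.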
The SO question that you linked ultimately refers to this library for partial decompression of bzip2 files: https://github.com/bxlab/bx-python/blob/325f495f3d6273f225acd3097216bbbfe462facf/src/bunzip/micro-bunzip.c
Unfortunately, it is LGPL-licensed, which means it's not much use to us... There could be other, more permissively licensed tools that do the same, but I'm not too hopeful. Still, we can use compression for storage and then decompress when the data is needed.
Prerequisite: #1396
Add a compression filter option using gzip per column in (say) the Jay file format. This would give roughly 10X disk-storage savings for large data sets.
https://github.com/h2oai/h2oai/issues/6211#issuecomment-476905136
Related code:
https://github.com/h2oai/h2oai/blob/dev/h2oaicore/systemutils.py#L1331-L1384