iu-parfunc / gibbon

A compiler for functional programs on serialized data
http://iu-parfunc.github.io/gibbon/

Compressed regions? (using a fast compression algorithm, e.g. LZ4 or Zstandard) #130

Open csabahruska opened 3 years ago

csabahruska commented 3 years ago

Hi, would it be possible to add support for data compression? Modern compression algorithms (Zstd, LZ4) are very fast and could help in I/O-bound cases. Is this a silly idea?

ckoparkar commented 3 years ago

@csabahruska What do you have in mind when you say "add support for data compression"?

If you're thinking of de/compressing the entire input/output byte-stream, it's certainly possible. But I wonder what would be more efficient, "mmap -> decompress -> RUN PROG -> compress -> write" or the current "mmap -> RUN PROG -> write", even for programs that do a lot of IO. We've certainly never benchmarked a program like this, but it's an interesting step.
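To make the two pipelines concrete, here is a minimal sketch in Python, using the stdlib zlib codec as a stand-in for LZ4/Zstd (the whole-stream de/compress pattern is the same); `run_prog` is a hypothetical placeholder for the compiled Gibbon program:

```python
import zlib

def run_prog(data: bytes) -> bytes:
    # Stand-in for the compiled Gibbon program; here it just reverses bytes.
    return data[::-1]

def run_with_compression(compressed_input: bytes) -> bytes:
    # "mmap -> decompress -> RUN PROG -> compress -> write"
    raw = zlib.decompress(compressed_input)  # inflate the whole input up front
    out = run_prog(raw)                      # run on the flat, uncompressed buffer
    return zlib.compress(out)                # deflate the whole output

def run_uncompressed(raw: bytes) -> bytes:
    # "mmap -> RUN PROG -> write" (the current scheme)
    return run_prog(raw)
```

Whether the extra de/compress steps pay off would depend on how I/O-bound the program is relative to the codec's throughput; that is exactly the benchmark we have never run.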

csabahruska commented 3 years ago

I mean adding support for compressed regions that can hold compressed data generated at runtime: not just the program input, but any input or output buffer that a LoCal/Gibbon program handles in general. Could LoCal be extended to track the representation of each region (uncompressed, or compressed plus the algorithm) and generate a program that transforms the input/output data on the fly (via the LZ4/Zstd streaming APIs) as it processes it? So what I mean is a generic solution applied in the innards of LoCal, not just a compression wrapper around the LoCal code.

ckoparkar commented 3 years ago

Could LoCal be extended to track the representation of the regions (uncompressed / compressed + algorithm) and generate program that immediately (LZ4/Zstd streaming API) transforms the input/output data before processing it?

Let's consider an example to flesh out this situation a bit more. Consider a Gibbon program which reads input data from disk into a region r1, performs some computation foo on region r1, allocates the output into another region r2, and then writes this region r2 to disk again. And suppose that the input data was lz4 compressed. Are you wondering if Gibbon can be made sufficiently smart so that the data in regions r1 and r2 would remain in lz4 format the whole time? In other words, the computation foo would run directly on a compressed region r1, and would produce a compressed region r2, without ever having to decompress?

--

Longer term, we'd like to support computations over different encodings such as Avro, protobufs, etc. I was going through LZ4's block and frame formats, but I'm still trying to understand what computation over them would look like. It's something to think about.

csabahruska commented 3 years ago

Yes, that is what I meant. Just a little clarification:

without ever having to decompress?

Without ever fully decompressing. Instead, a tiny piece of the compressed data (a sliding window) would be decompressed just before foo uses it, and the result would immediately be compressed back via the lz4 streaming API.

And each region could have its own representation format that the programmer controls.
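The sliding-window idea can be illustrated like this (a sketch only: Python's stdlib zlib streaming objects stand in for the lz4 streaming API, `foo` is a hypothetical bytewise computation, and the window size is an arbitrary assumption):

```python
import zlib

WINDOW = 4096  # hypothetical window size

def foo(window: bytes) -> bytes:
    # Stand-in for the per-window computation (here: upper-casing).
    return window.upper()

def stream_transform(compressed: bytes) -> bytes:
    """Decompress one small window at a time, process it, and immediately
    recompress the result; the full uncompressed data is never materialized."""
    dec = zlib.decompressobj()
    com = zlib.compressobj()
    out = bytearray()
    for i in range(0, len(compressed), WINDOW):
        window = dec.decompress(compressed[i:i + WINDOW])
        out += com.compress(foo(window))
    out += com.compress(foo(dec.flush()))
    out += com.flush()
    return bytes(out)
```

Note this only works directly when foo is local, i.e. its output for a window depends on that window alone; tree traversals with random access would need something more like the block-indexed layout discussed below.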

csabahruska commented 3 years ago

I've found a project that explores the in-memory compressed data representation idea: https://blosc.org/pages/blosc-in-depth/ https://www.blosc.org/docs/StarvingCPUs-CISE-2010.pdf https://github.com/Blosc/c-blosc https://blosc.org/

ckoparkar commented 3 years ago

Hi @csabahruska. Sorry for being slow --- we were busy with ICFP work.

This Blosc project looks nice, thanks for sharing it with us! The CISE-2010 PDF motivates the problem nicely, but it doesn't contain much detail on how the blocking technique actually works. As far as I understand, they propose splitting the input data into small blocks and then operating on those. I'll have to look at it more closely, because I'm definitely missing something.
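If I've understood the blocking idea correctly, a toy version looks like this (a sketch in Python, with stdlib zlib standing in for Blosc's codecs; the block size is an assumption, chosen so one decompressed block fits in cache):

```python
import zlib

BLOCK = 64 * 1024  # hypothetical block size, small enough to fit in cache

def compress_blocked(data: bytes) -> list[bytes]:
    # Each block is compressed independently, so a consumer can
    # decompress just the blocks it needs, one at a time.
    return [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def sum_bytes(blocks: list[bytes]) -> int:
    # Operate block-by-block: only one decompressed block is live at a time,
    # which is what lets the decompression overlap with cache-resident work.
    return sum(sum(zlib.decompress(b)) for b in blocks)
```

The memory-bandwidth argument in the paper is that shipping compressed blocks from RAM and inflating them in cache can beat shipping the raw bytes, when the codec is fast enough.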

With respect to directly operating on lz4 data: it seems possible, but I'm having trouble visualizing what the memory access pattern of even a simple function would look like. That's because I've never used lz4 directly, which is why lots of things are unclear to me. I guess a direct way forward is to write a small C program and see how it behaves. I'll try that this week and report back.

csabahruska commented 3 years ago

Hi,

The following paper describes the blocking technique in greater detail. https://www.blosc.org/docs/Breaking-Down-Memory-Walls.pdf

amigalemming commented 3 years ago

Sounds like a crazy idea. Do you intend to save memory for long-lived in-memory data, or to increase throughput by exploiting cache effects, or what is the motivation?

csabahruska commented 3 years ago

Both cases are valid. https://www.blosc.org/posts/beast-release/