quixdb / squash

Compression abstraction library and utilities
https://quixdb.github.io/squash/
MIT License

Report basic time and memory performance characteristics #36

Open · Intensity opened this issue 11 years ago

Intensity commented 11 years ago

It's a great benefit to see so many compressors integrated under a common and flexible interface. I can imagine a variety of use cases where the caller is interested in trying various algorithms and parameters. When those algorithms are run, the resulting compressed size for that data is a valuable data point, and it comes back automatically as part of the call's results. However, many of the compression comparison pages that describe the differences between algorithms and parameters include additional information: the (de)compression time and the memory used in each step.

I'd value a common, easy-to-use interface to the effective CPU and memory overhead of a particular algorithm with the provided options, perhaps as simple as a structure populated with that data. I'd like to learn how much memory was allocated during (de)compression and weigh that accordingly. A separate ticket (another enhancement) would be to deliberately place a hard limit on the memory available to an algorithm (by wrapping its malloc() and free()); for now it would be helpful just to learn how much memory the algorithm did use with the provided compression parameters, even if it's reported in a very basic way. Also (again, willing to put this in another ticket), maybe the number of threads spawned could be reported, and perhaps even specified.
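To make the request concrete, here's a rough sketch of the kind of structure I have in mind (the type name and every field are purely illustrative; nothing like this exists in Squash today):

```c
/* Hypothetical sketch: the name and fields here are illustrative,
 * not an existing Squash type. */
typedef struct {
  double wall_time;        /* real (wall-clock) seconds spent in the call */
  double cpu_time;         /* CPU seconds spent in the call */
  size_t peak_memory;      /* peak bytes allocated during the call */
  int    threads_spawned;  /* worker threads the plugin created */
} SquashPerfStats;
```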

Finally, it would be helpful to see how long the (de)compression takes, measured in both CPU time and real time. This wouldn't need to take over the job of a benchmarking tool (which would run the test repeatedly to get a good estimate, perhaps flush caches before each run, and so on), but the basic time spent in the algorithm would be a helpful statistic, measured at the finest resolution the platform supports.
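On the timing side, something as basic as the following would do. This is just a sketch using POSIX clock_gettime(); compress_call() is a placeholder for whatever Squash operation is being measured:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Placeholder for the (de)compression operation being measured. */
static void compress_call (void) { /* ... */ }

int main (void)
{
  struct timespec wall_start, wall_end, cpu_start, cpu_end;

  /* CLOCK_MONOTONIC measures real (wall-clock) time;
   * CLOCK_PROCESS_CPUTIME_ID measures CPU time used by this process. */
  clock_gettime (CLOCK_MONOTONIC, &wall_start);
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &cpu_start);

  compress_call ();

  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &cpu_end);
  clock_gettime (CLOCK_MONOTONIC, &wall_end);

  double wall = (wall_end.tv_sec - wall_start.tv_sec)
              + (wall_end.tv_nsec - wall_start.tv_nsec) / 1e9;
  double cpu  = (cpu_end.tv_sec  - cpu_start.tv_sec)
              + (cpu_end.tv_nsec - cpu_start.tv_nsec) / 1e9;

  printf ("wall: %.6f s, cpu: %.6f s\n", wall, cpu);
  return 0;
}
```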

nemequ commented 11 years ago

I'm assuming you're talking about something to help you write your own benchmarks, not just the benchmarking tool distributed with Squash. That tool does give you CPU time and wall-clock time (although the HTML output presents them as compression/decompression rates, and only the CPU time is used for those).

I've been thinking about benchmarks lately, and looking into how to do them right. Long story short: it's hard. I suggest you take a look at Aleksey Shipilev's JVMLS 2013 talk, "JVM Benchmarking". It's about Java, but the first half is pretty generic, and quite interesting. My current thinking is that Squash (and everyone else) should use a benchmarking library which does all that stuff for us. The only thing I've found so far is cbenchsuite, which I haven't yet had time to look at closely, so I don't know whether or not it would be suitable.

Measuring memory usage is tricky. Most libraries don't let us supply custom malloc/free functions, so AFAIK the only way to do it would be to override malloc using LD_PRELOAD. Even then, the numbers may not be accurate. I know mcm is planning to use mmap + MAP_ANONYMOUS instead of malloc (perhaps it already does; I haven't looked at the code since Mathieu Chartier mentioned that was his plan), and SHARC will emit an error if memory isn't aligned to some boundary (IIRC 8 bytes, but I'm not positive), so I would be surprised if its internal buffers didn't use posix_memalign, and I don't know whether overriding malloc would catch that. Just asking the OS how much memory is in use before and after isn't reliable either, because most mallocs use sbrk, so they can't necessarily return all the memory they've allocated back to the OS.
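To illustrate, the LD_PRELOAD approach would look roughly like this. It's a deliberately simplified sketch: not thread-safe, it only catches malloc/free (not calloc, realloc, posix_memalign, or mmap), and malloc_usable_size() is glibc-specific; those gaps are exactly the accuracy problem I mean:

```c
/* malloc_spy.c: a simplified allocation tracker.
 * Build:  gcc -shared -fPIC -o malloc_spy.so malloc_spy.c -ldl
 * Run:    LD_PRELOAD=./malloc_spy.so ./program
 * Not thread-safe; misses calloc/realloc/posix_memalign/mmap. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <malloc.h>   /* malloc_usable_size (glibc-specific) */
#include <stdio.h>

static void* (*real_malloc) (size_t) = NULL;
static void  (*real_free)   (void*)  = NULL;
static size_t current_bytes = 0;
static size_t peak_bytes    = 0;

void* malloc (size_t size)
{
  if (real_malloc == NULL)
    real_malloc = (void* (*)(size_t)) dlsym (RTLD_NEXT, "malloc");

  void* ptr = real_malloc (size);
  if (ptr != NULL) {
    current_bytes += malloc_usable_size (ptr);
    if (current_bytes > peak_bytes)
      peak_bytes = current_bytes;
  }
  return ptr;
}

void free (void* ptr)
{
  if (real_free == NULL)
    real_free = (void (*)(void*)) dlsym (RTLD_NEXT, "free");

  if (ptr != NULL)
    current_bytes -= malloc_usable_size (ptr);
  real_free (ptr);
}

/* Print the high-water mark when the program exits. */
__attribute__((destructor))
static void report (void)
{
  fprintf (stderr, "peak tracked heap usage: %zu bytes\n", peak_bytes);
}
```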

Intensity commented 11 years ago

I am interested in being able to request (by way of options) basic timing information from an individual compressor invocation, even if it's an in-memory request. I'd find value in getting that information for an individual invocation of the compression, not just as part of a separately run benchmark. It seems that in the benchmarking plugin clock_gettime() is preferred, but more generally other approaches can be used on other platforms, so this is open to further tweaking. The request is also a placeholder to express interest in measuring other performance bottlenecks - for example, deducing how much memory latency or processor cache size got in the way. Compiler choice, processor type, and available resources all affect the resulting performance, so maybe there is a tool which tries to deduce what the bottlenecks are. Thanks for the reference video on benchmarking. cbenchsuite seems pretty comprehensive, so a subset of it could be used.

As for memory usage, it may not be easy or even possible to get reliable information for all plugins, but a best effort would still be useful, even if it only covers a subset. Alternative measures can be suggested over time, too. One option is a conditional preprocessor define that overrides the malloc() calls in the code with a version that keeps statistics on allocated memory. For the other approaches, like mmap or the aligned-memory requirement, a different override could be used, or some plugins might just have to fall back to less reliable estimates of memory usage.
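The preprocessor variant would be along these lines (hypothetical names; it only works for plugins we compile ourselves, since the define has to be injected at build time):

```c
/* track_alloc.h: hypothetical sketch, not part of Squash.
 * Injected when compiling a plugin, e.g. "gcc -include track_alloc.h ...",
 * so it only covers code we build ourselves.  Fragile if the macros are
 * seen before <stdlib.h>, and it misses calloc/realloc/mmap. */
#ifndef TRACK_ALLOC_H
#define TRACK_ALLOC_H

#include <stdlib.h>   /* pull in the real prototypes first */

/* Implemented once in the benchmark; these would update counters the
 * same way as the LD_PRELOAD sketch above. */
void* squash_tracked_malloc (size_t size);
void  squash_tracked_free   (void* ptr);

#define malloc(size) squash_tracked_malloc (size)
#define free(ptr)    squash_tracked_free (ptr)

#endif /* TRACK_ALLOC_H */
```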

nemequ commented 11 years ago

It seems that in the benchmarking plugin clock_gettime() is preferred, but more generally other approaches can be used on other platforms, so this is open to further tweaking.

Indeed. I had to tweak it recently for FreeBSD, and I'm guessing Windows does something completely different. I'm willing to try to add support for other systems as people need it.

The request is also a placeholder to express interest in measuring other performance bottlenecks - for example, deducing how much memory latency or processor cache size got in the way.

That would definitely be great information to have, but I have no idea how to detect it—I suspect that if there is a way it would be pretty OS-specific (and/or hardware-specific), which means a big and/or complex code base would probably be required to support it, which (at least to me) implies a separate library.

I am interested in being able to request (by way of options) basic timing information from an individual compressor invocation, even if it's an in-memory request. I'd find value in getting that information for an individual invocation of the compression, not just as part of a separately run benchmark.

How do you envision that working? Options are just key/value strings, intended to be shareable across multiple, possibly parallel, compression/decompression operations. With that in mind, I don't see how they can be leveraged to provide this functionality, and I don't think this is a sufficient reason to reconsider that design. Think about using Squash to compress/decompress rows in a database… being able to create a single SquashOptions at initialization and simply reuse it whenever you need to compress or decompress a chunk of data is a big deal.

The only way I see to be helpful here (other than creating another version of every API call) would be to create something like GTimer (only I'd want to support CPU time in addition to wall-clock time) and include it in Squash. I'd probably be okay with something like that if we committed to stopping there, but if we don't, where does it end? Do we add a way to handle warm-up elegantly? Restrict turbo boost? Control scheduling? Memory profiling?
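For the record, here's roughly the shape I'm imagining, nothing more than a sketch (the names are illustrative, not a committed API):

```c
/* Sketch of a GTimer-like timer that tracks wall-clock and CPU time.
 * Illustrative only; not a committed Squash API. */
#include <time.h>

typedef struct {
  struct timespec wall_start;
  struct timespec cpu_start;
  double elapsed_wall;  /* seconds of real time */
  double elapsed_cpu;   /* seconds of CPU time */
} SquashTimer;

static double
timespec_diff (const struct timespec* end, const struct timespec* start)
{
  return (end->tv_sec - start->tv_sec)
       + (end->tv_nsec - start->tv_nsec) / 1e9;
}

static void
squash_timer_start (SquashTimer* timer)
{
  clock_gettime (CLOCK_MONOTONIC, &timer->wall_start);
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &timer->cpu_start);
}

static void
squash_timer_stop (SquashTimer* timer)
{
  struct timespec now;
  clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &now);
  timer->elapsed_cpu = timespec_diff (&now, &timer->cpu_start);
  clock_gettime (CLOCK_MONOTONIC, &now);
  timer->elapsed_wall = timespec_diff (&now, &timer->wall_start);
}
```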

Like I said, I think that stuff belongs in a separate library. I think something similar to a unit testing API (I like glib's, but I'm open to alternatives) would be ideal. I don't really know how cbenchsuite works, but if it fits the bill I'd love to use it instead of reinventing the wheel.

As for memory usage, it may not be easy or even possible to get reliable information for all plugins, but a best effort would still be useful, even if it only covers a subset. Alternative measures can be suggested over time, too. One option is a conditional preprocessor define that overrides the malloc() calls in the code with a version that keeps statistics on allocated memory. For the other approaches, like mmap or the aligned-memory requirement, a different override could be used, or some plugins might just have to fall back to less reliable estimates of memory usage.

This seems to be making my point about a separate library. This is all very cool stuff but it's a lot of bloat to include in Squash, and it's not really compression-specific, so why not a separate library?

nemequ commented 11 years ago

I've split the timer code out of benchmark/benchmark.c into benchmark/timer.c and benchmark/timer.h. This would be the GTimer-like API I was talking about.

It's still not part of Squash itself, but it might be at some point.

nemequ commented 10 years ago

I just pushed a commit to the benchmark program which could make memory reporting in the benchmarks feasible. Before each compress/decompress attempt we now fork(). The main purpose is so we can recover (I hope) when the OOM killer terminates the process that is using all the memory, since that was happening on my Raspberry Pi with LZMA at higher levels. However, it may also be possible to grab the maximum memory usage before we do anything and again after compression/decompression, and use the difference as the memory usage. We would have to add another fork() so we can separate compression from decompression, but I think it could work.
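The measurement itself could be as simple as this sketch (compress_one() is a placeholder for one attempt; note that ru_maxrss is in kilobytes on Linux but bytes on OS X, and it reflects the whole child's peak RSS, not just the codec's allocations):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for a single compression or decompression attempt. */
static void compress_one (void) { /* ... */ }

int main (void)
{
  pid_t pid = fork ();
  if (pid < 0) {
    perror ("fork");
    return EXIT_FAILURE;
  }

  if (pid == 0) {             /* child: do the work, then exit */
    compress_one ();
    _exit (EXIT_SUCCESS);
  }

  int status;
  waitpid (pid, &status, 0);  /* reap the child, even if the OOM killer got it */

  struct rusage usage;
  getrusage (RUSAGE_CHILDREN, &usage);
  /* ru_maxrss: kilobytes on Linux, bytes on OS X. */
  printf ("child peak RSS: %ld\n", usage.ru_maxrss);

  return EXIT_SUCCESS;
}
```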

If we do this, it may be wise to revisit the squash_compress_file* functions. They currently prefer to mmap the input/output and use the buffer-to-buffer API whenever a plugin implements buffer-to-buffer functions, rather than only when the stream functions aren't implemented. That was done because it can provide a performance boost for some codecs (a memcpy is less likely), but if the mapped files get counted in the memory usage it wouldn't really be fair.