quixdb / squash-benchmark

Benchmarking all of the algorithms from Squash
https://quixdb.github.io/squash-benchmark/
MIT License

Unlikely results #17

Closed: nemequ closed this issue 9 years ago

nemequ commented 9 years ago

From Cyan4973/lz4#109:

Also, it's still too soon to link to the current results, as they can at times prove inconsistent. For example these ones, where a few fast compressors including LZ4 score an abysmal decompression speed of < 20 MB/s, which can't be correct by a few orders of magnitude. Once again, it's no reason to worry yet. As said, it's normal for a first version to have a few issues to sort out. It could be a fluke, a minor change, an intermediate bug, whatever. So let's wait for the next version, and we'll start building on that future one.

Best guess right now is that it's because a background service or cron job kicked off there, which I should make more of an effort to prevent.

It would be useful to run the benchmark multiple times on the same machine, then load the data into a spreadsheet and compare the two runs… hopefully that will let us spot any anomalies and figure out a way to improve the methodology to avoid it happening again.
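A minimal sketch of that comparison step, assuming hypothetical column names (`dataset`, `codec`, `level`, `compress_speed`) — the real squash-benchmark CSVs may use a different schema:

```python
import csv
import io

# Hypothetical key columns -- the real squash-benchmark CSVs may differ.
KEY = ("dataset", "codec", "level")

def load(csv_text):
    """Index rows by (dataset, codec, level), keeping compression speed."""
    rows = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows[tuple(row[k] for k in KEY)] = float(row["compress_speed"])
    return rows

def flag_anomalies(run_a, run_b, threshold=1.5):
    """Return keys whose speed differs by more than `threshold`x between runs."""
    flagged = []
    for key in run_a.keys() & run_b.keys():
        a, b = run_a[key], run_b[key]
        if max(a, b) / min(a, b) > threshold:
            flagged.append(key)
    return flagged

run1 = load("dataset,codec,level,compress_speed\n"
            "sum,lz4,1,650.0\n"
            "sum,lz4f,1,600.0\n")
run2 = load("dataset,codec,level,compress_speed\n"
            "sum,lz4,1,18.0\n"       # anomalous: ~36x slower than run1
            "sum,lz4f,1,590.0\n")
print(flag_anomalies(run1, run2))    # [('sum', 'lz4', '1')]
```

Anything flagged this way (a background job kicking in mid-run, say) would stand out without having to eyeball every cell in a spreadsheet.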

Cyan4973 commented 9 years ago

I guess that if you don't have hardware dedicated to benchmarking (at least for the duration of the benchmark), it's difficult to fully automate triage; human assistance is required, which can be a very time-consuming task.

Having multiple separate runs, as you suggest, could help to spot differences.

Cyan4973 commented 9 years ago

I used the opportunity to test your latest Squash release, 0.7, with your benchmark (great job, by the way!).

I was a bit alarmed by some results displayed on https://quixdb.github.io/squash-benchmark/ showing large differences in performance between lz4f and lz4. For example, look at the results for the file sum tested on any x86 machine: lz4f -1 is way faster than lz4 -1, which shouldn't be the case.

After testing on my own laptop, using Squash v0.7 and today's squash-benchmark 057b57af1d4037bad9e66b35f45ac90b855d99da, I saw no such difference. lz4 is in fact slightly faster than lz4f, as expected.

So the issue does not seem to come from the software (neither Squash nor the benchmark).

Could there have been differences in measurement conditions, such as another process taking away resources during the benchmark?

nemequ commented 9 years ago

What machine and dataset? Don't know how I missed that part, I'll look into it now…

I've been very careful about not running anything else at the same time… I disable all the cron jobs, make sure users are logged out, set the CPU governor to "performance". TBH I really don't know what else I can do, though of course I'm open to suggestions…

Cyan4973 commented 9 years ago

It could be the consequence of a recent update. I don't remember having noticed that effect before ...

nemequ commented 9 years ago

Recent update to what? You said you didn't see it with 0.7.0 on your machine…

I'm seeing a pretty big increase in speed if I disable the code that uses mmap for codecs which don't have a streaming API, which I don't remember seeing before. Unfortunately I don't think just disabling it is an option, since the alternative is sucking up huge quantities of RAM… MAP_POPULATE doesn't seem to be helping, either.

nemequ commented 9 years ago

And, FWIW, previous versions of Squash also used mmap, though the relevant part of the code was rewritten for 0.7.0.

Cyan4973 commented 9 years ago

Recent update to what?

Recent update of *.csv files hosted on https://github.com/quixdb/squash-benchmark/tree/gh-pages

There are so many configurations, I can't guarantee having checked them all. I can only say that so far I've never seen situations where lz4f is much faster than lz4.

This effect is not visible on every file. sum is the one I mentioned; most other files don't have such a problem.

I concentrated my latest investigation on smaller samples, since they are more likely to exhibit some "weirdness". For example, xargs.1: on the peltast configuration, lz4 compression speed is strictly identical at level 1 and level 2 (671.86 MB/s), but at level 3 it suddenly drops to 7.72 MB/s, an almost 100x difference. Such an effect could likely be explained by limited measurement accuracy, where rounding up or down swings the result by large amounts.
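The quantization effect described above can be illustrated with a toy model: if the timer only advances in whole ticks, a run that fits inside one tick always reports the same speed, and crossing into a second tick halves the reported number. The sizes and tick granularity below are hypothetical, chosen only to show the shape of the effect:

```python
import math

def measured_speed(size_bytes, true_seconds, tick_seconds):
    """Speed as computed from a timer that only advances in whole ticks."""
    ticks = max(1, math.ceil(true_seconds / tick_seconds))
    return size_bytes / (ticks * tick_seconds) / 1e6  # MB/s

size = 4096   # roughly the size of a small sample like xargs.1 (assumed)
tick = 1e-3   # hypothetical 1 ms timer granularity

# Two runs with very different true durations both fit in one tick,
# so they report exactly the same speed...
print(measured_speed(size, 2e-6, tick))
print(measured_speed(size, 9e-7, tick))
# ...while barely crossing into a second tick halves the reported speed.
print(measured_speed(size, 1.1e-3, tick))
```

With a coarse enough tick, reported speeds for small samples snap to a handful of discrete values, which would explain levels 1 and 2 being bit-for-bit identical while level 3 falls off a cliff.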

nemequ commented 9 years ago

I don't think the issue is the accuracy of the measurements… I'm seeing pretty consistent results, which wouldn't happen if the measurements were unreliable. I think the issue has to do with lz4's interaction with mmap.

First, a quick overview of how the mmap thing works in Squash. If a codec doesn't provide a streaming implementation (the lz4f plugin does, but lz4 doesn't), Squash will attempt to mmap the input and output files and pass pointers to the buffer-to-buffer compression function (like LZ4_compress_limitedOutput), so it doesn't have to read the entire file into memory, which could obviously be a problem for large files.
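A rough sketch of that non-streaming path, using Python's stdlib `mmap` and zlib as a stand-in for lz4 (Squash itself is C, and it hands the output mapping directly to the codec; for brevity this sketch compresses into a temporary buffer first, so it's an illustration of the mapping strategy, not of Squash's actual code):

```python
import mmap
import os
import tempfile
import zlib

def compress_file_mmap(in_path, out_path):
    """Sketch of a Squash-style non-streaming path: mmap the input and
    output files and use a one-shot, buffer-to-buffer codec (zlib here)."""
    in_size = os.path.getsize(in_path)
    with open(in_path, "rb") as fin:
        with mmap.mmap(fin.fileno(), in_size, access=mmap.ACCESS_READ) as src:
            compressed = zlib.compress(src[:])   # one-shot, whole buffer
    # Size the output file first, then write through a writable mapping;
    # dirty pages are flushed to disk by the time the mapping is closed.
    with open(out_path, "w+b") as fout:
        fout.truncate(len(compressed))
        with mmap.mmap(fout.fileno(), len(compressed),
                       access=mmap.ACCESS_WRITE) as dst:
            dst[:] = compressed
    return len(compressed)

# Round-trip check on a small temporary file.
data = b"squash benchmark " * 1000
src = tempfile.NamedTemporaryFile(delete=False)
src.write(data)
src.close()
dst_path = src.name + ".z"
compress_file_mmap(src.name, dst_path)
with open(dst_path, "rb") as f:
    assert zlib.decompress(f.read()) == data
os.unlink(src.name)
os.unlink(dst_path)
print("round-trip ok")
```

The key property is that neither the input nor the output ever has to exist as one big heap allocation; the kernel pages data in and out on demand.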

I'm playing with a patch to control whether to use mmap through an environment variable so it's easy to test, and when mmap is enabled LZ4's performance is much worse than when it isn't. Other codecs don't see anything like that performance drop; for a small file like sum, brieflz sees about a 10% drop. Significant, but nothing like LZ4, and smaller in absolute time, not just as a percentage of the total.

Interestingly, the copy plugin (which is just memcpy) sees a huge performance boost when I tell it to use mmap.

My current thinking is that maybe something changed in a recent kernel which causes a performance regression with mmaped memory on Intel. I want to try squash-0.7.0 on an older kernel, and squash-0.6.0 on 4.1.6, but it's probably going to have to wait until next week.

nemequ commented 9 years ago

I still want to investigate the cause, but I decided to update the benchmark with results that don't use mmap for now. I have updated e-desktop, s-desktop, and peltast; phalanx should come later today, and hoplite tomorrow.

The more I think about it, the more I have a feeling it is a regression in the kernel on x86_64. ARM is still fast with mmap, so it's probably not a bug in Squash; x86_64 was fast until this update, so it's not just that mmap is slow on x86. When running perf with Squash's lz4 codec I saw a lot of stalled cycles in the VMA code in the kernel…

Cyan4973 commented 9 years ago

A kernel regression sounds like a neat bug report. Although, just to be sure, testing with an older kernel version could help to confirm that it is a regression.

In the meantime, while googling about it, I found this interesting entry on SO: http://stackoverflow.com/questions/45972/mmap-vs-reading-blocks

Edit: this doc is also interesting: http://web.cs.ucla.edu/honors/UPLOADS/kousha/thesis.pdf

Edit 2: the problem of result quantization is a different one. It affects very small samples (a few KB) and remains present with or without mmap.

nemequ commented 9 years ago

Yeah, I plan to do a lot of testing next week :(

mmap is actually a really good fit for this (assuming the performance doesn't tank like this), especially for machines without tons of memory and/or when working with large files. There is a bit of random access on the input side, but it's fairly constrained, which means the kernel can drop old pages from the cache; if we had read the entire file into RAM, the kernel would have to swap those pages out in case the data is needed later.

There is a similar issue with the output, only without much random access; instead of swapping pages out when low on memory, then swapping them back in only to write them to disk, with mmap we basically just tell the kernel to write them to disk whenever it thinks is best (with munmap actually flushing everything when we're done).
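The "kernel can drop old pages" behavior described above can also be nudged along explicitly with madvise hints. A small sketch (Linux-specific constants, and `mmap.madvise` requires Python 3.8+; the file here is just a stand-in for a benchmark dataset):

```python
import mmap
import os
import tempfile

# Create a small file to map (stand-in for a benchmark dataset).
f = tempfile.NamedTemporaryFile(delete=False)
f.write(b"x" * mmap.PAGESIZE * 4)
f.close()

with open(f.name, "rb") as fin:
    with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Tell the kernel we will read mostly front to back, so it can
        # read ahead aggressively and drop pages behind us.
        mm.madvise(mmap.MADV_SEQUENTIAL)
        # Touch one byte per page, front to back.
        checksum = sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))
        # Pages we are done with can be dropped from the page cache.
        mm.madvise(mmap.MADV_DONTNEED)

os.unlink(f.name)
print(checksum)  # 4 pages of 'x' (0x78) -> 480
```

None of this changes correctness; it only gives the kernel permission to reclaim memory earlier, which is exactly the win being described for resource-constrained systems.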

Basically, it's a big win for resource-constrained systems (or large files). Lots of the stuff in the benchmark was actually failing (with ENOMEM, or just crashing because of overcommitting, or best-case thrashing badly) on the ARM boards and satellite-a205 before I added the mmap code, and those files aren't even that big.

travisdowns commented 9 years ago

I would be very careful about assuming it is a kernel bug without a pretty decent reproducible test case. If it does in fact seem to happen only on newer kernels, a test case that is as simple as possible helps (e.g., remove LZ4 from the equation and just simulate its access patterns).

That said, it certainly is possible that something has changed here, as there has been a fair amount of development in this area lately. For example, huge page support is being added for the page cache, which could directly affect read/write and mmap of files. I'm not sure if it has hit mainline at the moment.

As a test you might consider turning off transparent huge pages and seeing if that has any effect. That one setting is certainly a performance landmine for some workloads (I've seen it blow things up many times).

Is squash using mmap only for the input side, or does it also mmap a writeable area for the output buffer? When writes are in play things get a lot more complicated/less predictable.
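In the spirit of the "simulate its access patterns" suggestion above, a minimal starting point for such a reproducer could compare writing the same output through a file-backed writable mapping versus an ordinary buffered write(), with no compressor involved at all. This is only a skeleton (and Python adds its own overhead, so a serious kernel-regression test case would want to be in C), but it isolates the write-side pattern being discussed:

```python
import mmap
import os
import tempfile
import time

SIZE = 1 << 20              # 1 MiB of output, enough to dirty many pages
payload = os.urandom(SIZE)

def write_via_mmap(path, data):
    """Write output through a file-backed writable mapping (Squash-style)."""
    with open(path, "w+b") as f:
        f.truncate(len(data))
        with mmap.mmap(f.fileno(), len(data), access=mmap.ACCESS_WRITE) as mm:
            mm[:] = data    # every page is dirtied; flushed by munmap

def write_via_write(path, data):
    """Write the same output with an ordinary buffered write()."""
    with open(path, "wb") as f:
        f.write(data)

for fn in (write_via_mmap, write_via_write):
    tf = tempfile.NamedTemporaryFile(delete=False)
    tf.close()
    t0 = time.perf_counter()
    fn(tf.name, payload)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.4f}s")
    with open(tf.name, "rb") as f:
        assert f.read() == payload   # both paths must produce identical output
    os.unlink(tf.name)
```

If the mmap path were dramatically slower only on certain kernels, a stripped-down loop like this (ported to C, run across kernel versions) would make for a much stronger bug report than the full benchmark.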

nemequ commented 9 years ago

Yeah, lots more testing is necessary.

As a test you might consider turning off transparent huge pages and seeing if that has any effect. That one setting is certainly a performance landmine for some workloads (I've seen it blow things up many times).

Good to know, I'll certainly try it.

Is squash using mmap only for the input side, or does it also mmap a writeable area for the output buffer? When writes are in play things get a lot more complicated/less predictable.

Both. And yes, since MAP_POPULATE makes no difference, I'm pretty sure the issue is on the write side. Unfortunately I'm really busy for the rest of the month, so this will have to wait a bit. I've spent too much time playing with Squash in the last couple of weeks and need to catch up in other areas…

nemequ commented 9 years ago

I haven't done extensive testing (i.e., a complete run of the benchmark on all the machines), but I think using huge pages fixes this issue. If you are still seeing this issue, please reopen quixdb/squash#125 (or comment on it and ask me to reopen… not sure how permissions around reopening issues work on GitHub).

I'll try to update the benchmark results soon. I'd like to get Squash 0.8 out in the next couple weeks, but if it is going to take longer than that I'll go ahead and update the results independently of a new release.