Hi,
“It crashes” alone isn't very useful; could you get a backtrace? See e.g.
http://wiki.debian.org/HowToGetABacktrace . Snappy has run fine on ARM-based
systems in the past; we might have broken something, of course.
Also, it's meaningless to compare speed until the code actually passes the
tests (not to mention that you're compiling with assertions, cf. “WARNING:
Compiled with assertions enabled, will be slow.”). That, and you haven't
given any speed numbers for LZO or LZ4 on your machine :-)
Original comment by sgunder...@bigfoot.com
on 30 Jan 2012 at 10:24
Hi,
The tests passed on a Cortex-A9, so this is something specific to your system.
You'll definitely need to provide a backtrace, as well as your system
specifications (hardware and software).
Original comment by se...@google.com
on 30 Jan 2012 at 8:17
The backtrace was useless because of stack corruption. However, I was able to
manually narrow the failing test down to this code in snappy_unittest.cc:
dest[0] = dest[1] = dest[2] = 0xff;
dest[3] = 0x7f;
CHECK(!IsValidCompressedBuffer(TypeParam(dest)));
CHECK(!Uncompress(TypeParam(dest), &uncmp)); // crash happens here
I'm running the test on a BeagleBone SBC using a TI AM3358 ARM Cortex-A8-based
processor (ARMv7) running Ubuntu 11.10 (ARM EABI distribution).
Let me know if you need any more details.
James
Original comment by caprifin...@gmail.com
on 30 Jan 2012 at 10:07
This particular test happens to require 256MB of RAM, as long as
std::string::max_size() is not lower. How much RAM/swap do you have?
Original comment by se...@google.com
on 30 Jan 2012 at 11:19
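For context, the four bytes the failing test writes (ff ff ff 7f) form the
varint32 uncompressed-length preamble, and they decode to just under 256 MB,
which is the allocation the test then forces. A standalone sketch of the
arithmetic (not snappy code):
#include <cstdint>
#include <cstdio>
int main() {
  const uint8_t preamble[4] = {0xff, 0xff, 0xff, 0x7f};
  // Decode as a little-endian base-128 varint: 7 payload bits per byte,
  // high bit set on every byte except the last.
  uint32_t len = 0;
  for (int i = 0; i < 4; ++i) {
    len |= static_cast<uint32_t>(preamble[i] & 0x7f) << (7 * i);
  }
  std::printf("%u\n", len);  // prints 268435455 = 0x0FFFFFFF, just under 256 MB
  return 0;
}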
Aha, that's probably it. This system only has 256MB total RAM.
When I comment out that particular test, everything runs fine. It would be
nice if "make check" or snappy_unittest would automatically show the comparison
table between Snappy and the other algorithms detected at configure time.
However, the speed still seems slow. I'm running my own benchmark to
simulate VPN compression (IPv4 packets archived from web browsing sessions) and
the performance still lags considerably behind LZO and LZ4. The packets are <
1400 bytes and many are already compressed. It would be nice if snappy could
quickly ascertain whether or not a packet is compressed before spending too
many cycles on it.
Performance of the same test on x86 64-bit is good, with Snappy leading the
pack.
James
Original comment by caprifin...@gmail.com
on 31 Jan 2012 at 12:49
If you install the Google Flags package, you can do something like
./snappy_unittest --zlib --lzo --run_microbenchmarks=false benchmarkfile1
benchmarkfile2 etc., and it will compare for you (no nice table, though).
You are still running with assertions on. Add -DNDEBUG to your CFLAGS and
CXXFLAGS and you should see some speed increase. x86 is still markedly faster
than ARM, though.
Snappy already detects already-compressed data and skips it, but with
1400-byte packets it will probably not really start doing so before the
packet is over.
Original comment by se...@google.com
on 31 Jan 2012 at 9:20
It would be very useful if there were a parameter that could be passed to the
compression algorithm (call it punt_length) telling the compressor to examine
no more than punt_length bytes of input before deciding whether the input is
already compressed. If it is, it would return an error code rather than
continuing to copy data from the input to the compressed output.
This would allow apps to calibrate the tradeoff between compressing as much as
possible and spending as few cycles as possible to identify already compressed
data.
BTW, the relative slowness I observed in comment 6 was after building with
-DNDEBUG.
James
Original comment by caprifin...@gmail.com
on 31 Jan 2012 at 7:10
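A caller-side approximation of the punt idea, without changing Snappy itself,
would be to compress a prefix first and bail out if it doesn't shrink. A rough
sketch; the punt_length and threshold parameters and the helper name are made
up here, and the prefix gets compressed twice when the packet turns out to be
compressible:
#include <cstddef>
#include <string>
#include <snappy.h>

// Hypothetical helper: returns false if a punt_length-byte prefix failed to
// shrink below threshold * punt_length, otherwise compresses the whole input.
bool CompressUnlessPunt(const char* input, size_t input_length,
                        size_t punt_length, double threshold,
                        std::string* output) {
  output->clear();
  if (input_length > punt_length) {
    std::string probe;
    snappy::Compress(input, punt_length, &probe);
    if (probe.size() > threshold * punt_length) {
      return false;  // prefix barely shrank; assume already compressed
    }
  }
  snappy::Compress(input, input_length, output);
  return true;
}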
Re punt_length, the algorithm doesn't really work that way; it's adaptive to
the input. There's no single fixed point at which the compressor decides that
the data is noncompressible.
As for general performance, no special performance tuning has been done for
Snappy on ARM; if you want to break out the profiler and take a look at what
can be done, you're more than welcome, but it's not a priority for us currently.
Original comment by sgunder...@bigfoot.com
on 31 Jan 2012 at 7:14
By the way, if you have a Cortex-A8, you should have native unaligned accesses.
You could try modifying this line of snappy-stubs-internal.h:
#if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__)
to read something like
#if 1
and see if it helps anything.
Original comment by se...@google.com
on 31 Jan 2012 at 8:10
The Cortex-A8 appears to allow unaligned access for 16- and 32-bit words, but
not 64-bit.
When I modify the load/store code as follows, I get very competitive results
on the benchmarks:
#define UNALIGNED_LOAD16(_p) (*reinterpret_cast<const uint16 *>(_p))
#define UNALIGNED_LOAD32(_p) (*reinterpret_cast<const uint32 *>(_p))
#define UNALIGNED_STORE16(_p, _val) (*reinterpret_cast<uint16 *>(_p) = (_val))
#define UNALIGNED_STORE32(_p, _val) (*reinterpret_cast<uint32 *>(_p) = (_val))
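// Note: the 64-bit cases below are split into two 32-bit accesses, since the
// A8 only handles unaligned 16- and 32-bit accesses natively.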
inline uint64 UNALIGNED_LOAD64(const void *p) {
  uint64 t;
  reinterpret_cast<uint32 *>(&t)[0] = reinterpret_cast<const uint32 *>(p)[0];
  reinterpret_cast<uint32 *>(&t)[1] = reinterpret_cast<const uint32 *>(p)[1];
  return t;
}

inline void UNALIGNED_STORE64(void *p, uint64 v) {
  reinterpret_cast<uint32 *>(p)[0] = reinterpret_cast<const uint32 *>(&v)[0];
  reinterpret_cast<uint32 *>(p)[1] = reinterpret_cast<const uint32 *>(&v)[1];
}
James
Original comment by caprifin...@gmail.com
on 31 Jan 2012 at 11:19
Oh sigh. I'll have to talk to my local ARM expert and try to figure out what's
going on with the 64-bit loads/stores.
We're also discussing a mitigation of the 256MB issue. It's a bit unfortunate
these two got munged together in the same bug report, but I guess we'll manage.
Original comment by se...@google.com
on 31 Jan 2012 at 11:24
Re: punt_length, what I'm getting at is having a feature where the running
compression ratio would be evaluated at punt_length and the compression would
be aborted if the ratio at that point is not below a given threshold.
James
Original comment by caprifin...@gmail.com
on 31 Jan 2012 at 11:29
That's pretty tricky to get right, though. All matches (which are what give
you compression) naturally point backwards, so the first part of a buffer is
bound to be the least compressible part of any typical packet. If you could
implement this and show that it gives a real improvement, I guess we could
discuss it, but remember that any check, especially one in the inner loop like
this, takes a non-zero amount of time itself.
You could always try adjusting the constants for the heuristic match skipping
in snappy.cc; change the 32 and 5 to, e.g. 8 and 3 (the first is typically set
to 2^(the other one)), and see if it helps. This will make the skipping
significantly more aggressive.
Original comment by se...@google.com
on 31 Jan 2012 at 11:36
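To make the constants concrete, the match-skipping heuristic has roughly this
shape (a simplified illustration of the idea, not the exact snappy.cc inner
loop):
#include <cstdint>
#include <cstddef>

// 'skip' starts at 32 and grows by one per missed hash lookup, so the stride
// between lookups (skip >> 5) increases by one after every 32 misses.
// Changing 32 and 5 to 8 and 3 makes the stride grow four times as fast,
// i.e. the compressor gives up on incompressible data much sooner.
size_t CountLookupsOnIncompressibleData(const char* input, size_t input_length) {
  const char* ip = input;
  const char* ip_end = input + input_length;
  uint32_t skip = 32;
  size_t lookups = 0;
  while (ip < ip_end) {
    ++lookups;  // stand-in for a hash-table lookup that never finds a match
    uint32_t bytes_between_hash_lookups = skip++ >> 5;
    ip += bytes_between_hash_lookups;
  }
  return lookups;
}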
I'm being told unaligned 64-bit loads and stores are only allowed through NEON.
You could take a look at the intrinsics there and see if they help.
Original comment by se...@google.com
on 1 Feb 2012 at 12:06
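For anyone who wants to experiment with that, a NEON-based 64-bit unaligned
load/store could look roughly like this (an untested sketch assuming
<arm_neon.h> and NEON support are available; these are not the snappy macros):
#include <stdint.h>
#include <arm_neon.h>

// vld1_u8/vst1_u8 perform 8-byte accesses with only byte alignment required,
// so they can stand in for the 64-bit unaligned cases.
inline uint64_t Neon_UNALIGNED_LOAD64(const void *p) {
  uint8x8_t v = vld1_u8(reinterpret_cast<const uint8_t *>(p));
  return vget_lane_u64(vreinterpret_u64_u8(v), 0);
}

inline void Neon_UNALIGNED_STORE64(void *p, uint64_t v) {
  vst1_u8(reinterpret_cast<uint8_t *>(p), vreinterpret_u8_u64(vcreate_u64(v)));
}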
A few updates on this: I got hold of a Tegra2-based machine, which is a
Cortex-A9 (without NEON, though). It's somewhat hard to benchmark on, since
the number of iterations is so low and the timer resolution is sometimes
pretty coarse, but it's ages better than running on a cell phone. :-)
I was able to verify that your patch gives a nice boost (20–30%). I'll see to
it that something similar to it is added to the distribution, but I'll have to
stare some at the assembler first to make sure the 64-bit version is indeed
optimal. I wasn't able to test 64-bit unaligned loads/stores, since I don't
have NEON, but your Cortex-A8 might?
Also, there are more things to do if we want to tune for ARM specifically. In
particular, the tricks for 64-bit machines are not a win on ARM; you can win
10% with the patch I've attached. However, 64-bit x86 is much more important
for us currently, so this might not happen unless we can find a clean way of
integrating it. I would appreciate it if you could try it on your own machine
and see if it helps any, though.
Finally, we're discussing reducing the maximum block size to decode to 2MB
instead of std::string::max_size(), both to fix issues like this and to close
potential DoS opportunities.
Original comment by se...@google.com
on 9 Feb 2012 at 1:52
Also, it seems you can win 5-15% decompression performance by removing the
first fast path in SnappyArrayWriter::AppendFromSelf (insert a "false &&"
before the test for len <= 16 etc.). Apparently the branch predictor is weak
enough that the cost of the branch outweighs the advantage of the fast path,
at least on the Cortex-A9. It's near-impossible to do this kind of tuning in a
stable way across multiple platforms, though.
Original comment by se...@google.com
on 10 Feb 2012 at 4:08
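For reference, the experiment described above amounts to something like the
following (a paraphrased sketch, not the actual SnappyArrayWriter::AppendFromSelf
code, whose fast-path condition has more clauses):
#include <cstring>
#include <cstddef>

// Copy len bytes from offset bytes back in the output buffer. The "false &&"
// disables the short-copy fast path so every copy takes the generic path,
// which lets you measure what the extra branch costs on a given CPU.
static void CopyFromSelf(char* op, size_t offset, size_t len) {
  const char* src = op - offset;
  if (false && len <= 16 && offset >= 8) {
    // Fast path (disabled here): two fixed-size 8-byte copies, which may
    // write a few bytes past len into the output slop area.
    std::memcpy(op, src, 8);
    std::memcpy(op + 8, src + 8, 8);
  } else {
    // Generic path: byte-by-byte copy, which also handles overlapping
    // ranges (offset < len) correctly.
    for (size_t i = 0; i < len; ++i) op[i] = src[i];
  }
}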
The crash issue was fixed in r58.
Original comment by se...@google.com
on 11 Feb 2012 at 10:12
r59 adds support for unaligned reads and writes on ARMv7 and higher, which I
believe should fix the worst of the performance disparities. I'm sure there is
more tuning that we can do, but I consider this bug fixed, so closing.
Original comment by se...@google.com
on 21 Feb 2012 at 5:04
...and r60 removes the 64-bit load optimization during compression for ARM and
other non-x64 platforms.
Original comment by sgunder...@bigfoot.com
on 23 Feb 2012 at 5:01
Original issue reported on code.google.com by
caprifin...@gmail.com
on 30 Jan 2012 at 9:14