krfkeith / snappy

Automatically exported from code.google.com/p/snappy

Unit tests do not pass on some ARM machines #59

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Performance on ARMv7 appears to be much slower than LZO or LZ4 on the same 
platform, and the self-test fails.

james@omap:~/snappy-1.0.4$ uname -a
Linux omap 3.1.0-psp3 #1 SMP Fri Dec 23 10:44:55 UTC 2011 armv7l armv7l armv7l 
GNU/Linux

james@omap:~/snappy-1.0.4$ cat /proc/cpuinfo
Processor       : ARMv7 Processor rev 2 (v7l)
processor       : 0
BogoMIPS        : 718.02

Features        : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls 
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x3
CPU part        : 0xc08
CPU revision    : 2

Hardware        : am335xevm
Revision        : 0000
Serial          : 0000000000000000

james@omap:~/snappy-1.0.4$ make check
make  check-TESTS
make[1]: Entering directory `/home/james/snappy-1.0.4'
Running microbenchmarks.
WARNING: Compiled with assertions enabled, will be slow.
Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0            3961260    4000000        100 24.4MB/s  html
BM_UFlat/1           38568150   38600000        100 17.3MB/s  urls
BM_UFlat/2             156067     150627       1195 803.8MB/s  jpg
BM_UFlat/3            1301098    1307189        153 68.8MB/s  pdf
BM_UFlat/4           17074720   17100000        100 22.8MB/s  html4
BM_UFlat/5            1434086    1449275        138 16.2MB/s  cp
BM_UFlat/6             728445     729927        274 14.6MB/s  c
BM_UFlat/7             219116     220507        907 16.1MB/s  lsp
BM_UFlat/8           77818400   77800000        100 12.6MB/s  xls
BM_UFlat/9           13947660   14000000        100 10.4MB/s  txt1
BM_UFlat/10          11871930   11900000        100 10.0MB/s  txt2
BM_UFlat/11          38183210   38200000        100 10.7MB/s  txt3
BM_UFlat/12          51673350   51600000        100 8.9MB/s  txt4
BM_UFlat/13          19561530   19600000        100 25.0MB/s  bin
BM_UFlat/14           2444890    2500000        100 14.6MB/s  sum
BM_UFlat/15            287548     289435        691 13.9MB/s  man
BM_UFlat/16           3912420    3900000        100 29.0MB/s  pb
BM_UFlat/17          14461740   14400000        100 12.2MB/s  gaviota
BM_UValidate/0         588306     589970        339 165.5MB/s  html
BM_UValidate/1        6236550    6200000        100 108.0MB/s  urls
BM_UValidate/2           3098       3102      54794 38.1GB/s  jpg
BM_UValidate/3         211292     212314        942 423.7MB/s  pdf
BM_UValidate/4        2383650    2400000        100 162.8MB/s  html4
BM_ZFlat/0            9082470    9100000        100 10.7MB/s  html (23.57 %)
BM_ZFlat/1          118292900  118200000        100 5.7MB/s  urls (50.89 %)
BM_ZFlat/2            2067640    2100000        100 57.7MB/s  jpg (99.88 %)
BM_ZFlat/3            4835760    4900000        100 18.4MB/s  pdf (82.13 %)
BM_ZFlat/4           37912630   37700000        100 10.4MB/s  html4 (23.55 %)
BM_ZFlat/5            3893540    3900000        100 6.0MB/s  cp (48.12 %)
BM_ZFlat/6            1592216    1600000        125 6.6MB/s  c (42.40 %)
BM_ZFlat/7             600933     606060        330 5.9MB/s  lsp (48.37 %)
BM_ZFlat/8          144899840  144800000        100 6.8MB/s  xls (41.34 %)
BM_ZFlat/9           28264910   28300000        100 5.1MB/s  txt1 (59.81 %)
BM_ZFlat/10          24717700   24700000        100 4.8MB/s  txt2 (64.07 %)
BM_ZFlat/11          77832700   77800000        100 5.2MB/s  txt3 (57.11 %)
BM_ZFlat/12         101706760  101700000        100 4.5MB/s  txt4 (68.35 %)
BM_ZFlat/13          39321220   39300000        100 12.5MB/s  bin (18.21 %)
BM_ZFlat/14           6398750    6400000        100 5.7MB/s  sum (51.88 %)
BM_ZFlat/15            804576     806451        248 5.0MB/s  man (59.36 %)
BM_ZFlat/16          10130580   10100000        100 11.2MB/s  pb (23.15 %)
BM_ZFlat/17          22898190   22900000        100 7.7MB/s  gaviota (38.27 %)

Running correctness tests.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/bin/bash: line 5: 10238 Aborted                 ${dir}$tst
FAIL: snappy_unittest
==================
1 of 1 test failed
==================
make[1]: *** [check-TESTS] Error 1
make[1]: Leaving directory `/home/james/snappy-1.0.4'
make: *** [check-am] Error 2

Original issue reported on code.google.com by caprifin...@gmail.com on 30 Jan 2012 at 9:14

GoogleCodeExporter commented 9 years ago
Hi,

“It crashes” alone isn't very useful; could you get a backtrace? See e.g. 
http://wiki.debian.org/HowToGetABacktrace . Snappy has run fine on ARM-based 
systems in the past; we might have broken something, of course.

Also, it's meaningless to compare speed until the code actually passes the 
tests (not to mention that you're compiling with assertions, cf. “WARNING: 
Compiled with assertions enabled, will be slow.”). That, and you haven't 
given any speed numbers for LZO or LZ4 on your machine :-)

Original comment by sgunder...@bigfoot.com on 30 Jan 2012 at 10:24

GoogleCodeExporter commented 9 years ago

Original comment by se...@google.com on 30 Jan 2012 at 10:24

GoogleCodeExporter commented 9 years ago
Hi,

The tests passed on a Cortex-A9, so this is something specific to your system. 
You'll definitely need to provide a backtrace, as well as your system 
specifications (hardware and software).

Original comment by se...@google.com on 30 Jan 2012 at 8:17

GoogleCodeExporter commented 9 years ago
The backtrace was useless because of stack corruption. However, I was able to 
manually narrow the failure down to this code in snappy_unittest.cc:

    dest[0] = dest[1] = dest[2] = 0xff;
    dest[3] = 0x7f;
    CHECK(!IsValidCompressedBuffer(TypeParam(dest)));
    CHECK(!Uncompress(TypeParam(dest), &uncmp)); // crash happens here

I'm running the test on a BeagleBone SBC using a TI AM3358 ARM Cortex-A8-based 
processor (ARMv7) running Ubuntu 11.10 (ARM EABI distribution).

Let me know if you need any more details.

James

Original comment by caprifin...@gmail.com on 30 Jan 2012 at 10:07

GoogleCodeExporter commented 9 years ago
This particular test happens to require 256MB of RAM, provided 
std::string::max_size() is not lower. How much RAM/swap do you have?
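
For reference, those four bytes are Snappy's preamble: the claimed uncompressed 
length, encoded as a little-endian varint. A standalone sketch (not library 
code) of the decode shows where the 256MB comes from:

#include <cstdio>
#include <stdint.h>

// Decode the preamble bytes the failing test writes: ff ff ff 7f.
// Each byte contributes 7 payload bits; the high bit means "continue".
int main() {
  const uint8_t dest[4] = {0xff, 0xff, 0xff, 0x7f};
  uint32_t len = 0;
  for (int i = 0; i < 4; ++i)
    len |= uint32_t(dest[i] & 0x7f) << (7 * i);
  printf("%u\n", len);  // 268435455 = 2^28 - 1, i.e. just under 256MB
  return 0;
}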

Original comment by se...@google.com on 30 Jan 2012 at 11:19

GoogleCodeExporter commented 9 years ago
Aha, that's probably it.  This system only has 256MB total RAM.

When I comment out that particular test, everything runs fine.  It would be 
nice if "make check" or snappy_unittest would automatically show a comparison 
table between Snappy and the other algorithms detected at configure time.

However, the speed still seems slow.  I'm running my own benchmark to simulate 
VPN compression (IPv4 packets archived from web browsing sessions), and the 
performance still lags considerably behind LZO and LZ4.  The packets are < 
1400 bytes and many are already compressed.  It would be nice if snappy could 
quickly ascertain whether or not a packet is compressed before spending too 
many cycles on it.

Performance of the same test on x86 64-bit is good, with Snappy leading the 
pack.

James

Original comment by caprifin...@gmail.com on 31 Jan 2012 at 12:49

GoogleCodeExporter commented 9 years ago
If you install the Google Flags package, you can do something like 
./snappy_unittest --zlib --lzo --run_microbenchmarks=false benchmarkfile1 
benchmarkfile2 etc., and it will compare for you (no nice table, though).

You are still running with assertions on. Add -DNDEBUG to your CFLAGS and 
CXXFLAGS and you should see some speed increase. x86 is still markedly faster 
than ARM, though.

Snappy already detects incompressible data and skips over it, but for 
1400-byte packets the skipping will probably not kick in before the end of the 
packet.

Original comment by se...@google.com on 31 Jan 2012 at 9:20

GoogleCodeExporter commented 9 years ago
It would be very useful if there were a parameter that could be passed to the 
compression algorithm (call it punt_length) telling the compressor to examine 
no more than punt_length bytes of input before deciding whether the input is 
already compressed.  If it is, it would return an error code rather than 
continuing to transfer data from input to output.

This would allow apps to calibrate the tradeoff between compressing as much as 
possible and spending as few cycles as possible to identify already compressed 
data.

BTW, the relative slowness I observed in comment 6 was after building with 
-DNDEBUG.

James

Original comment by caprifin...@gmail.com on 31 Jan 2012 at 7:10

GoogleCodeExporter commented 9 years ago
Re punt_length: the algorithm doesn't really work that way; it's adaptive to 
the input. There's no single fixed point at which the compressor decides that 
the data is incompressible.

As for general performance, no special performance tuning has been done for 
Snappy on ARM; if you want to break out the profiler and take a look at what 
can be done, you're more than welcome, but it's not a priority for us currently.

Original comment by sgunder...@bigfoot.com on 31 Jan 2012 at 7:14

GoogleCodeExporter commented 9 years ago
By the way, if you have a Cortex-A8, you should have native unaligned accesses. 
You could try modifying this line of snappy-stubs-internal.h:

  #if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__)

to read something like

  #if 1

and see if it helps anything.
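
A less blunt variant would be an ARM-specific guard along these lines (whether 
your compiler predefines __ARM_ARCH_7A__ for your target is an assumption 
worth checking, e.g. with gcc -dM -E):

  #if defined(__i386__) || defined(__x86_64__) || defined(__powerpc__) || \
      defined(__ARM_ARCH_7A__)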

Original comment by se...@google.com on 31 Jan 2012 at 8:10

GoogleCodeExporter commented 9 years ago
The Cortex-A8 appears to allow unaligned accesses for 16- and 32-bit words, 
but not for 64-bit.

When I modify the load/store code as follows, I get very competitive results 
on the benchmarks:

// 16- and 32-bit unaligned accesses are done directly, since the
// Cortex-A8 supports them in hardware.
#define UNALIGNED_LOAD16(_p) (*reinterpret_cast<const uint16 *>(_p))
#define UNALIGNED_LOAD32(_p) (*reinterpret_cast<const uint32 *>(_p))

#define UNALIGNED_STORE16(_p, _val) (*reinterpret_cast<uint16 *>(_p) = (_val))
#define UNALIGNED_STORE32(_p, _val) (*reinterpret_cast<uint32 *>(_p) = (_val))

// 64-bit accesses are split into two 32-bit halves, since unaligned
// 64-bit loads/stores are not supported.
inline uint64 UNALIGNED_LOAD64(const void *p) {
  uint64 t;
  reinterpret_cast<uint32 *>(&t)[0] = reinterpret_cast<const uint32 *>(p)[0];
  reinterpret_cast<uint32 *>(&t)[1] = reinterpret_cast<const uint32 *>(p)[1];
  return t;
}

inline void UNALIGNED_STORE64(void *p, uint64 v) {
  reinterpret_cast<uint32 *>(p)[0] = reinterpret_cast<const uint32 *>(&v)[0];
  reinterpret_cast<uint32 *>(p)[1] = reinterpret_cast<const uint32 *>(&v)[1];
}

James

Original comment by caprifin...@gmail.com on 31 Jan 2012 at 11:19

GoogleCodeExporter commented 9 years ago
Oh sigh. I'll have to talk to my local ARM expert and try to figure out what's 
going on with the 64-bit loads/stores.

We're also discussing a mitigation of the 256MB issue. It's a bit unfortunate 
these two got munged together in the same bug report, but I guess we'll manage.

Original comment by se...@google.com on 31 Jan 2012 at 11:24

GoogleCodeExporter commented 9 years ago
Re: punt_length, what I'm getting at is having a feature where the running 
compression ratio would be evaluated at punt_length and the compression would 
be aborted if the ratio at that point is not below a given threshold.
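
Something like this caller-side approximation of the idea, using only the 
public snappy::Compress API (MaybeCompress, punt_length, and max_ratio are 
made-up names for illustration; nothing like this exists in the library):

#include <string>
#include "snappy.h"

// Compress only the first punt_length bytes; if that probe does not
// shrink below max_ratio, assume the packet is already compressed and
// punt instead of compressing the whole thing.
bool MaybeCompress(const char* input, size_t length, size_t punt_length,
                   double max_ratio, std::string* output) {
  const size_t probe_len = length < punt_length ? length : punt_length;
  std::string probe;
  snappy::Compress(input, probe_len, &probe);
  if (probe.size() >= probe_len * max_ratio)
    return false;  // prefix looks incompressible; send the packet raw
  snappy::Compress(input, length, output);
  return true;
}

The obvious cost is that the prefix gets compressed twice; a hook inside the 
compressor itself would avoid that.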

James

Original comment by caprifin...@gmail.com on 31 Jan 2012 at 11:29

GoogleCodeExporter commented 9 years ago
That's pretty tricky to get right, though. All matches (which are what give 
you compression) naturally point backwards, so the first part of any typical 
packet is bound to be the least compressible. If you could implement this and 
show that it gives a real improvement, I guess we could discuss it, but 
remember that any check, especially one in the inner loop like this, takes a 
non-zero amount of time itself.

You could always try adjusting the constants for the heuristic match skipping 
in snappy.cc; change the 32 and 5 to, e.g., 8 and 3 (the first is typically 
set to 2^(the second)), and see if it helps. This will make the skipping 
significantly more aggressive.
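
To make the knobs concrete, here is a standalone paraphrase of the skipping 
schedule (not a verbatim copy of snappy.cc):

#include <cstdio>
#include <stdint.h>

// Each failed hash probe advances the input by (skip++ >> 5) bytes, so
// the first 32 misses step 1 byte, the next 32 step 2 bytes, and so on.
// Lowering the constants (e.g. to 8 and 3) makes the step grow sooner.
int main() {
  uint32_t skip = 32;  // the "32"
  uint32_t pos = 0;
  for (int probe = 0; probe < 96; ++probe)
    pos += skip++ >> 5;  // the "5" (32 == 2^5)
  printf("96 missed probes scanned %u bytes\n", pos);
  return 0;
}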

Original comment by se...@google.com on 31 Jan 2012 at 11:36

GoogleCodeExporter commented 9 years ago
I'm being told unaligned 64-bit loads and stores are only allowed through NEON. 
You could take a look at the intrinsics there and see if they help.
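
For example, something along these lines with the standard arm_neon.h 
intrinsics; consider it an untested sketch:

#include <arm_neon.h>

// VLD1/VST1 tolerate unaligned addresses, so an 8-byte access can go
// through a NEON D register instead of two 32-bit halves.
inline uint64 UNALIGNED_LOAD64(const void *p) {
  return vget_lane_u64(
      vreinterpret_u64_u8(vld1_u8(reinterpret_cast<const uint8_t *>(p))), 0);
}

inline void UNALIGNED_STORE64(void *p, uint64 v) {
  vst1_u8(reinterpret_cast<uint8_t *>(p), vreinterpret_u8_u64(vcreate_u64(v)));
}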

Original comment by se...@google.com on 1 Feb 2012 at 12:06

GoogleCodeExporter commented 9 years ago
A few updates on this: I got hold of a Tegra2-based machine, which is a 
Cortex-A9 (without NEON, though). It's somewhat hard to benchmark on, since 
the number of iterations is so small and the timer resolution is sometimes 
pretty low, but it's ages better than running on a cell phone. :-)

I was able to verify that your patch gives a nice boost (20–30%). I'll see to 
it that something similar is added to the distribution, but I'll have to 
stare at the assembler for a while first to make sure the 64-bit version is 
indeed optimal. I wasn't able to test 64-bit unaligned loads/stores, since I 
don't have NEON, but your Cortex-A8 might support them?

Also, there are more things to do if we want to tune specifically for ARM. In 
particular, the tricks for 64-bit machines are not a win on ARM; you can gain 
10% with the patch I've attached. However, 64-bit x86 is much more important 
for us currently, so this might not happen unless we can find a clean way of 
integrating it. I would appreciate it if you could try it on your own machine 
and see if it helps, though.

Finally, we're discussing reducing the maximum block size to decode to 2MB 
instead of std::string::max_size(), to fix issues like this one and to close 
potential DoS opportunities.

Original comment by se...@google.com on 9 Feb 2012 at 1:52


GoogleCodeExporter commented 9 years ago
Also, it seems you can win 5-15% decompression performance by removing the 
first fast path in SnappyArrayWriter::AppendFromSelf (insert a "false &&" 
before the test for len <= 16 etc.). Apparently the branch predictor is weak 
enough that the cost of the branch outweighs the advantages of the fast path, 
at least on Cortex-A9. It's near-impossible to do this kind of tuning stably 
across multiple platforms, though.

Original comment by se...@google.com on 10 Feb 2012 at 4:08

GoogleCodeExporter commented 9 years ago
The crash issue was fixed in r58.

Original comment by se...@google.com on 11 Feb 2012 at 10:12

GoogleCodeExporter commented 9 years ago
r59 adds support for unaligned reads and writes on ARMv7 and higher, which I 
believe should fix the worst of the performance disparities. I'm sure there is 
more tuning that we can do, but I consider this bug fixed, so closing.

Original comment by se...@google.com on 21 Feb 2012 at 5:04

GoogleCodeExporter commented 9 years ago
...and r60 removes the 64-bit load optimization during compression for ARM and 
other non-x64 platforms.

Original comment by sgunder...@bigfoot.com on 23 Feb 2012 at 5:01