markcox / snappy

Automatically exported from code.google.com/p/snappy

Faster unaligned access for strict alignment archs #61

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The load/store functions for processors with strict alignment requirements are a
bit slow with the memcpy-based implementation.

Use packed structs instead and let gcc generate the code. x86/amd64
(confirmed)/ARM/PowerPC should be unaffected by this patch, but sparc64 and
ARMv5 on qemu are positively affected.

Unfortunately, gcc currently generates byte loads on ARMv7; otherwise this
technique could also be used there.
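The packed-struct technique described above can be sketched roughly as follows (illustrative only; the names below are not taken from the patch, and the gcc-specific attribute is assumed):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the packed-struct trick. Because the struct is declared packed,
// gcc assumes it may sit at any address and emits the best unaligned-access
// sequence the target supports, instead of calling memcpy or faulting on
// strict-alignment archs such as SPARC.
struct Unaligned32 {
  uint32_t value;
} __attribute__((packed));

inline uint32_t UnalignedLoad32(const void* p) {
  return reinterpret_cast<const Unaligned32*>(p)->value;
}

inline void UnalignedStore32(void* p, uint32_t v) {
  reinterpret_cast<Unaligned32*>(p)->value = v;
}
```

The point is that the alignment information lives in the type, so the compiler can pick the access sequence per target rather than relying on a memcpy call being recognized and folded.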

Microbench for a Sun Blade 100:
Before:

Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0           29974560    7405480        100 13.2MB/s  html
BM_UFlat/1          130382570   63129410        100 10.6MB/s  urls
BM_UFlat/2            2405120    1125410        100 107.6MB/s  jpg
BM_UFlat/3            5668360    2906470        100 31.0MB/s  pdf
BM_UFlat/4           58851840   29352660        100 13.3MB/s  html4
BM_UFlat/5            4686650    2501920        100 9.4MB/s  cp
BM_UFlat/6            2650380    1255770        100 8.5MB/s  c
BM_UFlat/7             748865     376443        223 9.4MB/s  lsp
BM_UFlat/8          268912400  129768490        100 7.6MB/s  xls
BM_UFlat/9           47438510   23831620        100 6.1MB/s  txt1
BM_UFlat/10          41493250   20543940        100 5.8MB/s  txt2
BM_UFlat/11         127872670   62720320        100 6.5MB/s  txt3
BM_UFlat/12         170874680   85853740        100 5.4MB/s  txt4
BM_UFlat/13          71448340   35158120        100 13.9MB/s  bin
BM_UFlat/14           8907600    4547880        100 8.0MB/s  sum
BM_UFlat/15            928397     501960        176 8.0MB/s  man
BM_UFlat/16           7590250    7513280        100 15.1MB/s  pb
BM_UFlat/17          25014980   24606870        100 7.1MB/s  gaviota
BM_UValidate/0         508184     506605        395 192.8MB/s  html
BM_UValidate/1        5646340    5613680        100 119.3MB/s  urls
BM_UValidate/2           3432       3422      49627 34.6GB/s  jpg
BM_UValidate/3         180518     180020       1100 499.7MB/s  pdf
BM_UValidate/4        2079160    2071610        100 188.6MB/s  html4
BM_ZFlat/0            6327910    6287640        100 15.5MB/s  html (23.49 %)
BM_ZFlat/1           83333510   82607690        100 8.1MB/s  urls (50.94 %)
BM_ZFlat/2            6057310    2979400        100 40.6MB/s  jpg (99.88 %)
BM_ZFlat/3            8111860    4088380        100 22.0MB/s  pdf (82.25 %)
BM_ZFlat/4           58168440   26531350        100 14.7MB/s  html4 (23.50 %)
BM_ZFlat/5            5900030    2908730        100 8.1MB/s  cp (48.10 %)
BM_ZFlat/6            2691980    1219010        100 8.7MB/s  c (42.45 %)
BM_ZFlat/7             810950     435750        100 8.1MB/s  lsp (48.59 %)
BM_ZFlat/8          261333010  126124680        100 7.8MB/s  xls (41.31 %)
BM_ZFlat/9           44381120   22191510        100 6.5MB/s  txt1 (59.76 %)
BM_ZFlat/10          40353150   19863210        100 6.0MB/s  txt2 (63.94 %)
BM_ZFlat/11         121624440   59279880        100 6.9MB/s  txt3 (57.21 %)
BM_ZFlat/12         162955320   80705680        100 5.7MB/s  txt4 (68.46 %)
BM_ZFlat/13          51654320   26552440        100 18.4MB/s  bin (18.19 %)
BM_ZFlat/14           9379100    4938480        100 7.4MB/s  sum (51.78 %)
BM_ZFlat/15           1027735     590339        159 6.8MB/s  man (59.43 %)
BM_ZFlat/16          14286240    7306860        100 15.5MB/s  pb (23.10 %)
BM_ZFlat/17          33690660   16850410        100 10.4MB/s  gaviota (38.31 %)

After:

Benchmark            Time(ns)    CPU(ns) Iterations
---------------------------------------------------
BM_UFlat/0            3093820    1463410        100 66.7MB/s  html
BM_UFlat/1           33096120   16115770        100 41.5MB/s  urls
BM_UFlat/2            2190890    1033020        100 117.2MB/s  jpg
BM_UFlat/3            2026260    1037050        100 86.7MB/s  pdf
BM_UFlat/4           14377380    7036590        100 55.5MB/s  html4
BM_UFlat/5            1063421     511805        185 45.8MB/s  cp
BM_UFlat/6             539027     245449        398 43.3MB/s  c
BM_UFlat/7             134933      70589       2720 50.3MB/s  lsp
BM_UFlat/8           56581910   27959160        100 35.1MB/s  xls
BM_UFlat/9           10150110    5022480        100 28.9MB/s  txt1
BM_UFlat/10           8328730    4263500        100 28.0MB/s  txt2
BM_UFlat/11          28303810   14006570        100 29.1MB/s  txt3
BM_UFlat/12          38039520   18725900        100 24.5MB/s  txt4
BM_UFlat/13          17805460    8851930        100 55.3MB/s  bin
BM_UFlat/14           1815809     905445        110 40.3MB/s  sum
BM_UFlat/15            205228      93269        728 43.2MB/s  man
BM_UFlat/16           3076170    1527850        100 74.0MB/s  pb
BM_UFlat/17          10848980    5289110        100 33.2MB/s  gaviota
BM_UValidate/0         645698     328509        302 297.3MB/s  html
BM_UValidate/1        8239280    4146250        100 161.5MB/s  urls
BM_UValidate/2           5343       2628      58997 45.0GB/s  jpg
BM_UValidate/3         217162     110680       1816 812.8MB/s  pdf
BM_UValidate/4        2811950    1394230        100 280.2MB/s  html4
BM_ZFlat/0            7403610    3712110        100 26.3MB/s  html (23.49 %)
BM_ZFlat/1          102016450   50426390        100 13.3MB/s  urls (50.94 %)
BM_ZFlat/2            4941900    2459980        100 49.2MB/s  jpg (99.88 %)
BM_ZFlat/3            5831210    2893880        100 31.1MB/s  pdf (82.25 %)
BM_ZFlat/4           30810990   15417520        100 25.3MB/s  html4 (23.50 %)
BM_ZFlat/5            3309410    1640150        100 14.3MB/s  cp (48.10 %)
BM_ZFlat/6            1302582     687849        146 15.5MB/s  c (42.45 %)
BM_ZFlat/7             473785     235792        433 15.0MB/s  lsp (48.59 %)
BM_ZFlat/8          120794050   59226060        100 16.6MB/s  xls (41.31 %)
BM_ZFlat/9           25348700   12568850        100 11.5MB/s  txt1 (59.76 %)
BM_ZFlat/10          21384620   10774500        100 11.1MB/s  txt2 (63.94 %)
BM_ZFlat/11          67171150   33915460        100 12.0MB/s  txt3 (57.21 %)
BM_ZFlat/12          87047350   43940820        100 10.5MB/s  txt4 (68.46 %)
BM_ZFlat/13          32070570   15450910        100 31.7MB/s  bin (18.19 %)
BM_ZFlat/14           5306620    2697740        100 13.5MB/s  sum (51.78 %)
BM_ZFlat/15            645942     319014        278 12.6MB/s  man (59.43 %)
BM_ZFlat/16           8523590    4203500        100 26.9MB/s  pb (23.10 %)
BM_ZFlat/17          20531470    9949020        100 17.7MB/s  gaviota (38.31 %)

Original issue reported on code.google.com by skrab...@gmail.com on 19 Apr 2012 at 7:38

GoogleCodeExporter commented 9 years ago
Hi,

What compiler is this? Ideally the compiler should generate good code for a
small constant-size memcpy, but I'm fully aware this doesn't always happen.
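For reference, the memcpy idiom under discussion looks roughly like this (a sketch of the usual pattern, not the exact Snappy source); the hope is that the compiler folds the fixed-size memcpy into a single load:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// A fixed-size memcpy through a local temporary: well-defined for any
// alignment, and a good compiler reduces it to one load instruction on
// targets that support unaligned access.
inline uint32_t UnalignedLoad32(const void* p) {
  uint32_t t;
  memcpy(&t, p, sizeof(t));
  return t;
}
```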

On first thought, it seems like you're mashing two things together; the
struct trick is nice, but you're not using it for big-endian architectures at
all? This is confusing; I don't immediately see why byte-by-byte loads should
be faster on _all_ big-endian architectures (vs. an unaligned load plus a swap).

Note that while ARM performance is interesting to us, _simulated_ ARM 
performance (QEMU) is not.

Original comment by sgunder...@bigfoot.com on 20 Apr 2012 at 9:36

GoogleCodeExporter commented 9 years ago

Original comment by se...@google.com on 20 Apr 2012 at 9:42

GoogleCodeExporter commented 9 years ago
Hi,

For SPARC it is gcc 4.2.1, and the code generated for UNALIGNED_LOAD32 with
memcpy is four byte loads and four byte stores (that is the memcpy) followed by
a uint32 load. So the memcpy is optimized all right, but gcc doesn't understand
where the bytes are going. On ARMv5 with gcc 4.6.1, the call to memcpy is not
optimized at all.

The struct trick is used for loads/stores in native endian order, and
byte-by-byte access is used when little-endian order is needed. Again, gcc has
trouble optimizing the two-step load + swap.

Byte-by-byte loads are not faster on _all_ big-endian archs; they are not
faster on PowerPC, for example, which is excluded as before. But PowerPC is
unusual in that it can access unaligned little-endian data with a single
instruction.
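The byte-by-byte little-endian load mentioned here can be sketched as follows (illustrative; the function name is not from the patch). It is correct at any alignment and on any host byte order:

```cpp
#include <cassert>
#include <cstdint>

// Assemble a little-endian uint32 one byte at a time. Correct regardless of
// host endianness or pointer alignment; on most big-endian strict-alignment
// targets gcc must emit byte loads anyway, PowerPC's byte-reversed load
// (lwbrx) being the notable single-instruction exception.
inline uint32_t LittleEndianLoad32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) |
         (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}
```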

Note: if you have real pre-ARMv7 hardware, I think you will see a speedup as I
did in qemu, though not exactly the same amount.

Original comment by skrab...@gmail.com on 22 Apr 2012 at 7:02

GoogleCodeExporter commented 9 years ago
I don't think pre-ARMv7 is all that important for performance; perhaps we can
leave that one out? As I understand it, the struct trick is primarily for
SPARC, where you've already shown a real benefit. The other changes can go in
if we find a real, non-emulated platform where people need the performance.

Here's what I think needs to happen:

1. Make a patch with only the struct trick, conditional on GCC + little-endian.
2. Take a look at the Google C++ style guide
(http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml); in
particular, it (and thus Snappy) doesn't use spaces inside (), and prefers
static_cast<int>(x) to int(x). (Strange as I find the latter :-) ). I can clean
this up if you don't want to, but it's of course easier if the patch already
conforms to the style guide.
3. Please sign the Contributor License Agreement
(http://code.google.com/legal/individual-cla-v1.0.html); it's a prerequisite
for accepting non-trivial patches. It's a really simple procedure, though.

Original comment by se...@google.com on 24 Apr 2012 at 11:31

GoogleCodeExporter commented 9 years ago
Hi,

What's the status on this bug?

Original comment by se...@google.com on 22 May 2012 at 9:46

GoogleCodeExporter commented 9 years ago
Hi,

Given that we can't accept these patches as they are (due to missing CLA 
signature, as a primary blocker), and there is no response from the bug 
reporter, I'm closing it as wontfix. Feel free to reopen at a later stage (and 
if so, see my plan in comment 4).

Original comment by se...@google.com on 1 Jun 2012 at 11:23