trietptm / lz4

Automatically exported from code.google.com/p/lz4
0 stars 0 forks source link

Optimize LZ4_WILDCOPY #109

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Salut Yann,

On 32 bit platforms, LZ4_WILDCOPY may benefit from a bit of unrolling in the 
fast path.

if I replace its definition with the one below, I get a 6% speed boost for 
compression and 11% for decompression, when testing enwik8 in 32 bit mode on 
Windows 7, Core i5, gcc 4.8.1.

#if LZ4_ARCH64
#  define LZ4_WILDCOPY(d,s,e)     { do { LZ4_COPY8(d,s) } while (d<e); }        
   // at the end, d>=e;
#else
#  define LZ4_WILDCOPY(d,s,e)     { if (e-d <= 8) LZ4_COPY8(d,s) else do { 
LZ4_COPY8(d,s) } while (d<e); }
#endif

Original issue reported on code.google.com by valery.c...@gmail.com on 19 Jan 2014 at 2:02

GoogleCodeExporter commented 9 years ago
Salut Valery ! 

Ca faisait longtemps ! C'est donc bien toi lzv : 
http://mattmahoney.net/dc/text.html#4884

(continuing in english)
Thanks for the tip.
I'll definately look into it.
To be fair, it's a bit surprising, because the speed of this loop was believed 
to depend on NOT doing any test before copying byte (kind of "speculative 
write" which on average works well). Your result seems to change that 
assumption, so that sure is interesting to look at.

A bientôt

Original comment by yann.col...@gmail.com on 19 Jan 2014 at 2:16

GoogleCodeExporter commented 9 years ago
I made a few preliminary tests using your proposed modified macro, using Visual 
Studio 2012 compiler.

The results are not good so far, with compression speed barely affected (below 
1% so equivalent to measurement noise), but a more noticeable decompression 
speed decrease (from 1800 MB/s to 1680 MB/s on average) across a range a sample 
files, including enwik8.

The difference could be attributed to using a different compiler. So tests 
remained to be done with GCC. Nonetheless, it shows such kind of optimization 
tend to be system specific, and should therefore be handled with care.

Original comment by yann.col...@gmail.com on 19 Jan 2014 at 3:22

GoogleCodeExporter commented 9 years ago
I've committed an oversight: "if (e-d <= 8)" should be if "(likely(e-d <= 8))".
This way, I get 7% speed on compression and 42% on decompression for enwik8.

To understand why this works, let's look at the assembly generated by gcc.

void lz4_wildcopy_1(BYTE *d, const BYTE *s, const BYTE *e)
{
    do { LZ4_COPY8(d,s) } while (d<e);
}

void lz4_wildcopy_2(BYTE *d, const BYTE *s, const BYTE *e)
{
    if (likely(e-d <= 8)) LZ4_COPY8(d,s) else
    do { LZ4_COPY8(d,s) } while (d<e);
}

Compile with: gcc -O3 -S lz4.c.

Code for lz4_wildcopy_1 looks like this:
    movl    20(%esp), %eax
    movl    24(%esp), %esi
    movl    %eax, %edi
    movl    %esi, %edx
    addl    $4, %edi
    leal    4(%esi), %ebp
    .p2align 4,,7
L3:
    movl    (%edx), %ecx
    movl    %ecx, (%eax)
    movl    %edx, %ecx
    addl    $8, %edx
    subl    %esi, %ecx
    movl    0(%ebp,%ecx), %ebx
    movl    %eax, %ecx
    addl    $8, %eax
    subl    20(%esp), %ecx
    cmpl    %eax, 28(%esp)
    movl    %ebx, (%edi,%ecx)
    ja  L3

The L3 loop is the copying loop, which is always run at least once.

Compare with the code for lz4_wildcopy_2:
    movl    28(%esp), %eax
    subl    20(%esp), %eax
    movl    24(%esp), %esi
    cmpl    $8, %eax
    jle L11
    # same code as _lz4_wildcopy_1, skipped
L11:
    .cfi_restore_state
    movl    20(%esp), %edi
    movl    (%esi), %eax
    movl    %eax, (%edi)
    movl    4(%esi), %eax
    movl    %eax, 4(%edi)

The first part uses 5 instructions to load parameters, compute e - d, and jump 
to the fast path (L11) if appropriate. Otherwise, it runs the same code as 
lz4_wildcopy_1 (skipped). The fast path itself is reduced to just 4 
instructions that copy data.
The initial test is cheap since it can be predicted (for enwik8 at least), and 
the reward is huge because we avoid all the cruft of the general case.

I don't know why this doesn't work with MSVC. Clang too fails to optimize to 
the same level as gcc.

Original comment by valery.c...@gmail.com on 19 Jan 2014 at 4:25

GoogleCodeExporter commented 9 years ago
Thanks for detailed explanations.

I did test your variant with GCC / MinGW 32-bits on Windows, and it indeed 
shows a speed improvement, especially on the decompression side, which improved 
from 1080 to 1530 MB/s. likely statement pushed it even further to 1620. It's a 
very large improvement.

I shall test again using a Linux system, I'm guessing I'll find equivalent 
results.

If this improvement is targeted at GCC 32-bits, probably the #if selector 
should test this specific configuration.

I'm guessing the Visual version doesn't benefit from your macro improvement 
probably because it already delivers better speed : at 1800 MB/s, it's still 
either than GCC with your improvement.

Regards

Original comment by yann.col...@gmail.com on 19 Jan 2014 at 9:30

GoogleCodeExporter commented 9 years ago

Original comment by yann.col...@gmail.com on 19 Jan 2014 at 9:30

GoogleCodeExporter commented 9 years ago

Original comment by yann.col...@gmail.com on 19 Jan 2014 at 9:30

GoogleCodeExporter commented 9 years ago
On my laptop, LZ4/GCC runs at about the same speed as yours, but LZ4/MSVC never 
reaches 1000 MB/s.

I compile LZ4 as C++ with Visual Studio 2013 Express and all optimizations 
turned on, including global optimization and the like. Are there any other 
trick to use to reach such a speed ?
Maybe this just depends on the architecture...

Original comment by valery.c...@gmail.com on 20 Jan 2014 at 8:29

GoogleCodeExporter commented 9 years ago
Strange.
I'm using Visual Studio 2012 currently, and compile using C (not C++); I also 
remember having similar results with Visual 2010 previously. 
Never tested Studio 2013 though...

Original comment by yann.col...@gmail.com on 20 Jan 2014 at 8:31

GoogleCodeExporter commented 9 years ago
The following attached file is a release candidate version implementing your 
suggested improvement.

Original comment by yann.col...@gmail.com on 23 Jan 2014 at 9:33

Attachments:

GoogleCodeExporter commented 9 years ago
Looks good to me.

This speeds up 32-bit GCC as expected.

Original comment by valery.c...@gmail.com on 23 Jan 2014 at 10:04

GoogleCodeExporter commented 9 years ago
Yes quite an impressive gain for this configuration

Original comment by yann.col...@gmail.com on 23 Jan 2014 at 10:05

GoogleCodeExporter commented 9 years ago
The proposed improvement has been integrated into r113

Original comment by yann.col...@gmail.com on 4 Feb 2014 at 2:13