Hi Valery!
It's been a long time! So that really is you, lzv:
http://mattmahoney.net/dc/text.html#4884
(continuing in English)
Thanks for the tip.
I'll definitely look into it.
To be fair, it's a bit surprising, because the speed of this loop was believed
to depend on NOT doing any test before copying bytes (a kind of "speculative
write" which works well on average). Your result seems to challenge that
assumption, so it is certainly interesting to look at.
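For context, the loop in question is LZ4's "wildcopy": it copies 8 bytes at a time with no per-iteration length test, deliberately overrunning the exact end of the requested range. A minimal standalone sketch of the idea (the memcpy-based macro here is a simplification of LZ4's actual copy macro, and it assumes the destination buffer reserves slack past e):

```c
#include <string.h>

typedef unsigned char BYTE;

/* Simplified stand-in for LZ4's copy macro: copy 8 bytes at once,
   then advance both pointers. The write may land past e; LZ4 keeps
   enough slack at the end of its buffers to make that safe. */
#define LZ4_COPY8(d, s) { memcpy(d, s, 8); d += 8; s += 8; }

static void lz4_wildcopy(BYTE *d, const BYTE *s, const BYTE *e)
{
    /* No test before the first copy ("speculative write"):
       at least 8 bytes are always copied. */
    do { LZ4_COPY8(d, s) } while (d < e);
}
```

Because the loop only checks d < e after each 8-byte chunk, a request for, say, 10 bytes actually copies 16; correctness relies on the caller guaranteeing that the over-write area exists.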
See you soon
Original comment by yann.col...@gmail.com
on 19 Jan 2014 at 2:16
I made a few preliminary tests using your proposed modified macro, with the Visual
Studio 2012 compiler.
The results are not good so far: compression speed is barely affected (below
1%, so equivalent to measurement noise), but there is a more noticeable decompression
speed decrease (from 1800 MB/s to 1680 MB/s on average) across a range of sample
files, including enwik8.
The difference could be attributed to using a different compiler, so tests
remain to be done with GCC. Nonetheless, it shows that this kind of optimization
tends to be system-specific, and should therefore be handled with care.
Original comment by yann.col...@gmail.com
on 19 Jan 2014 at 3:22
I made an oversight: "if (e-d <= 8)" should be "if (likely(e-d <= 8))".
This way, I get a 7% speed improvement on compression and 42% on decompression for enwik8.
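For reference, likely() is not standard C; it is conventionally built on GCC's __builtin_expect, which tells the optimizer which branch to lay out as the fall-through path. A typical definition (LZ4's own internal macro may differ in naming):

```c
/* Branch-prediction hints: tell GCC the expected truth value of x.
   On compilers without __builtin_expect, degrade to plain tests. */
#if defined(__GNUC__)
#  define likely(x)   __builtin_expect(!!(x), 1)
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define likely(x)   (x)
#  define unlikely(x) (x)
#endif
```

The hint changes code layout only, not semantics: likely(e-d <= 8) evaluates to exactly the same value as e-d <= 8.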
To understand why this works, let's look at the assembly generated by gcc.
void lz4_wildcopy_1(BYTE *d, const BYTE *s, const BYTE *e)
{
    do { LZ4_COPY8(d,s) } while (d<e);
}

void lz4_wildcopy_2(BYTE *d, const BYTE *s, const BYTE *e)
{
    if (likely(e-d <= 8)) LZ4_COPY8(d,s)
    else do { LZ4_COPY8(d,s) } while (d<e);
}
Compile with: gcc -O3 -S lz4.c.
Code for lz4_wildcopy_1 looks like this:
movl 20(%esp), %eax
movl 24(%esp), %esi
movl %eax, %edi
movl %esi, %edx
addl $4, %edi
leal 4(%esi), %ebp
.p2align 4,,7
L3:
movl (%edx), %ecx
movl %ecx, (%eax)
movl %edx, %ecx
addl $8, %edx
subl %esi, %ecx
movl 0(%ebp,%ecx), %ebx
movl %eax, %ecx
addl $8, %eax
subl 20(%esp), %ecx
cmpl %eax, 28(%esp)
movl %ebx, (%edi,%ecx)
ja L3
The L3 loop is the copying loop, which is always run at least once.
Compare with the code for lz4_wildcopy_2:
movl 28(%esp), %eax
subl 20(%esp), %eax
movl 24(%esp), %esi
cmpl $8, %eax
jle L11
# same code as _lz4_wildcopy_1, skipped
L11:
.cfi_restore_state
movl 20(%esp), %edi
movl (%esi), %eax
movl %eax, (%edi)
movl 4(%esi), %eax
movl %eax, 4(%edi)
The first part uses 5 instructions to load parameters, compute e - d, and jump
to the fast path (L11) if appropriate. Otherwise, it runs the same code as
lz4_wildcopy_1 (skipped). The fast path itself is reduced to just 4
instructions that copy data.
The initial test is cheap since it can be predicted (for enwik8 at least), and
the reward is huge because we avoid all the cruft of the general case.
I don't know why this doesn't work with MSVC. Clang too fails to optimize to
the same level as gcc.
Original comment by valery.c...@gmail.com
on 19 Jan 2014 at 4:25
Thanks for the detailed explanations.
I did test your variant with GCC / MinGW 32-bit on Windows, and it indeed
shows a speed improvement, especially on the decompression side, which improved
from 1080 to 1530 MB/s. The likely statement pushed it even further, to 1620. It's a
very large improvement.
I shall test again using a Linux system, I'm guessing I'll find equivalent
results.
If this improvement is specific to 32-bit GCC, the #if selector should
probably test for that particular configuration.
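Such a selector could be sketched as follows (the macro name is hypothetical; the idea is simply to restrict the early-exit copy path to GCC builds targeting a 32-bit address space, where the gain was measured):

```c
#include <stdint.h>

/* Hypothetical configuration guard: enable the two-path wildcopy
   only for GCC on a 32-bit target (pointer width check). */
#if defined(__GNUC__) && (UINTPTR_MAX == 0xFFFFFFFFu)
#  define LZ4_WILDCOPY_SHORTCUT 1
#else
#  define LZ4_WILDCOPY_SHORTCUT 0
#endif
```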
I'm guessing the Visual version doesn't benefit from your macro improvement
because it already delivers better speed: at 1800 MB/s, it's still
faster than GCC even with your improvement.
Regards
Original comment by yann.col...@gmail.com
on 19 Jan 2014 at 9:30
On my laptop, LZ4/GCC runs at about the same speed as yours, but LZ4/MSVC never
reaches 1000 MB/s.
I compile LZ4 as C++ with Visual Studio 2013 Express and all optimizations
turned on, including global optimization and the like. Are there any other
tricks needed to reach such a speed?
Maybe this just depends on the architecture...
Original comment by valery.c...@gmail.com
on 20 Jan 2014 at 8:29
Strange.
I'm using Visual Studio 2012 currently, and compile using C (not C++); I also
remember having similar results with Visual 2010 previously.
Never tested Studio 2013 though...
Original comment by yann.col...@gmail.com
on 20 Jan 2014 at 8:31
The following attached file is a release candidate version implementing your
suggested improvement.
Original comment by yann.col...@gmail.com
on 23 Jan 2014 at 9:33
Looks good to me.
This speeds up 32-bit GCC as expected.
Original comment by valery.c...@gmail.com
on 23 Jan 2014 at 10:04
Yes, quite an impressive gain for this configuration.
Original comment by yann.col...@gmail.com
on 23 Jan 2014 at 10:05
The proposed improvement has been integrated into r113.
Original comment by yann.col...@gmail.com
on 4 Feb 2014 at 2:13
Original issue reported on code.google.com by
valery.c...@gmail.com
on 19 Jan 2014 at 2:02