sonjeheon / lz4

Automatically exported from code.google.com/p/lz4

ARM performance #7

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Any ideas for improving compression performance on ARM?  I'm running some 
benchmarks on an OpenVPN packet dataset (IPv4 packets, < 1500 bytes each, 
reflecting real-world web browsing sessions), and LZ4 is taking twice as long 
as LZO to achieve roughly comparable compression ratios.

James

Original issue reported on code.google.com by caprifin...@gmail.com on 30 Jan 2012 at 9:24

GoogleCodeExporter commented 9 years ago
I'll handle this one.

Original comment by yann.col...@gmail.com on 30 Jan 2012 at 9:44

GoogleCodeExporter commented 9 years ago
Good point. ARM performance requires an ARM testbed for proper evaluation.
It cost me a lot of time to build one, but I guess I've got an almost working 
one today. I will look into it.

Original comment by yann.col...@gmail.com on 30 Jan 2012 at 9:45

GoogleCodeExporter commented 9 years ago
I'm using the BeagleBone running Ubuntu 11 as my testbed.  Under US$100.

James

Original comment by caprifin...@gmail.com on 30 Jan 2012 at 8:15

GoogleCodeExporter commented 9 years ago
As a quick test, you may want to disable the extra precaution the code takes 
for CPUs that require strictly aligned memory access (such as many ARM cores), 
if *and only if* your ARM board does indeed support unaligned access.

I made a test with an ARM Cortex A8, which apparently supports unaligned 
access, and it increased speed by almost 50%.

To disable the precaution, it's enough to modify these lines :

#ifdef __GNUC__
//#define _PACKED __attribute__ ((packed))
#define _PACKED
#else
#define _PACKED
#endif
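
For readers wondering what the precaution does, here is a minimal sketch, assuming GCC or a compatible compiler. The names DEMO_PACKED, demo_u32, read32_packed and read32_memcpy are hypothetical and only illustrate the idea: a packed struct has an alignment of 1, so the compiler must generate a read that is safe at any address, which tends to be slower on strict-alignment ARM cores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical macro name; it plays the role of _PACKED in the lines above. */
#ifdef __GNUC__
#  define DEMO_PACKED __attribute__((packed))
#else
#  define DEMO_PACKED   /* other compilers need their own mechanism */
#endif

/* Packed: the compiler may not assume the address is a multiple of 4,
   so it emits a read that works at any address. */
typedef struct { uint32_t v; } DEMO_PACKED demo_u32;

/* Read 32 bits through the packed struct (the "precaution" path). */
static uint32_t read32_packed(const void *p)
{
    assert(p != NULL);
    return ((const demo_u32 *)p)->v;
}

/* Portable reference: memcpy is alignment-safe on every compiler. */
static uint32_t read32_memcpy(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

With GCC, both functions should return the same value for any address inside a byte buffer, including odd ones such as buf + 1. Defining the macro as empty removes that guarantee and leaves a plain word load, which is only safe when the CPU tolerates unaligned access.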

Regards

Original comment by yann.col...@gmail.com on 1 Feb 2012 at 10:37

GoogleCodeExporter commented 9 years ago
A proposed release candidate has been sent to your email.

It adds 2 features which may be of interest for your use case :

1) ARM and Unaligned Memory Access :

By default, LZ4 is very cautious with ARM processors, and entirely avoids the 
“unaligned memory” problem.
However, some newer ARM CPUs are now able to handle unaligned memory access 
properly.
This makes a critical performance difference.

However, this new capability is not automatically detected by today’s 
compilers, or at least, I guess, by most of them.
A very recent predefined macro, called __ARM_FEATURE_UNALIGNED, has been 
contributed by ARM to GCC.
I’ve integrated it, but unfortunately it is too recent to be properly 
supported by the current crop of compilers; maybe the next generation will.

Therefore, the only way to benefit from this feature is to manually instruct 
the code to use it.
This can be done more easily now, with the following lines :

// Unaligned memory access ?
// This feature is automatically enabled for "common" CPU, such as x86.
// For others CPU, you may want to force this option manually to improve performance if your target CPU supports unaligned memory access
#if (__ARM_FEATURE_UNALIGNED)
#define CPU_UNALIGNED_ACCESS 1
#endif

You can force the detection to “1”, and it will gladly use unaligned memory 
access.
On the ARM Cortex A8 used for test, it resulted in a 50% performance increase.
On processors which do not support unaligned memory access, it will crash.
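
As an illustration only, the manual override could also be driven from the build command line rather than by editing the source. The sketch below assumes GCC or a compatible compiler; the #ifndef guard, the x86 checks, and the names word32, load32 and unaligned_fast_path_enabled are assumptions made for this example and are not part of the release candidate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: allow -DCPU_UNALIGNED_ACCESS=1 (or =0) on the command line,
   and otherwise fall back to automatic detection. */
#ifndef CPU_UNALIGNED_ACCESS
#  if defined(__ARM_FEATURE_UNALIGNED) || defined(__i386__) || defined(__x86_64__)
#    define CPU_UNALIGNED_ACCESS 1
#  else
#    define CPU_UNALIGNED_ACCESS 0
#  endif
#endif

#if CPU_UNALIGNED_ACCESS
/* Fast path: ordinary struct, the compiler emits a plain word load.
   Only valid on misaligned addresses if the CPU really tolerates them. */
typedef struct { uint32_t v; } word32;
#elif defined(__GNUC__)
/* Safe path: packed struct, the read works at any address. */
typedef struct { uint32_t v; } __attribute__((packed)) word32;
#endif

/* Read 32 bits using whichever path was selected at compile time. */
static uint32_t load32(const void *p)
{
    assert(p != NULL);
#if CPU_UNALIGNED_ACCESS || defined(__GNUC__)
    return ((const word32 *)p)->v;
#else
    uint32_t v;                 /* no packed attribute available */
    memcpy(&v, p, sizeof v);
    return v;
#endif
}

/* Report which path this build uses (1 = unaligned access allowed). */
static int unaligned_fast_path_enabled(void)
{
    return CPU_UNALIGNED_ACCESS;
}
```

Under these assumptions, building with something like gcc -DCPU_UNALIGNED_ACCESS=1 would select the fast path without touching the file. The warning above still applies: forcing it on a CPU without unaligned access support will crash.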

2) Incompressible segments detection

LZ4 can skip over incompressible segments.
It is more cautious than LZO in doing so.
With LZO 1x_1 especially, the skipping is so strong that it can race through 
a large, perfectly compressible file on the grounds that one short segment was 
not compressible.
For very small packets, however, this weakness becomes a strength.

You can instruct LZ4 to skip incompressible segments faster, by lowering the 
confirmation level.
It is only necessary to modify this figure :

// NONCOMPRESSIBLE_CONFIRMATION :
// Increasing this value will make the algorithm search more before declaring a segment "incompressible"
// This could improve compression a bit, but will be slower on incompressible data
// Decreasing this value will make the algorithm declare its current segment "incompressible" much faster
// This may decrease compression ratio dramatically, but will be faster on incompressible data
// The default value (6) is recommended
#define NONCOMPRESSIBLE_CONFIRMATION 6

Finding the “optimal” value is a matter of use-case and test samples.
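
To get a feel for the trade-off, here is a toy model of an accelerating skip. It assumes that the search step grows by one byte after every 2^confirmation failed probes; that growth rule, the function name probes_to_skip and its parameters are illustrative assumptions, not code taken from LZ4.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical): count how many match probes are needed to walk
   `len` bytes of data in which no match is ever found.
   `confirmation` stands in for NONCOMPRESSIBLE_CONFIRMATION. */
static size_t probes_to_skip(size_t len, unsigned confirmation)
{
    size_t attempts, pos = 0, probes = 0;

    assert(confirmation < 16);            /* keep the shift well defined */
    attempts = (size_t)1 << confirmation; /* first step is 1 byte */

    while (pos < len) {
        /* The step grows by 1 after every 2^confirmation failed probes. */
        size_t step = attempts++ >> confirmation;
        pos += step;
        probes++;
    }
    return probes;
}
```

In this model, walking a 1500-byte packet with no matches takes fewer probes at level 2 than at level 6, since the step size ramps up sooner. The cost is the one described above: a compressible region that follows a short incompressible one is more likely to be stepped over.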

Best Regards

Original comment by yann.col...@gmail.com on 2 Feb 2012 at 9:22

GoogleCodeExporter commented 9 years ago
r54 has been published. It integrates the capability to manually force 
"unaligned memory access" on ARM processors which support it.

Original comment by yann.col...@gmail.com on 7 Feb 2012 at 5:04

GoogleCodeExporter commented 9 years ago
option added in r54

Original comment by yann.col...@gmail.com on 8 Feb 2012 at 12:33