t-mat / lz4mt

Platform independent, multi-threading implementation of lz4 stream in C++11

issues with OS X build #16

Open kimbauters opened 11 years ago

kimbauters commented 11 years ago

There are two issues with the build on OS X. Firstly, there is an error in src/lz4mt_compat.cpp: the untested code refers to count, but count does not exist. I assume this should be &c instead of &count. Secondly, LDFLAGS should not be "-lrt -pthread" on OS X, since -lrt is generally not supported there. I assume this needs to be changed to LDFLAGS = -pthread.
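A minimal sketch of how the linker flags could be chosen per platform, assuming the build can run a small shell step. The helper name pick_ldflags is hypothetical and not part of the lz4mt build; the project's Makefile may handle this differently.

```shell
# Hypothetical helper (not from the lz4mt sources): choose linker flags
# from the kernel name reported by `uname -s`.
pick_ldflags() {
    case "$1" in
        Darwin) echo "-pthread" ;;       # OS X: no -lrt, as reported above
        *)      echo "-lrt -pthread" ;;  # original flags for other platforms
    esac
}

LDFLAGS="$(pick_ldflags "$(uname -s)")"
echo "LDFLAGS = $LDFLAGS"
```

On OS X, `uname -s` prints Darwin, so only -pthread is selected; every other platform keeps the original flags.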

The resulting code can be compiled and seems to work as expected. I have compared against lz4c and the resulting archive can be decompressed and results in the original file. In terms of encoding speed, lz4mt is roughly 20% to 25% faster on a core i7 Haswell CPU (most likely because the single core performance is considerably boosted when using lz4c). However, decoding speed takes an 8% hit compared to lz4c. When using single thread mode, lz4mt is in general about 20% slower (for both encoding and decoding).

t-mat commented 11 years ago

Thanks for the report! I'll investigate this problem.

(1) I have two questions:

(2) Your fix for LDFLAGS and &count looks right.

(3) And your benchmark is very interesting. On Linux and Windows, lz4mt is N^0.5 .. N^0.8 times faster than lz4c (N: number of cores). I think my code has a synchronization problem.

kimbauters commented 11 years ago

Glad I could help.

(1) I am using a freshly compiled version of gcc 4.8.1. Everything is completely standard, so no graphite loops or any other fancy additions. See below for the output of gcc-4.8.1 -v:

Using built-in specs.
COLLECT_GCC=gcc-4.8.1
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin12.4.1/4.8.1/lto-wrapper
Target: x86_64-apple-darwin12.4.1
Configured with: .././configure --enable-languages=c,c++ --program-suffix=-4.8.1
Thread model: posix
gcc version 4.8.1 (GCC)

(3) The speedup is not too far off. My processor is an i7-4650U (http://ark.intel.com/products/75114), which is an ultra-low power dual core processor (with four threads). According to your estimate, I should see performance in the range of 140% to 175%. Perhaps my samples were worst-case scenarios. If you have any updated code in the future, I would be more than happy to test it for you and report back on the results.
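As a quick check of the range quoted above: taking N = 2 for the two physical cores in the N^0.5 .. N^0.8 estimate gives roughly 141% to 174%. This assumes the cores are what counts. Using the four hardware threads (N = 4) instead would give a higher range.

```latex
\[
2^{0.5} \approx 1.41, \qquad 2^{0.8} \approx 1.74
\]
\[
4^{0.5} = 2.00, \qquad 4^{0.8} = 2^{1.6} \approx 3.03
\]
```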

edit: until now, I had run the benchmark on those files for which I would use lz4. However, these files are already heavily compressed, which I assume is a worst-case scenario when it comes to the speed difference between the standard implementation of lz4 and lz4mt. I reran the benchmarks and now picked a more suitable test case. In the first test I tarred and then compressed a folder containing mostly PDF files, along with a few htm, gif and txt files. This gave me a performance benefit with lz4mt of roughly 70%, so clearly in the ballpark. The second test was run on a folder containing epub files (i.e. zip files) and their unzipped version (mostly html files). This gave a very appreciable speedup of 120%. In terms of decoding, the speedup was roughly 15% and 25%, respectively. It thus seems that the problem of the slower decoding was mainly to do with the kinds of files that I used in my initial benchmark (mostly tar.bz2 files). Nevertheless, it does seem weird that test-cases can be selected where using multiple cores actually results in lower performance.

t-mat commented 11 years ago

Here is my experiment (memo):

in short

TODO

Install gcc 4.8.1

$ brew install gcc48

$ gcc-4.8 -v
Using built-in specs.
COLLECT_GCC=gcc-4.8
COLLECT_LTO_WRAPPER=/usr/local/Cellar/gcc48/4.8.1/gcc/libexec/gcc/\
x86_64-apple-darwin11.4.2/4.8.1/lto-wrapper
Target: x86_64-apple-darwin11.4.2
Configured with: ../configure --build=x86_64-apple-darwin11.4.2 \
--prefix=/usr/local/Cellar/gcc48/4.8.1/gcc \
--datarootdir=/usr/local/Cellar/gcc48/4.8.1/share \
--bindir=/usr/local/Cellar/gcc48/4.8.1/bin \
--enable-languages=c --program-suffix=-4.8 --with-gmp=/usr/local/opt/gmp \
--with-mpfr=/usr/local/opt/mpfr --with-mpc=/usr/local/opt/libmpc \
--with-cloog=/usr/local/opt/cloog --with-isl=/usr/local/opt/isl \
--with-system-zlib --enable-libstdcxx-time=yes --enable-stage1-checking \
--enable-checking=release --enable-lto --disable-werror --enable-plugin \
--disable-nls --disable-multilib
Thread model: posix
gcc version 4.8.1 (GCC)

Check CPU Spec

$ sysctl -a | grep brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz

Benchmark on ramdisk

$ diskutil erasevolume HFS+ 'MyRamDisk512Mib' `hdiutil attach -nomount ram://1048576`
$ cd /Volumes/MyRamDisk512Mib
$ curl -O https://cs.fit.edu/~mmahoney/compression/enwik8.bz2
$ bzip2 -dk enwik8.bz2
$ ln -s /your/path/to/lz4c
$ ln -s /your/path/to/lz4mt

$ ./lz4c -c -y enwik8 enwik8.lz4c
*** LZ4 Compression CLI , by Yann Collet (Jul  9 2013) ***
Compressed 100000000 bytes into 56995506 bytes ==> 57.00%
Done in 0.68 s ==> 139.66 MB/s

$ ./lz4mt -c -y enwik8 enwik8.lz4mt
Total time: 0.31985sec

$ cmp -b enwik8.lz4c enwik8.lz4mt

$ ./lz4c -d -y enwik8.lz4c enwik8.out
*** LZ4 Compression CLI , by Yann Collet (Jul  9 2013) ***
Successfully decoded 100000000 bytes
Done in 0.34 s ==> 283.23 MB/s

$ cmp -b enwik8 enwik8.out

$ ./lz4mt -d -y enwik8.lz4c enwik8.out
Total time: 0.256552sec

$ cmp -b enwik8 enwik8.out

$ cd
$ hdiutil detach /Volumes/MyRamDisk512Mib
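
The timings in the log above can be turned into speedup ratios with a small helper. This is only a sketch: the function name speedup is hypothetical, and the lz4c times are the two-decimal values it printed, so the ratios are approximate.

```shell
# speedup BASELINE_SECONDS CANDIDATE_SECONDS
# Prints baseline / candidate, rounded to two decimals.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

speedup 0.68 0.31985    # compression:   lz4c time / lz4mt time
speedup 0.34 0.256552   # decompression: lz4c time / lz4mt time
```

With these numbers, lz4mt comes out at roughly 2.1x for compression and 1.3x for decompression on this machine for enwik8.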