svent / sift

A fast and powerful alternative to grep
https://sift-tool.org
GNU General Public License v3.0

use klauspost's parallel gzip #38

Closed mnikhil-git closed 8 years ago

mnikhil-git commented 9 years ago

This uses klauspost's pgzip instead of the stdlib gzip for parallel reads.

mnikhil-git commented 9 years ago

@svent Could you share the sample data you used for benchmarking, and maybe the hardware specs and any benchmark scripts? I would like to run the same benchmarks against this change.

LarryBattle commented 9 years ago

Could you provide a benchmark?

mnikhil-git commented 8 years ago

I will need help from @svent for this

svent commented 8 years ago

Sorry for the late reply on this PR - I finally did some benchmarks on this.

Searching through 800 small .gz files (200 MB uncompressed):

Searching one big file (700 MB uncompressed):

So the PR actually makes sift slower. This is not because the library is bad (I guess one can find cases where its performance is slightly better). One reason is that sift is already designed for an optimal balance of CPU and IO load. Searching files in parallel in particular cannot benefit from it, as sift uses all CPU cores in that case anyway, and using pgzip just adds complexity. sift is simply not a good use case for that parallel gzip implementation.
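
To illustrate the point about parallel file search, here is a minimal Go sketch. It is not sift's actual code: `gzipBytes`, `countMatches`, and `searchAll` are hypothetical names, and in-memory buffers stand in for .gz files on disk. It assumes one worker per CPU core, each decompressing a whole stream with the stdlib `compress/gzip` reader. Once every core is busy with its own file, parallelizing the decompression inside a single stream has no idle cores left to use.

```go
package main

import (
	"bufio"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"runtime"
	"strings"
	"sync"
)

// gzipBytes compresses text in memory; it stands in for a .gz file on disk.
func gzipBytes(text string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write([]byte(text))
	zw.Close()
	return buf.Bytes()
}

// countMatches decompresses one gzip stream with the stdlib reader
// and counts the lines that contain pattern.
func countMatches(r io.Reader, pattern string) (int, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	n := 0
	sc := bufio.NewScanner(zr)
	for sc.Scan() {
		if strings.Contains(sc.Text(), pattern) {
			n++
		}
	}
	return n, sc.Err()
}

// searchAll distributes the inputs over one worker per CPU core.
// Parallelism happens across files, not inside a single gzip stream.
func searchAll(inputs [][]byte, pattern string) (int, error) {
	jobs := make(chan []byte)
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		total    int
		firstErr error
	)
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for data := range jobs {
				n, err := countMatches(bytes.NewReader(data), pattern)
				mu.Lock()
				total += n
				if err != nil && firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}()
	}
	for _, in := range inputs {
		jobs <- in
	}
	close(jobs)
	wg.Wait()
	return total, firstErr
}

func main() {
	// Eight small "files", each with two lines containing "foo".
	inputs := make([][]byte, 8)
	for i := range inputs {
		inputs[i] = gzipBytes("foo\nbar\nfoo bar\n")
	}
	total, err := searchAll(inputs, "foo")
	fmt.Println(total, err) // prints "16 <nil>"
}
```

With a single large file the picture differs, since only one worker has anything to decompress; whether a parallel reader helps there depends on how the remaining work (IO, matching) is already balanced.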