golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

compress/bzip2: Slow performance #6754

Closed tomc603 closed 9 years ago

tomc603 commented 10 years ago
When decompressing file data from a bz2 compressed file, Go is much slower than other
popular languages' implementations.

For the simple sample program at http://play.golang.org/p/e0N9J8fsvz, I've included pprof
output for Go 1.1 and 1.2rc2. The program walks a directory of BZ2-compressed text log
files, opens each file, and passes the reader to a function that performs an io.Copy() to
ioutil.Discard. The results are almost exactly the same when processing each text line
from the reader with bufio.NewScanner().

Using large bufio readers or no bufio handling at all does not impact the overall
performance of the bzip2 functionality.

This test was conducted on 64bit Ubuntu Linux 13.10 using 64bit Go binaries. Go 1.2rc2
was downloaded directly from the Go site, whereas 1.1 was installed as an Ubuntu package.

Attachments:

  1. bztest-1.2rc2-discard.txt (5862 bytes)
  2. bztest-1.1-discard.txt (5859 bytes)
robpike commented 10 years ago

Comment 1:

Can you quantify 'much slower'?

Status changed to WaitingForReply.

tomc603 commented 10 years ago

Comment 2:

Sorry for the long delay. Work got in the way of writing a comparison between Go and
Python.
I have two test scenarios. The first is a 1GB file of data from /dev/zero, bzip2
compressed. The second is a 1GB file of data from /dev/urandom, also bzip2 compressed.
The first should be a best-case scenario for performance, since all of the data is RLE
encoded and the compressed file is a few hundred bytes. The second should be a
worst-case scenario, where the data is generally not compressible and the compressed
file is larger than the source.
Results:
Decompressing /home/tcameron/tmp/decompress/zeros.data.bz2
Go 1.1 Decompress time: 3.212 sec
Py 2.7 Decompress time: 3.070 sec
Decompressing /home/tcameron/tmp/decompress/random.data.bz2
Go 1.1 Decompress time: 528.765 sec
Py 2.7 Decompress time: 104.724 sec
Let's call the zeros.data.bz2 test even. Milliseconds for this file do not really
interest me. It is worth noting that Python's version is faster, but by less than a
quarter of a second; that could come down to any number of things, and I'm not
particularly interested in tracking them down.
The random.data.bz2 test is much more enlightening. Being slower by a factor of more
than 5 is surprising to me, and it equates to roughly 1.9MB/sec. I understand there
hasn't been much effort to optimize the bzip2 library for speed, so I figured my
real-world experience could be used to help the project in some way.
My actual use case for this is a syslog file parser, which I've been writing to replace
a Python script I previously wrote and to drive the lessons of Go into my brain. I see
very similar results with text file processing, but since I cannot offer the text files
themselves for others to test with, I've tried something a bit more reproducible.
These tests are being performed on a Lenovo T430 with an SSD, Intel Core i5-3320M CPU @
2.60GHz, and 8GB RAM while plugged into an AC power source. The Operating System is
Ubuntu 13.10 with Kernel 3.11.0-13-generic, x86_64 architecture.
To review the source of each test application, please review my Github repos:
https://github.com/tomc603/pycompresstest
https://github.com/tomc603/gocompresstest
remyoudompheng commented 10 years ago

Comment 3:

Go 1.2 is about 30% faster (revision cf3ee583c568). We're not there yet, but it's
already better. Can you have a look at it as well?
remyoudompheng commented 10 years ago

Comment 4:

Can you also give your method to produce random.data.bz2? Thanks.
tomc603 commented 10 years ago

Comment 5:

To produce random.data.bz2:
dd if=/dev/urandom of=random.data; bzip2 random.data
I will check out a newer revision of 1.2 and test again, but from previous
results discussed in the gonuts mailing list, there was a small difference.
Thanks all!
tomc603 commented 10 years ago

Comment 6:

After running the same tests with Go 1.2rc5 a couple of times just to confirm I'm not
crazy (still a possibility, though), it seems RLE-encoded data is actually decompressed
half as fast as in Go 1.1. For these particular tests I'm not seeing a 30% increase in
speed, though I am exercising the two most extreme cases.
Results:
Decompressing /home/tcameron/tmp/decompress/zeros.data.bz2
Go 1.1    Decompress time: 3.000 sec
Go 1.2rc5 Decompress time: 6.612 sec
Decompressing /home/tcameron/tmp/decompress/random.data.bz2
Go 1.1    Decompress time: 534.020 sec
Go 1.2rc5 Decompress time: 499.078 sec
rsc commented 10 years ago

Comment 7:

Labels changed: added go1.3maybe.

dsymonds commented 10 years ago

Comment 8:

Labels changed: added performance.

rsc commented 10 years ago

Comment 9:

Labels changed: added release-none, removed go1.3maybe.

rsc commented 10 years ago

Comment 10:

Labels changed: added repo-main.

davecheney commented 10 years ago

Comment 11:

Status changed to Accepted.

gopherbot commented 10 years ago

Comment 12:

CL https://golang.org/cl/131840043 mentions this issue.
jeffallen commented 10 years ago

Comment 13:

11% faster is not insignificant, but there's probably more performance to be squeezed
out... I'm looking (casually) for it now.
gopherbot commented 10 years ago

Comment 14:

CL https://golang.org/cl/131470043 mentions this issue.
gopherbot commented 9 years ago

CL https://golang.org/cl/13852 mentions this issue.

gopherbot commented 9 years ago

CL https://golang.org/cl/13853 mentions this issue.