qedus / osmpbf

OpenStreetMap PBF file format parser in Go.
MIT License

Reuse size, header, blob decode byte buffers #15

Closed paulmach closed 7 years ago

paulmach commented 8 years ago

On my amd64 MacBook Pro with go1.7:

benchmark                       old ns/op      new ns/op      delta
BenchmarkDecode-4               2644119879     1732578144     -34.47%
BenchmarkDecodeConcurrent-4     2772070000     1807731956     -34.79%

go1.6

benchmark                       old ns/op      new ns/op      delta
BenchmarkDecode-4               3263413334     1998678477     -38.75%
BenchmarkDecodeConcurrent-4     3355724739     2145554965     -36.06%

Benchmarks were run with go test -bench . --benchtime 10s, so the numbers are averages of 3-5 runs.

I don't know why this is such a large improvement. I tried a bunch of other, similar things and nothing else changed the performance like this.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+6.4%) to 83.014% when pulling d14b73d1c237e97a216e002eb0adcb25c73e5903 on paulmach:resuse-buffers into 77520ebfd6e8d3bc876f14307b2576c951a48da4 on qedus:master.

paulmach commented 8 years ago

Updated. The 32 MB is an upper bound, so the buffer never has to grow. We could use bytes.Buffer to handle a lot of that, but I think the 32 MB is fine since it's allocated only once.
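
For illustration, a minimal sketch of what a fixed upper-bound preallocation can look like; the names here are assumptions, not necessarily this PR's actual code. The 32 MB bound comes from the OSM PBF format, which caps the uncompressed size of a blob.

// Sketch only: assumed names, not necessarily the PR's actual code.
const maxBlobSize = 32 * 1024 * 1024 // OSM PBF caps uncompressed blobs at 32 MB

type Decoder struct {
	buf []byte // reused for every blob read
}

func (d *Decoder) buffer(size int) []byte {
	if d.buf == nil {
		d.buf = make([]byte, maxBlobSize) // one allocation, never grows
	}
	return d.buf[:size]
}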

paulmach commented 8 years ago

I should add, to your point, that in the file I tested on the sizes were in the 2 MB range.

AlekSi commented 8 years ago

Every change breaks someone's workflow. 😄 People may run hundreds of decoders in parallel and may not expect increased memory consumption. So yeah, let's use bytes.Buffer.Grow().

Great improvement!
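
A minimal sketch of the Grow() idea; the ~2 MB size hint is an assumption picked for illustration, roughly matching the blob sizes mentioned above:

var buf bytes.Buffer // one per decoder, from the standard "bytes" package
buf.Grow(2 << 20)    // reserve ~2 MB up front as a hint; the buffer can
                     // still grow past this if a larger blob arrives
buf.Reset()          // reuse between blobs; Reset keeps the capacity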

coveralls commented 8 years ago

Coverage Status

Coverage increased (+6.5%) to 83.106% when pulling 8842806ad963d55731f5590a04b28d075f92ab87 on paulmach:resuse-buffers into 77520ebfd6e8d3bc876f14307b2576c951a48da4 on qedus:master.

paulmach commented 8 years ago

Updated so that the header and blob buffers use bytes.Buffer. Unless I'm doing something totally wrong, I'm not seeing any speed improvement anymore:

benchmark                       old ns/op      new ns/op      delta
BenchmarkDecode-4               1721956082     2514611990     +46.03%
BenchmarkDecodeConcurrent-4     1795521119     2687821585     +49.70%

This is what I was seeing in general: this one change was a huge win, while other, similar changes had minimal, if any, impact. I don't know why.

jongillham commented 8 years ago

@paulmach thanks for all the work. I haven't looked in depth, but are there concurrency issues with this? I think it is possible for the buffers to be accessed from multiple goroutines, and I am not sure that is safe.

Have you thought about trying sync.Pool?
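
For reference, a sketch of what a sync.Pool of byte buffers could look like; blobBufPool and the helpers are hypothetical names, not existing osmpbf code:

import "sync"

// Hypothetical sketch of the sync.Pool idea; these names do not exist in osmpbf.
var blobBufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 1<<20) },
}

func getBuf(size int) []byte {
	b := blobBufPool.Get().([]byte)
	if cap(b) < size {
		b = make([]byte, size) // pooled buffer too small; replace it
	}
	return b[:size]
}

func putBuf(b []byte) {
	blobBufPool.Put(b[:0]) // return with zero length, capacity retained
}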

paulmach commented 8 years ago

Yes, this is safe, tests also pass with go test -race.

decoder.Start() launches a goroutine that sequentially reads the fileblocks/blobs and passes them to the concurrent "decoders". The buffer being reused here holds the file byte data before it gets protobuf-decoded into a *OSMPBF.Blob. Any decoded data is copied into the proto struct by the proto.Unmarshal method (I tested this and looked at the code), i.e. when protobuf decodes the bytes it copies most of them into the protobuf struct's blob.ZlibData []byte attribute. It is this copy that the decoders then use, not any part of the reused buffer.
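
A toy, runnable illustration of that pattern (not the actual osmpbf code): one reader goroutine reuses a single buffer, and consumers only ever receive independent copies of its contents.

package main

import "fmt"

func main() {
	out := make(chan []byte)
	go func() {
		buf := make([]byte, 8) // reused for every "block"
		for i := 0; i < 3; i++ {
			for j := range buf {
				buf[j] = byte('a' + i) // read the next block into buf
			}
			cp := make([]byte, len(buf))
			copy(cp, buf) // like proto.Unmarshal, hand out a copy
			out <- cp     // consumers never see buf itself
		}
		close(out)
	}()
	for b := range out {
		fmt.Println(string(b)) // aaaaaaaa, bbbbbbbb, cccccccc
	}
}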

It would be interesting to use a sync.Pool for this blob.ZlibData, and then for the decoded zlib data (before it gets protobuf-decoded again), but when I tried other, similar improvements they didn't make much, if any, difference.

Again, I don't know why preallocating a full 32 MB buffer results in such a dramatic speed improvement, but it is significant.

Note that the speed improvement is in the first commit in this PR. The second commit does the cleaner implementation that does not result in much of a change in performance.

AlekSi commented 8 years ago

It all looks very interesting, I will take another look this week.

paulmach commented 8 years ago

I rolled back the bytes.Buffer stuff that didn't make a difference to performance.

What this pull now requests is the preallocation of the buffers, which saves ~33% on the benchmarks.

jongillham commented 8 years ago

I don't quite like the idea of preallocating such a large 32 MB buffer when most uses will probably not require anywhere near that amount. So I tried:

// buffer returns a slice of at least size bytes, reusing and
// growing d.buf across calls instead of allocating each time.
func (d *Decoder) buffer(size int) []byte {
	if len(d.buf) < size {
		d.buf = make([]byte, size)
	}
	return d.buf[:size]
}

I started with an initial buffer size of 0 bytes. With the test data, the buffer was grown 10 times, to a maximum of 500 KB, yet I only saw a 2-3% decrease in execution time. If I initialized the buffer big enough that it was never reallocated (i.e. 500 KB), I still only saw maybe a 10-14% decrease. However, with the changes made in this pull request I saw the 40% decrease in execution time described by @paulmach, which is pretty spectacular! I saw similar speeds using one member variable buf instead of separate sizeBuf, headerBuf and blobBuf buffers.

@AlekSi what are your thoughts on preallocating a maxBlobSize buffer for the decoder to use? I don't see why having a gradually increasing buffer size negates most performance improvements.

AlekSi commented 7 years ago

So I took a look. And then again. And again. In the end, I exposed a method to set buffer size. See #19.

I don't see why having a gradually increasing buffer size negates most performance improvements.

Me neither. And that bothers me.

AlekSi commented 7 years ago

#19 was merged. @paulmach, thanks again.