segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Page offset buffers not repooled #372

Closed: mdisibio closed this issue 2 years ago

mdisibio commented 2 years ago

Related to the memory profiling done in #370, it looks like page offset buffers are meant to be pooled, but the pooling is not working as expected. I suspect this logic, which swaps the slices.

Testing methodology: Added some simple instrumentation to bufferPool to track outstanding allocations via sum(gets) - sum(puts), and added a print statement here to print the before and after capacities when the offset buffer is swapped.
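
For reference, a minimal standalone sketch of that kind of get/put counting (countingPool and its field names are made up for illustration; this is not the library's actual bufferPool):

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// countingPool wraps a sync.Pool of byte slices and counts gets and puts,
// so outstanding allocations can be reported as sum(gets) - sum(puts).
type countingPool struct {
    pool sync.Pool
    gets int64
    puts int64
}

func (p *countingPool) get(size int) []byte {
    atomic.AddInt64(&p.gets, 1)
    if b, _ := p.pool.Get().([]byte); cap(b) >= size {
        return b[:size]
    }
    return make([]byte, size)
}

func (p *countingPool) put(b []byte) {
    atomic.AddInt64(&p.puts, 1)
    p.pool.Put(b)
}

func (p *countingPool) outstanding() int64 {
    return atomic.LoadInt64(&p.gets) - atomic.LoadInt64(&p.puts)
}

func main() {
    p := new(countingPool)
    b := p.get(1024)
    p.put(b)
    fmt.Println("outstanding buffers:", p.outstanding()) // prints 0
}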

Results: The pageOffsetsBufferPool allocates much more than expected (several GB). The other pools have reasonable allocation totals (1K to ~30MB).

When obuf.data is swapped to a slice of a different size, the new slice has zero capacity (and therefore zero length), as shown in the output below. Unsure if this is specific to our schema or even relevant, but wanted to mention it.

prevCap := cap(obuf.data)
obuf.data = unsafecast.Uint32ToBytes(pageOffsets)
nowCap := cap(obuf.data)
if prevCap != nowCap {
    fmt.Println("offset buffer swapped", prevCap, nowCap)
}

prints:

offset buffer swapped 4194304 0
offset buffer swapped 2097152 0
offset buffer swapped 2097152 0
offset buffer swapped 2097152 0

Solution: Tried this simple fix, but it doesn't quite work; our tests surface some data inconsistencies afterwards.

if obuf != nil {
    obuf = &buffer{
          pool: obuf.pool,
          data: unsafecast.Uint32ToBytes(pageOffsets),
    }
}

Hacking a little more does work, but I'm not familiar enough with this area to put together a proper solution. This change results in 50+% fewer allocations.

if obuf != nil {
    if cap(obuf.data) != cap(pageOffsets)*4 {
          obuf = &buffer{pool: obuf.pool}
    }
    obuf.data = unsafecast.Uint32ToBytes(pageOffsets)
}
achille-roussel commented 2 years ago

Hello @mdisibio

I believe this issue might be isolated to Go 1.17 and the use of unsafe.Slice in unsafecast.BytesToUint32, which loses the capacity information.
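
As a standalone illustration of that kind of capacity loss (this is not the actual unsafecast code, just a len-based reinterpretation written for the example):

package main

import (
    "fmt"
    "unsafe"
)

// bytesToUint32 reinterprets b as []uint32 but sizes the result from len(b),
// so whatever spare capacity b carried is no longer visible to the caller.
func bytesToUint32(b []byte) []uint32 {
    if len(b) == 0 {
        return nil
    }
    return unsafe.Slice((*uint32)(unsafe.Pointer(&b[0])), len(b)/4)
}

func main() {
    b := make([]byte, 16, 4096) // 16 bytes in use, 4 KiB of capacity
    u := bytesToUint32(b)
    fmt.Println(len(u), cap(u)) // 4 4: the extra capacity of b is gone
}

Reslicing to cap before and after the conversion, as in the diff below, is one way to keep the pooled buffer's full capacity visible across the round trip.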

A different fix could be this:

diff --git a/column.go b/column.go
index d511623..520ee77 100644
--- a/column.go
+++ b/column.go
@@ -636,7 +636,7 @@ func (c *Column) decodeDataPage(header DataPageHeader, numValues int, repetition
        if pageKind == ByteArray {
                obuf = buffers.get(4 * (numValues + 1))
                defer obuf.unref()
-               pageOffsets = unsafecast.BytesToUint32(obuf.data)
+               pageOffsets = unsafecast.BytesToUint32(obuf.data[:cap(obuf.data)])
        }

        values := pageType.NewValues(pageValues, pageOffsets)
@@ -647,7 +647,8 @@ func (c *Column) decodeDataPage(header DataPageHeader, numValues int, repetition
                vbuf.data = pageValues
        }
        if obuf != nil {
-               obuf.data = unsafecast.Uint32ToBytes(pageOffsets)
+               obuf.data = unsafecast.Uint32ToBytes(pageOffsets[:cap(pageOffsets)])
+               obuf.data = obuf.data[:4*len(pageOffsets)]
        }
        if err != nil {
                return nil, err

I would recommend upgrading to Go 1.18+ though; the unsafecast package is written with generics there and does not suffer from the same issues.
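
For contrast, a rough sketch of a capacity-preserving conversion built on a generic helper (illustrative only; reinterpret is a made-up function, not the library's actual generics-based unsafecast):

package main

import (
    "fmt"
    "unsafe"
)

// reinterpret views a []From as a []To over the same backing array, deriving
// both length and capacity from the source slice so no capacity is dropped.
func reinterpret[To, From any](s []From) []To {
    if cap(s) == 0 {
        return nil
    }
    sizeFrom := unsafe.Sizeof(*new(From))
    sizeTo := unsafe.Sizeof(*new(To))
    full := s[:cap(s)]
    out := unsafe.Slice((*To)(unsafe.Pointer(&full[0])), uintptr(cap(s))*sizeFrom/sizeTo)
    return out[:uintptr(len(s))*sizeFrom/sizeTo]
}

func main() {
    b := make([]byte, 16, 4096)
    u := reinterpret[uint32](b)
    fmt.Println(len(u), cap(u)) // 4 1024: the capacity survives the conversion
}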

And when Go 1.20 gets released we will need to have more serious discussions about dropping support for 1.17.

mdisibio commented 2 years ago

Hi @achille-roussel, as far as I know Tempo has been updated to 1.19, and I am running 1.19.1 locally. I tested the proposed lines above and didn't see an improvement. Happy to provide more info or try things out.

achille-roussel commented 2 years ago

Thanks for confirming, sorry for making the wrong assumption here.

I'll dig further to understand why the capacity information gets lost.

Super happy to hear that you were able to do the upgrade to 1.19 as well!

achille-roussel commented 2 years ago

I would be curious to know if using the Release/Retain functions introduced in #380 will modify the behavior you observed here!

mdisibio commented 2 years ago

Hi @achille-roussel, as discussed offline, the new Release/Retain functionality did not change the observed behavior. I spent some more time in this area and was able to create a more formal code change. I think it also shows more clearly what I believe is happening (the page offset buffer is swapped and the original is lost), but I don't know why.

if obuf != nil {
    if unsafecast.PointerOf(obuf.data) != unsafecast.PointerOf(pageOffsets) {
        // The page offset buffer is being swapped.
        // Recreate obuf, which allows the original to be unrefed and returned to the pool.
        obuf = &buffer{pool: obuf.pool, data: unsafecast.Uint32ToBytes(pageOffsets)}
    }
}

This change yields significant memory allocation improvements on top of the benchmarks in Tempo #1818:

name                                      old alloc/op   new alloc/op   delta
BackendBlockSearchTraces/noMatch-12          333MB ±11%      50MB ±17%  -84.97%  (p=0.000 n=10+10)
BackendBlockSearchTraces/partialMatch-12     605MB ± 2%     311MB ± 5%  -48.49%  (p=0.000 n=9+9)
BackendBlockSearchTraces/service.name-12    3.67MB ± 2%    1.51MB ± 3%  -58.73%  (p=0.000 n=10+10)
mdisibio commented 2 years ago

As discussed offline, the real root cause is that page offset buffers were being unnecessarily allocated for dictionary-encoded byteArray columns. Submitted #398.

Want to mention that #386 still yielded a minor CPU improvement and may be worth merging (although it was neutral on memory usage). Here is a benchmark comparison:

name                                      old time/op    new time/op    delta
BackendBlockSearchTraces/noMatch-12          204ms ± 6%     198ms ±31%    ~     (p=0.400 n=9+10)
BackendBlockSearchTraces/partialMatch-12     2.57s ± 2%     2.51s ± 2%  -2.38%  (p=0.002 n=9+8)
BackendBlockSearchTraces/service.name-12    1.57ms ±12%    1.46ms ±12%  -6.94%  (p=0.028 n=9+10)

name                                      old alloc/op   new alloc/op   delta
BackendBlockSearchTraces/noMatch-12          342MB ± 4%     332MB ± 4%  -2.96%  (p=0.011 n=9+8)
BackendBlockSearchTraces/partialMatch-12     601MB ± 1%     607MB ± 2%  +1.11%  (p=0.031 n=9+9)
BackendBlockSearchTraces/service.name-12    3.67MB ± 1%    3.69MB ± 2%    ~     (p=0.460 n=8+10)

name                                      old allocs/op  new allocs/op  delta
BackendBlockSearchTraces/noMatch-12           119k ± 1%      117k ± 4%  -1.73%  (p=0.021 n=8+8)
BackendBlockSearchTraces/partialMatch-12     18.0M ± 0%     18.0M ± 0%    ~     (p=0.438 n=8+9)
BackendBlockSearchTraces/service.name-12     39.2k ± 0%     39.2k ± 0%    ~     (p=0.403 n=10+10)