segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 58 forks source link

nil pointer panic in `decodeDataPage()` #369

Closed joe-elliott closed 2 years ago

joe-elliott commented 2 years ago

Seeing nil pointer panics in decodeDataPage() with some regularity:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x15948cb]

goroutine 112818 [running]:
github.com/segmentio/parquet-go.(*Column).decodeDataPage(0xc05e8fe4e0, {0x25d1e60?, 0xc063d18360?}, 0x5c5, 0x0, 0x0, 0xc06aa543f0, {0x0?, 0x0})
    /drone/src/vendor/github.com/segmentio/parquet-go/column.go:664 +0x8ab
github.com/segmentio/parquet-go.(*Column).decodeDataPageV2(0xc05e8fe4e0, {0x7f16d79cd0f8?}, 0xc0802c9d88?, {0x0, 0x0}, 0x1f8f960?)
    /drone/src/vendor/github.com/segmentio/parquet-go/column.go:611 +0x755
github.com/segmentio/parquet-go.(*filePages).readDataPageV2(0xc08d37a5a0?, 0xc06aa542d0?, 0xc06aa542d0?)
    /drone/src/vendor/github.com/segmentio/parquet-go/file.go:616 +0xa9
github.com/segmentio/parquet-go.(*filePages).ReadPage(0xc08d37a5a0)
    /drone/src/vendor/github.com/segmentio/parquet-go/file.go:515 +0xe5
github.com/segmentio/parquet-go.(*multiPages).ReadPage(0xc09ff6d9a0)
    /drone/src/vendor/github.com/segmentio/parquet-go/multi_row_group.go:223 +0x3f
github.com/segmentio/parquet-go.readPages({0x25bae30, 0xc09ff6d9a0}, 0xc0568df8c0, 0xc058a68fc0, 0xc0584fd5c0)
    /drone/src/vendor/github.com/segmentio/parquet-go/page.go:219 +0xc9
created by github.com/segmentio/parquet-go.(*asyncPages).init
    /drone/src/vendor/github.com/segmentio/parquet-go/page.go:161 +0x16a

Seems to be tripping on this:

   repetitionLevels.data,

Reviewing decodeDataPageV2 there are definitely code paths where both definitionLevels and repititionLevels are not set, but I don't understand the implications of this or how to handle scenarios where they are not.

joe-elliott commented 2 years ago

So we recently upgraded our parquet-go dependency: https://github.com/grafana/tempo/pull/1776

The upgrade was from: f812768dfa62 -> 948ea8c65f19

This issue is new as of this upgrade. Also interestingly this issue has diminished: https://github.com/segmentio/parquet-go/issues/332

suggesting a possible relationship between the two?

achille-roussel commented 2 years ago

Thanks for reporting, I'll look into it this today!

achille-roussel commented 2 years ago

I believe the issue might have been introduced by https://github.com/segmentio/parquet-go/pull/341

Prior to this change, the same condition of c.maxRepetitionLevel > 0 was used to test if the repetition levels must be decoded, and if a repeated page must be created.

After the change, the first condition was changed to test whether the repetition level is empty.

In the file you have, there may be empty repetition levels with a max repetition level value greater than zero for the column?

I'm not sure how that's possible, it would seem like an invalid condition for a parquet file, but maybe this happens under some conditions.

Are you able to open the file with a program like the standard parquet-tools?

joe-elliott commented 2 years ago

Ok, so by commenting out some code I have determined that this issue and #332 are both caused by *bufio.Reader pooling. This change makes both problems disappear:

func getBufioReader(r io.Reader, bufferSize int) (*bufio.Reader, *sync.Pool) {
    // pool := getBufioReaderPool(bufferSize)
    // rbuf, _ := pool.Get().(*bufio.Reader)
    // if rbuf == nil {
    rbuf := bufio.NewReaderSize(r, bufferSize)
    // } else {
    //  rbuf.Reset(r)
    // }
    return rbuf, nil
}

A quick review of the code suggests this could be caused by incorrectly using the filePages object after Close() is called (if the timing is just right). Unsure if these issues are a misuse in our code or a bug in the way the bufio.Readers() are pooled in parquet-go. Still digging.

joe-elliott commented 2 years ago

This issue was created by a bad return from Tempo. Linked above but I'll include it here as well:

https://github.com/grafana/tempo/pull/1797

If anyone else is seeing similar issues, I would recommend checking the above PR.