segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0
341 stars 102 forks source link

Broken reader - index out of range #493

Closed danwt closed 1 year ago

danwt commented 1 year ago

image

https://github.com/segmentio/parquet-go/commit/219dfcfcd6f368b4467f45226885a9fc51e28f9c

Seems to have broken the reader code: getting index of range.

Thanks

bartleyg commented 1 year ago

Are you implementing your own ReadRows function? This looks like it read more than it should have. 64 then 128 like r.rowIndex is not being incremented or SeekToRow is not being called.

danwt commented 1 year ago

No I'm not implementing my own function

import (
    "github.com/segmentio/parquet-go"
    "io"
)

type ReaderAtSeeker interface {
    io.ReaderAt
    io.Seeker
}

// NewGenericReader returns a generic parquet reader. Wraps library New function because library
// function says it only needs ReaderAt, but really it also needs Seeker.
func NewGenericReader[T any](r ReaderAtSeeker, options ...parquet.ReaderOption) *parquet.GenericReader[T] {
    return parquet.NewGenericReader[T](r, options...)
}
func FromReaderAtSeeker(r parquetx.ReaderAtSeeker) (*MemStore, error) {

    pReader := parquetx.NewGenericReader[Row](r)

    s := New(pReader.NumRows())
    read := int64(0)
    nRows := pReader.NumRows()
    buf := make([]Row, parquetRowReadBatchSize)
    for read < nRows {
        num, err := pReader.Read(buf)
        read += int64(num)
        if err != nil && !errors.Is(err, io.EOF) {
            return nil, err
        }
        // very important to avoid copy here
        for i := 0; i < num; i++ {
            err = addRow(s, &buf[i])
        }
        if err != nil {
            return nil, err
        }
    }

    if err := pReader.Close(); err != nil {
        return nil, err
    }

    return s, nil
}
bartleyg commented 1 year ago

Thank you for sharing. I am able to reproduce this now and am doing some due diligence first in case there's a deeper level bug than the quick fix here.

danwt commented 1 year ago

Awesome thanks I will try it