segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Missing rows after read columns with multiple pages #471

Closed alexmatsak closed 1 year ago

alexmatsak commented 1 year ago

Hi! It seems there is a bug in the Reader, or maybe I'm doing something wrong. Details:

I'm using this code to read rows from an in-memory Parquet buffer, and it works fine as long as each column has only 1 page.

f, err := parquet.OpenFile(r, size)
if err != nil {
    return nil, err
}

rows := make([]T, f.NumRows())
if _, err := parquet.NewGenericReader[T](f).Read(rows); err != nil {
    return nil, err
}

When a second page appeared (in my case only for a single column; I didn't check with multiple), because of this line my slice contains only 1674 elements.

$ parquet pages ~/my-awesome-file.parquet

Column: Column1.key_value.value.SomeStruct.SomeField
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  _ D  1674    141.38 B   231.128 kB 1664     0
  0-1    data  _ D  23      143.39 B   3.221 kB   23       0

Column: Column2
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-0    data  _ _  1687    8.00 B     13.180 kB  1687     0
.
.
.

Just for testing purposes, I switched to this code, and the second call fetched the missing 23 rows.

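f, err := parquet.OpenFile(r, size)
if err != nil {
    return nil, err
}

// Same reader construction as in the first snippet, kept in a variable
// so that Read can be called more than once.
reader := parquet.NewGenericReader[T](f)

rows := make([]T, f.NumRows())
n, err := reader.Read(rows)
if err != nil {
    return nil, err
}

// The first call returned fewer rows than requested: read again into the
// unfilled tail of the slice.
if n != len(rows) {
    n, err = reader.Read(rows[n:])
    if err != nil {
        return nil, err
    }
}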
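
The single retry above only covers a column with exactly two pages. A more general approach is to keep calling Read until the slice is full or the reader reports the end of the data. The sketch below assumes the reader behaves like io.Reader: a call may return fewer rows than requested, and io.EOF marks the end, possibly together with the last rows. The names rowReader, readAll and pagedReader are hypothetical and are not part of parquet-go; pagedReader is a fake that returns at most one page per call, so the example runs without the library.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// rowReader is a hypothetical stand-in for any reader whose Read may return
// fewer rows than requested, for example one that stops at page boundaries.
type rowReader[T any] interface {
	Read(rows []T) (int, error)
}

// readAll calls Read repeatedly until rows is full or the reader returns io.EOF.
// It returns the number of rows actually read.
func readAll[T any](r rowReader[T], rows []T) (int, error) {
	total := 0
	for total < len(rows) {
		n, err := r.Read(rows[total:])
		total += n // count rows even when they arrive together with io.EOF
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			return total, err
		}
		if n == 0 {
			// Guard against looping forever on a reader that makes no progress.
			return total, io.ErrNoProgress
		}
	}
	return total, nil
}

// pagedReader is a fake reader that hands out at most one page per Read call.
type pagedReader struct {
	pages [][]int
}

func (p *pagedReader) Read(rows []int) (int, error) {
	if len(p.pages) == 0 {
		return 0, io.EOF
	}
	n := copy(rows, p.pages[0])
	if n < len(p.pages[0]) {
		p.pages[0] = p.pages[0][n:] // destination was too small, keep the rest
	} else {
		p.pages = p.pages[1:] // page fully consumed
	}
	return n, nil
}

func main() {
	// Two pages of 1664 and 23 rows, matching the row counts in the listing above.
	r := &pagedReader{pages: [][]int{make([]int, 1664), make([]int, 23)}}
	rows := make([]int, 1687)
	n, err := readAll[int](r, rows)
	fmt.Println(n, err) // prints "1687 <nil>"
}
```

With a real reader, the returned count can be used to trim the slice (rows[:n]) in case fewer rows were read than NumRows reported.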

Thanks in advance!