segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

[Bug] parquet.ReadFile only reads 1024 rows #469

Closed by vedantroy 1 year ago

vedantroy commented 1 year ago

The attached file, debug.zip (a zipped Parquet file), has far more than 1024 rows (around 9K).

But if I read it using the following code:

type RowType struct {
    Name string `parquet:",optional"`
    Data []byte `parquet:",optional"`
}

rows, err := parquet.ReadFile[RowType](file)

and print the number of rows, I get only 1024. Other tools, such as the pandas library in Python, don't have this issue.

mikedewar commented 1 year ago

I have the same issue, but reading stops after 26215 rows.

Looks like issue #471 is similar?

I create a (large) file using something like

fh, err := os.Create(fname)
...
writer = parquet.NewGenericWriter[OutType](fh)
nrows := 1000000
for i := 0; i < nrows; i++ {
    ...
    writer.Write(toWrite)
}
writer.Close()

then, later, I read using

rows, err := parquet.ReadFile[OutType](fname)
...
assert.Equal(t, nrows, len(rows))

len(rows) comes out to 26215 on my MacBook Pro, whereas nrows is 1000000.

danwt commented 1 year ago

I have the same issue. The file should yield more than 9,000,000 rows, but I only get 131,000.

danwt commented 1 year ago

Is the read stopping silently when it runs out of memory (OOM)?