zolstein closed this pull request 1 year ago
Apologies for making more work for you, but we've decided to move development of this project to a new organization at https://github.com/parquet-go/parquet-go to ensure its long-term success. We appreciate your contribution and would be grateful if you could reopen this PR there if it is still relevant.
Create an internal method on `Schema` that reconstructs a value from a caller-supplied `[][]Value` used as the columns. Use this internal method, rather than `Reconstruct`, when reading rows in `GenericReader`.
Calling `Reconstruct` on every row read, and constructing a new `[][]Value` each time, accounts in aggregate for the majority of allocations while reading parquet files and induces unnecessarily large GC overhead.
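
The idea can be illustrated with a minimal, self-contained sketch. It does not use the parquet-go API: `Value`, `Row`, `Record`, `reconstructAlloc`, and `reconstructReuse` are hypothetical stand-ins, and the real internal method in this PR may look quite different. The sketch only shows why passing in a reusable `[][]Value` removes per-row allocations.

```go
package main

import (
	"fmt"
	"testing"
)

// Value is a hypothetical stand-in for a parquet value tagged with its column index.
type Value struct {
	column int
	v      int64
}

// Row is a flat list of values in column order.
type Row []Value

// Record is the hypothetical Go struct being reconstructed from a row.
type Record struct{ Field1, Field2, Field3 int64 }

const numColumns = 3

// splitInto groups the row's values by column into the caller-provided buffer.
// Each column slice is truncated to length zero first so its backing array is reused.
func splitInto(columns [][]Value, row Row) {
	for i := range columns {
		columns[i] = columns[i][:0]
	}
	for _, val := range row {
		columns[val.column] = append(columns[val.column], val)
	}
}

// assign copies the first value of each column into the record.
func assign(rec *Record, columns [][]Value) {
	rec.Field1 = columns[0][0].v
	rec.Field2 = columns[1][0].v
	rec.Field3 = columns[2][0].v
}

// reconstructAlloc builds a fresh [][]Value on every call (the costly pattern).
func reconstructAlloc(rec *Record, row Row) {
	columns := make([][]Value, numColumns)
	splitInto(columns, row)
	assign(rec, columns)
}

// reconstructReuse takes the column buffer from the caller, who keeps it across rows.
func reconstructReuse(rec *Record, columns [][]Value, row Row) {
	splitInto(columns, row)
	assign(rec, columns)
}

func main() {
	row := Row{{0, 10}, {1, 20}, {2, 30}}
	var rec Record
	columns := make([][]Value, numColumns) // allocated once, reused for every row

	fresh := testing.AllocsPerRun(1000, func() { reconstructAlloc(&rec, row) })
	reused := testing.AllocsPerRun(1000, func() { reconstructReuse(&rec, columns, row) })
	fmt.Printf("allocs per row: fresh buffer=%v, reused buffer=%v\n", fresh, reused)
}
```

In this sketch the reader would own `columns` for its whole lifetime, so the per-row cost drops to copying values into already-allocated slices.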
Test code used to identify issues
`test.parquet` is a 768MB parquet file with 32M records.

```go
package main

import (
	"io"
	"log"
	"os"
	"runtime/pprof"

	"github.com/segmentio/parquet-go"
)

type TestStruct struct {
	Field1 int64
	Field2 int64
	Field3 int64
}

func main() {
	// Batch buffer reused across Read calls; each call fills up to 1024 rows.
	entries := make([]TestStruct, 1024)
	inFile, err := os.Open("test.parquet")
	if err != nil {
		log.Fatalf("failed to open parquet file: %v", err)
	}
	pr := parquet.NewGenericReader[TestStruct](inFile)
	// Read the whole file, discarding the rows; only allocations matter here.
	for {
		_, err := pr.Read(entries)
		if err == io.EOF {
			break
		} else if err != nil {
			log.Fatalf("failed to read parquet entries: %v\n", err)
		}
	}
	f, err := os.Create("mem.pprof")
	if err != nil {
		log.Fatalf("failed to create profile file: %v", err)
	}
	defer f.Close()
	// The "allocs" profile records all allocations made since program start.
	if err := pprof.Lookup("allocs").WriteTo(f, 0); err != nil {
		log.Fatalf("failed to write allocs profile: %v", err)
	}
}
```

Profile output (before change)
![profile016](https://github.com/segmentio/parquet-go/assets/7101542/810e36d4-0887-42ca-831c-573ea55da4e7)

Profile output (after change)
![profile014](https://github.com/segmentio/parquet-go/assets/7101542/db23e50f-eea9-460a-8c15-304a11bd77e6)