segmentio / parquet-go

Go library to read/write Parquet files
https://pkg.go.dev/github.com/segmentio/parquet-go
Apache License 2.0

Parallel I/O #300

Closed: achille-roussel closed 2 years ago

achille-roussel commented 2 years ago

This PR adds a new pio package, along with APIs to support performing parallel I/O on parquet files. The intent is to be able to amortize the cost of I/O latency when reading multiple file sections (e.g. when loading pages from multiple columns).

I am opening this PR against #297; it is only a building block that I intend to use to address performance issues when reading parquet files.

The key change is the introduction of pio.MultiReadAt(io.ReaderAt, []Op), which is the core API that the implementation will rely on. I also added platform-specific implementations of this API, leveraging async I/O operations on Linux and Darwin, as well as a generic fallback mechanism using the Go runtime. Finally, I added an extension mechanism supported by implementations of the pio.ReaderAt interface, and a test suite in the pio/piotest package to validate the behavior of custom implementations of that interface.

achille-roussel commented 2 years ago

I'm going to close this, abandoning this approach in favor of #301