marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License

High memory usage reading parquet file generated from DuckDB #255

Open niger-prequel opened 1 month ago

niger-prequel commented 1 month ago

Description

We're experiencing unexpectedly high memory usage when reading a parquet file using go-duckdb. The memory usage is orders of magnitude larger than the file being read. The issue arises during the final step, where we read a parquet file that DuckDB compacted from multiple smaller files. We also raised a parallel issue on the main repository because we were able to reproduce this with other clients.

Steps to Reproduce

Please refer to the provided repository, which includes a main.go file and the parquet files necessary to reproduce this issue. Clone the repository and follow the README instructions to set up and trigger the problem. We experience the high memory utilization on the final step, where we read the parquet file.

Expected Behavior

Memory usage should be proportional to the size of the parquet file being read, similar to executing the SQL commands directly without involving the DuckDB Golang driver.

Actual Behavior

Memory consumption spikes significantly on both our production Kubernetes cluster and our local machines, going well beyond the actual size of the parquet file. In our tests, the high memory usage occurs only when using the go-duckdb driver; direct SQL execution does not reproduce the issue. You can use the pure.sql script and the instructions in the README to run a version of this without the Go driver.
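To put a number on "well beyond the size of the file", one option is to compare the process's peak resident set size (RSS) with the size of the parquet file. The sketch below is not taken from the reproduction repository; the helper names `peakRSSBytes` and `rssToFileRatio` and the default file name `compacted.parquet` are hypothetical. It uses peak RSS rather than `runtime.MemStats` because memory allocated by native code through cgo is not part of the Go heap and would not show up in Go's own statistics. It assumes a Unix-like system (Linux or macOS).

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
)

// peakRSSBytes returns the peak resident set size of the current process.
// It covers all process memory, including allocations made outside the Go heap.
func peakRSSBytes() (int64, error) {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, err
	}
	peak := int64(ru.Maxrss)
	if runtime.GOOS != "darwin" {
		// Linux reports ru_maxrss in kilobytes; macOS reports bytes.
		peak *= 1024
	}
	return peak, nil
}

// rssToFileRatio reports how many times larger the peak RSS is than the file.
// It returns 0 when the file size is not positive.
func rssToFileRatio(peakRSS, fileSize int64) float64 {
	if fileSize <= 0 {
		return 0
	}
	return float64(peakRSS) / float64(fileSize)
}

func main() {
	// Hypothetical default path; pass the real parquet file as the first argument.
	path := "compacted.parquet"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	// The query that reads the parquet file would run here, before measuring.

	peak, err := peakRSSBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, "getrusage failed:", err)
		return
	}
	fmt.Printf("peak RSS: %d bytes\n", peak)

	info, err := os.Stat(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot stat file:", err)
		return
	}
	fmt.Printf("file size: %d bytes, peak RSS / file size: %.1fx\n",
		info.Size(), rssToFileRatio(peak, info.Size()))
}
```

Calling the measurement at the end of main.go and again after running pure.sql through a separate process would give two ratios that can be compared directly.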

Production Kubernetes Memory Monitoring

Screenshot 2024-07-25 at 2 05 08 PM

Memory Usage of Script on OSX

Screenshot 2024-07-25 at 2 26 30 PM

Screenshot 2024-07-25 at 2 36 49 PM

Environment

Go version: 1.21.7
DuckDB version: 1.0.0 and 0.10.0
go-duckdb version: 1.7.0
Operating System: Debian Buster and OSX Sonoma 14.5

Additional Information

Impact

This issue is causing significant resource allocation challenges in our production environment, leading to potential service disruptions.