marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License

Possible memory leak when processing parquet files #81

Closed: goober closed this issue 1 year ago

goober commented 1 year ago

We have seen that memory usage keeps increasing when we process parquet files with the go-duckdb library. After we have processed a single parquet file and closed the DB connection, I would expect the memory used to be released. However, as the example graph below shows, that is not the case, and when the same source is processed repeatedly the memory keeps growing.

Expected behaviour

The memory used for querying a parquet file is released when it is finished.

Actual behaviour

Memory usage keeps increasing and never seems to be released. [image]

Reproduce

I have created a tiny project to reproduce the issue that can be found here: https://github.com/goober/duckdb-memory-leak

marcboeker commented 1 year ago

Hi @goober, thanks for bringing this up and for providing a repo with code to reproduce it. I haven't had the time to look into it yet, but two things come to mind.

  1. When I started writing this driver, I encountered similar behaviour and thought there was a memory leak. It turned out that Go is very greedy about memory, and the GC only releases it once it has to because something else is requesting the memory. This can take some time. Have you tried closing the DB connection, waiting for a while (with the program still running) and observing the memory usage? Maybe the memory will be freed after some time.

  2. Have you tried running your queries with the Python DuckDB driver and comparing the result with go-duckdb? Maybe this is not a memory leak on the Go/CGO side but rather in the DuckDB or Parquet part?
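To test the first point without waiting for the GC, one option is to force a collection and compare the Go heap before and after. The following is a minimal sketch using only the standard library; `heapStats`, `readHeap`, `allocate` and `measure` are hypothetical names made up for this example, and the byte buffers merely stand in for whatever the real program allocates. Note that `runtime.MemStats` only covers memory managed by the Go runtime. Memory allocated by DuckDB through cgo does not show up in it, so if `HeapAlloc` drops after a forced GC while the process's resident memory keeps growing, the growth is probably outside the Go heap.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// heapStats holds the two runtime counters this sketch looks at.
type heapStats struct {
	HeapAlloc    uint64 // bytes of allocated Go heap objects
	HeapReleased uint64 // bytes of Go heap memory returned to the OS
}

// readHeap takes a snapshot of the Go runtime's heap counters.
func readHeap() heapStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return heapStats{HeapAlloc: m.HeapAlloc, HeapReleased: m.HeapReleased}
}

// allocate creates n buffers of size bytes and touches every page
// so that the memory is actually committed.
func allocate(n, size int) [][]byte {
	bufs := make([][]byte, n)
	for i := range bufs {
		bufs[i] = make([]byte, size)
		for j := 0; j < size; j += 4096 {
			bufs[i][j] = 1
		}
	}
	return bufs
}

// measure allocates memory, drops it, forces a GC and returns the
// heap counters while the buffers were live and after they were freed.
func measure(n, size int) (during, after heapStats) {
	bufs := allocate(n, size)
	during = readHeap()
	runtime.KeepAlive(bufs) // keep the buffers reachable until here

	bufs = nil           // drop the only reference
	debug.FreeOSMemory() // force a GC and return freed memory to the OS
	after = readHeap()
	return during, after
}

func main() {
	during, after := measure(64, 1<<20) // roughly 64 MiB
	fmt.Printf("during: HeapAlloc=%d MiB HeapReleased=%d MiB\n",
		during.HeapAlloc>>20, during.HeapReleased>>20)
	fmt.Printf("after:  HeapAlloc=%d MiB HeapReleased=%d MiB\n",
		after.HeapAlloc>>20, after.HeapReleased>>20)
}
```

In a real reproduction, the call to `allocate` would be replaced by opening the database, running the query and closing the connection, with the snapshots taken around it and compared against the resident memory reported by the operating system.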

Once I have more time, I'll dig deeper into it.

goober commented 1 year ago

I have updated the repo with a second, Python-based application that should process the parquet file in the same way as the Go application. Even though it does not release as much memory as I would have expected, it does release some, as you can see in the image: [image]

I have also let the Go application run for a while to see whether any memory is released, but unfortunately memory usage stays at the same level.

I saw that another issue that could be related was recently opened against the duckdb repository, although in that case it was a Node.js-based application. I am adding it here for reference: https://github.com/duckdb/duckdb/issues/6607

goober commented 1 year ago

@marcboeker I have found some memory leaks that occur when the result is processed in chunks. See linked PR #83 for a fix.