marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License

Memory usage expands unbounded, likely memory leak #111

Closed michaelmdresser closed 1 year ago

michaelmdresser commented 1 year ago

I believe there may be a memory leak in this library. We have a use-case that is similar to https://github.com/duckdb/duckdb/issues/8061 (that issue is from a fellow user of go-duckdb). The particularly relevant points from that issue are: "Golang's runtime is not holding this memory, according to the built in runtime/pprof package" and "closing the Golang DuckDB connection does not release the memory. Only ending the process does".

I have put together a reproduction using go-duckdb and a parallel implementation in a Python script using the official duckdb library, linked in gists below with accompanying run information.

The reporter of duckdb/duckdb#8061 is leaning towards a problem with Parquet inserts in particular, but I believe I have ruled that out with this minimal reproduction which only does basic DuckDB ops: CREATE TABLE, INSERT, CREATE SEQUENCE, and SELECT. Memory usage of the program using go-duckdb expands as queries are executed and DB connections are opened and closed. The usage of the Python program, on the other hand, remains static no matter how many queries are executed and database connections opened/closed.

Reproduction

If any part of these reproduction scripts appears incorrect or has a flawed methodology, please point it out! I realize that there could be some amount of fluctuation due to random string generation and the random() in the SELECT, but I think the results are compelling enough as-is without making it deterministic. When I added a fixed rand.Seed to the Go program, the results were completely comparable to those posted below.

Go

https://gist.github.com/michaelmdresser/bd00c8cb2856a6007fdab63280052d81
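
For context, the shape of the gist is roughly the following (a minimal sketch, not the exact gist code; the environment variables match the runs below):

package main

import (
	"database/sql"
	"log"
	"os"
	"strconv"

	_ "github.com/marcboeker/go-duckdb"
)

// runQueries performs only basic DuckDB operations: CREATE TABLE, CREATE
// SEQUENCE, INSERT, and a SELECT whose result is closed afterwards.
func runQueries(db *sql.DB) {
	mustExec := func(q string) {
		if _, err := db.Exec(q); err != nil {
			log.Fatal(err)
		}
	}
	mustExec(`CREATE TABLE IF NOT EXISTS test_data (v1 VARCHAR, v2 VARCHAR)`)
	mustExec(`CREATE SEQUENCE IF NOT EXISTS serial START 101`)
	mustExec(`INSERT INTO test_data VALUES ('some', 'strings')`)
	rows, err := db.Query(`SELECT CONCAT(v1, v2), random(), nextval('serial') FROM test_data`)
	if err != nil {
		log.Fatal(err)
	}
	rows.Close()
}

func main() {
	rotations, _ := strconv.Atoi(os.Getenv("NUM_ROTATIONS"))

	if os.Getenv("DB_OPEN_METHOD") == "once" {
		// "once": a single DB handle is opened and reused for every rotation.
		db, err := sql.Open("duckdb", "test.db")
		if err != nil {
			log.Fatal(err)
		}
		for i := 0; i < rotations; i++ {
			runQueries(db)
		}
		db.Close()
		return
	}

	// "repeat": the DB is opened and closed around every rotation.
	for i := 0; i < rotations; i++ {
		db, err := sql.Open("duckdb", "test.db")
		if err != nil {
			log.Fatal(err)
		}
		runQueries(db)
		db.Close()
	}
}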

Run results. Note how the max res set increases over time, both if the DB is closed in between queries ("repeat") and if it is kept open ("once"). This is with github.com/marcboeker/go-duckdb v1.4.3.

I included the "once" code path to try to start ruling out problems that are unique to the connection rotation idea. With that said, the main functionality I am interested in for our project is the "repeat" code path.

go build -o ./main main.go

DB_OPEN_METHOD="once" NUM_ROTATIONS="1" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:02.80 elapsed, 36536 KB max res set
DB_OPEN_METHOD="once" NUM_ROTATIONS="5" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:02.87 elapsed, 34604 KB max res set
DB_OPEN_METHOD="once" NUM_ROTATIONS="10" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:02.87 elapsed, 41292 KB max res set
DB_OPEN_METHOD="once" NUM_ROTATIONS="50" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:03.04 elapsed, 46348 KB max res set
DB_OPEN_METHOD="once" NUM_ROTATIONS="100" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:03.50 elapsed, 47688 KB max res set
DB_OPEN_METHOD="once" NUM_ROTATIONS="500" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:05.60 elapsed, 63212 KB max res set
DB_OPEN_METHOD="once" NUM_ROTATIONS="1000" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:07.82 elapsed, 57876 KB max res set

DB_OPEN_METHOD="repeat" NUM_ROTATIONS="1" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:02.80 elapsed, 31336 KB max res set
DB_OPEN_METHOD="repeat" NUM_ROTATIONS="5" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:02.90 elapsed, 39744 KB max res set
DB_OPEN_METHOD="repeat" NUM_ROTATIONS="10" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:03.07 elapsed, 50896 KB max res set
DB_OPEN_METHOD="repeat" NUM_ROTATIONS="50" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:03.62 elapsed, 62312 KB max res set
DB_OPEN_METHOD="repeat" NUM_ROTATIONS="100" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:04.49 elapsed, 64584 KB max res set
DB_OPEN_METHOD="repeat" NUM_ROTATIONS="500" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:10.82 elapsed, 74344 KB max res set
DB_OPEN_METHOD="repeat" NUM_ROTATIONS="1000" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main
0:19.39 elapsed, 81788 KB max res set

Python

This script is a parallel implementation of the above Go script, but locked to DB_OPEN_METHOD="repeat": https://gist.github.com/michaelmdresser/5df57b5c9c1e1e05800fa7cec950a288. Note how the max res set size remains the same.

NUM_ROTATIONS="1" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main.py
0:00.32 elapsed, 121468 KB max res set
NUM_ROTATIONS="5" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main.py
0:00.33 elapsed, 123688 KB max res set
NUM_ROTATIONS="10" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main.py
0:00.37 elapsed, 124100 KB max res set
NUM_ROTATIONS="50" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main.py
0:00.55 elapsed, 123252 KB max res set
NUM_ROTATIONS="100" /usr/bin/time -f '%E elapsed, %M KB max res set' ./main.py
0:00.76 elapsed, 126096 KB max res set

Our production use and a theory

When using go-duckdb in a larger application (not the reproduction above), we have noticed that memory usage appears to increase without bound, with particular steps when we "rotate" DB connections (close all connections, move DB files around, reopen connections). Here's an example graph:

[image: memory usage graph showing stepwise increases at each connection rotation]

In the most extreme case, I've seen the memory usage of the Go program using go-duckdb increase by almost exactly the value we set the MEMORY_LIMIT DuckDB config to. For the following screenshot of memory usage, look at the yellowish line in the middle. The memory limit was set to 200GB at the time, and we perform a connection "rotation" about every two hours, coinciding with the spikes between plateaus. Note how the plateau increases by about 200GB after each spike and doesn't seem to go below the new "floor" afterward. Eventually the Pod (this is running in Kubernetes) is OOM-killed.

[image: memory usage over time; the yellow line steps up by ~200GB at each rotation and never drops back]

I recently asked a question in the DuckDB repo about the memory limit parameter (https://github.com/duckdb/duckdb/issues/8398#issuecomment-1656348028), in which I learned about the "buffer pool memory" that the parameter applies to. At the moment, I'm wondering if this buffer pool is not being freed and that is what is causing the leak. This is no more than a theory, so take it with a grain of salt.
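
For reference, the limit itself is just a DuckDB setting; one way to apply it from Go looks roughly like this (a sketch with a placeholder path, not our production config; the 200GB value mirrors the limit mentioned above):

package main

import (
	"database/sql"
	"log"

	_ "github.com/marcboeker/go-duckdb"
)

func main() {
	// Placeholder database path.
	db, err := sql.Open("duckdb", "/data/my.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// memory_limit is the DuckDB setting that caps the buffer pool discussed above.
	if _, err := db.Exec(`SET memory_limit = '200GB'`); err != nil {
		log.Fatal(err)
	}
}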

marcboeker commented 1 year ago

Thanks @michaelmdresser for bringing this up. It could take some time until I can look at it in detail. I experienced something similar a while ago, and my explanation was that it could be Go's greedy memory block handling. See https://go101.org/article/memory-block.html

An unused memory block may not be released to the OS immediately after it is collected, so that it can be reused for some new value parts.

If you run this for a while, do you see the OS swapping?
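
One quick way to rule out that behavior in the repro would be to force Go to hand freed memory back between rotations; a minimal sketch (runtime/debug.FreeOSMemory runs a GC and returns as much memory to the OS as possible):

package main

import (
	"runtime/debug"
)

// releaseToOS forces a garbage collection and returns as much memory to the
// OS as possible. Calling it between rotations in the repro separates
// "Go is holding freed pages" from "memory is held outside the Go heap".
func releaseToOS() {
	debug.FreeOSMemory()
}

func main() {
	releaseToOS()
}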

michaelmdresser commented 1 year ago

Thanks for the speedy reply! I'll have a look at the memory block suggestion. While I'm doing that, some more information:

I've been following the leak-tracking approach outlined in https://kirshatrov.com/posts/finding-memory-leak-in-cgo: using memleak.py (from https://github.com/iovisor/bcc/tree/master/tools) on the repro program. I've added pauses (wait for user to hit enter) at the start of main() (so that memleak.py can be attached to the process) and before exiting main() (so that memleak.py can get a full picture of leaks).
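
The pauses are nothing fancy; the pattern looks roughly like this (a sketch, not the exact repro code; memleak.py is attached from another terminal with something like sudo ./memleak.py -p <pid>):

package main

import (
	"bufio"
	"fmt"
	"os"
)

// waitForEnter blocks until the user presses Enter, leaving time to attach
// or inspect memleak.py against this process.
func waitForEnter(msg string) {
	fmt.Printf("%s (pid %d) -- press Enter to continue\n", msg, os.Getpid())
	bufio.NewReader(os.Stdin).ReadString('\n')
}

func main() {
	waitForEnter("attach memleak.py now")
	// ... run the repro workload here ...
	waitForEnter("workload done; let memleak.py take a final snapshot")
}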

That gives the following picture at the end of the run:

[13:26:14] Top 10 stacks with outstanding allocations:
    48 bytes in 1 allocations from stack
        0x00007fbe234b089d  operator new(unsigned long)+0x1d [libstdc++.so.6.0.32]
    262144 bytes in 1 allocations from stack
        0x00000000006e852c  x_cgo_mmap+0xc [main]
        0x00000000005dedab  runtime.mmap.func1+0x4b [main]
        0x00000000005decf0  runtime.mmap+0x70 [main]
        0x00000000005f0ee5  runtime.sysAllocOS+0x25 [main]
        0x00000000005f0d35  runtime.sysAlloc+0x35 [main]
        0x00000000005e8445  runtime.persistentalloc1+0x105 [main]
        0x00000000005e8328  runtime.persistentalloc.func1+0x28 [main]
        0x00000000005e82bc  runtime.persistentalloc+0x5c [main]
        0x00000000005f3c37  runtime.(*fixalloc).alloc+0x77 [main]
        0x000000000060241f  runtime.(*mheap).allocMSpanLocked+0x9f [main]
        0x000000000060269e  runtime.(*mheap).allocSpan+0x21e [main]
        0x00000000006020bc  runtime.(*mheap).alloc.func1+0x5c [main]
        0x000000000063f04a  runtime.systemstack.abi0+0x4a [main]
        0x000000000060201b  runtime.(*mheap).alloc+0x5b [main]
        0x00000000005f0912  runtime.(*mcentral).grow+0x52 [main]
        0x00000000005f0786  runtime.(*mcentral).cacheSpan+0x306 [main]
        0x00000000005efed3  runtime.(*mcache).refill+0x153 [main]
        0x00000000005e74c5  runtime.(*mcache).nextFree+0x85 [main]
        0x00000000005e7a88  runtime.mallocgc+0x448 [main]
        0x00000000005e7fc5  runtime.newobject+0x25 [main]
        0x00000000006dae88  github.com/marcboeker/go-duckdb.(*connector).Connect+0x48 [main]
        0x0000000000692317  database/sql.(*DB).conn+0x717 [main]
        0x0000000000693117  database/sql.(*DB).exec+0x57 [main]
        0x000000000069304f  database/sql.(*DB).ExecContext.func1+0x4f [main]
        0x0000000000692e22  database/sql.(*DB).retry+0x42 [main]
        0x0000000000692f68  database/sql.(*DB).ExecContext+0xc8 [main]
        0x00000000006e58e8  main.buildDBData+0x48 [main]
        0x00000000006e6365  main.run+0x565 [main]
        0x00000000006e6cf6  main.main+0x1f6 [main]
        0x00000000006131bb  runtime.main+0x2bb [main]
        0x0000000000640fc1  runtime.goexit.abi0+0x1 [main]
    2013265920 bytes in 15 allocations from stack
        0x00007fbe23099864  [unknown] [libc.so.6]

Note that the large leaked chunk is under [unknown] [libc.so.6], which I think means it isn't coming from the DuckDB library itself (from what I've watched, DuckDB library allocations tend to show up under a [libstdc++.so.6.0.32] stack).

I'm now leaning away from this being somehow related to the memory pool and more towards this being related to some usage of libc in the Go library; perhaps there is some C.CString or similar not being freed.
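
For reference, the pattern I have in mind is cgo's requirement that C-allocated strings be freed explicitly; a generic sketch (not a claim about any specific go-duckdb call site):

package main

/*
#include <stdlib.h>
*/
import "C"

import "unsafe"

// passToC shows the pattern: C.CString allocates with C's malloc, so every
// call must be paired with C.free or the allocation leaks outside the Go
// heap -- which is exactly where runtime/pprof can't see it.
func passToC(query string) {
	cstr := C.CString(query)
	defer C.free(unsafe.Pointer(cstr))
	// ... hand cstr to a C function here ...
}

func main() {
	passToC("SELECT 1")
}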

michaelmdresser commented 1 year ago

Moving some pauses around and still using memleak.py from the BCC tools, I captured leak snapshots after run() starts, after the DB inserts, and after the insert DB is closed, but before the repeated queries start. Even then, with just INSERTs, there appears to be a leak. No Go DB handle is open at the time of the snapshot below.

[13:41:41] Top 10 stacks with outstanding allocations:
    48 bytes in 1 allocations from stack
        0x00007f4851cb089d  operator new(unsigned long)+0x1d [libstdc++.so.6.0.32]
    1216 bytes in 4 allocations from stack
        0x00007f485201c813  [unknown] [ld-linux-x86-64.so.2]
    1006632960 bytes in 15 allocations from stack
        0x00007f48518997d1  [unknown] [libc.so.6]

michaelmdresser commented 1 year ago

I just learned about the -a flag for memleak.py, which shows individual allocations in addition to the summary for each stack. I'm really intrigued by a couple of things: when I run this test with various numbers of queries, I always see "<> bytes in 15 allocations from stack" (the 15 seems constant), and each allocation in each stack (according to -a) appears to be the same size.

[14:57:33] Top 10 stacks with outstanding allocations:
    addr = 7f3250001080 size = 48
    addr = 7f32501090a0 size = 304
    addr = 7f3250109270 size = 304
    addr = 7f3250108d00 size = 304
    addr = 7f3250108ed0 size = 304
    addr = 7f323efe6000 size = 8392704
    addr = 7f323e7e5000 size = 8392704
    addr = 7f323dfe4000 size = 8392704
    addr = 7f323f7e7000 size = 8392704
    addr = 7f3224000000 size = 67108864
    addr = 7f322c000000 size = 67108864
    addr = 7f3204000000 size = 67108864
    addr = 7f31f4000000 size = 67108864
    addr = 7f31c4000000 size = 67108864
    addr = 7f31dc000000 size = 67108864
    addr = 7f31cc000000 size = 67108864
    addr = 7f31ec000000 size = 67108864
    addr = 7f320c000000 size = 67108864
    addr = 7f321c000000 size = 67108864
    addr = 7f31fc000000 size = 67108864
    addr = 7f31d4000000 size = 67108864
    addr = 7f31e4000000 size = 67108864
    addr = 7f3214000000 size = 67108864
    addr = 7f3234000000 size = 67108864
    addr = 7f3230000000 size = 134217728
    addr = 7f3228000000 size = 134217728
    addr = 7f3200000000 size = 134217728
    addr = 7f3208000000 size = 134217728
    addr = 7f31f0000000 size = 134217728
    addr = 7f3218000000 size = 134217728
    addr = 7f31c0000000 size = 134217728
    addr = 7f31d0000000 size = 134217728
    addr = 7f3210000000 size = 134217728
    addr = 7f31f8000000 size = 134217728
    addr = 7f3220000000 size = 134217728
    addr = 7f31e0000000 size = 134217728
    addr = 7f31d8000000 size = 134217728
    addr = 7f31e8000000 size = 134217728
    addr = 7f31c8000000 size = 134217728
    48 bytes in 1 allocations from stack
        0x00007f32b02b089d  operator new(unsigned long)+0x1d [libstdc++.so.6.0.32]
    1216 bytes in 4 allocations from stack
        0x00007f32b04de813  [unknown] [ld-linux-x86-64.so.2]
    33570816 bytes in 4 allocations from stack
        0x00007f32afe8d479  pthread_create+0x8e9 [libc.so.6]
    1006632960 bytes in 15 allocations from stack
        0x00007f32afe997d1  [unknown] [libc.so.6]
    2013265920 bytes in 15 allocations from stack
        0x00007f32afe99864  [unknown] [libc.so.6]

begelundmuller commented 1 year ago

Hello @michaelmdresser, I read through your discoveries with interest – we have seen some similar memory leaks in production usage. I haven't had time to dig deeper, but I was wondering if you investigated whether the leaks could be happening in the DuckDB C API (instead of in the go-duckdb CGO code), which IIRC is not used by the Python bindings?

michaelmdresser commented 1 year ago

@begelundmuller I got pretty deep in this rabbit hole, then had to jump out of it to work on something else; this issue needs an update.

A summary: I am now of the belief that the DuckDB C API is leaking. I've seen this in two places: an ATTACH without a DETACH and usage of read_parquet. In both situations, a database close call (C API) should have freed all memory used by DuckDB but did not. We had both situations in our codebase: fixing each independently did not resolve leak behavior, but fixing both did.


For our production situation, I spent an enormous amount of time trying to figure out if the leak was being caused by go-duckdb's CGO. My conclusion was no. As long as you correctly call .Close() in the appropriate places, go-duckdb appears to be calling DuckDB's C API correctly (all opens of DBs are matched by closes, and all opens of connections are matched by closes).
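
Concretely, by "the appropriate places" I mean pairing every handle with a close, roughly like this (a sketch of the pattern, not our production code):

package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/marcboeker/go-duckdb"
)

func main() {
	ctx := context.Background()

	db, err := sql.Open("duckdb", "prod.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close() // last close: lets DuckDB destroy the database object

	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	rows, err := conn.QueryContext(ctx, "SELECT 42")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close() // unclosed rows keep the underlying connection checked out

	for rows.Next() {
		var v int
		if err := rows.Scan(&v); err != nil {
			log.Fatal(err)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}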

What I did find were two interesting things.

First is that we were using an ATTACH statement in some DuckDB queries but were missing its paired DETACH. I believe there is a leak here because, while this was user error, I would expect a full database "close" call (DuckDB C API) to free the memory associated with an ATTACH even if we had forgotten to DETACH at some other point. Our observations suggest that this is not the case. My intention is to try to repro this theory with an official DuckDB library which uses the C API, perhaps the Julia library, but I haven't had the time.
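
To illustrate the pairing (a sketch; the file names and the "other" alias are placeholders, not our real schema):

package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/marcboeker/go-duckdb"
)

// rotateWithAttach pairs every ATTACH with a DETACH before the handles go away.
func rotateWithAttach(ctx context.Context, db *sql.DB) error {
	if _, err := db.ExecContext(ctx, "ATTACH 'other.db' AS other"); err != nil {
		return err
	}
	// This is the statement we were missing: without it, memory associated with
	// the attached catalog appeared to stick around even after the database
	// itself was closed.
	defer db.ExecContext(ctx, "DETACH other")

	_, err := db.ExecContext(ctx, "CREATE OR REPLACE TABLE other.t AS SELECT 42 AS v")
	return err
}

func main() {
	db, err := sql.Open("duckdb", "main.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := rotateWithAttach(context.Background(), db); err != nil {
		log.Fatal(err)
	}
}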

Second is that usage of read_parquet appears to leak memory. As mentioned at the top of this issue, I was initially sent down this rabbit hole by https://github.com/duckdb/duckdb/issues/8061 because it was a go-duckdb user using read_parquet experiencing a memory leak like we were. We eventually settled on the same workaround (shelling out to the DuckDB CLI for read_parquet calls) because it works for us, but it's clearly not the best answer to the problem. In my casual monitoring of the DuckDB repo's issues, I've noticed another two issues related to memory usage and the Parquet extension: https://github.com/duckdb/duckdb/issues/8623 and https://github.com/duckdb/duckdb/issues/8537.
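
The workaround looks roughly like this from Go (a sketch; the paths and query are placeholders, and it assumes the duckdb CLI binary is on PATH):

package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Run the read_parquet query in a short-lived DuckDB CLI process so any
	// memory it holds is released when the process exits. The long-running Go
	// process then queries work.db without ever calling read_parquet itself.
	query := "CREATE OR REPLACE TABLE t AS SELECT * FROM read_parquet('in.parquet');"
	cmd := exec.Command("duckdb", "work.db", query)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Fatalf("duckdb CLI failed: %v\n%s", err, out)
	}
	fmt.Printf("%s", out)
}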


@marcboeker Thank you for the excellent Go library. I'm nearing 100% certainty that the core issue I'm trying to address is in DuckDB, not here. While the behavior I outlined when attempting a minimal repro for this issue is interesting, I don't think it's the behavior I'm ultimately trying to resolve. Would you like me to close this issue?

niger-prequel commented 1 year ago

@michaelmdresser I think you may be right. I have also had to put this issue on the back burner due to other work priorities, but we saw similar issues when we tried to leverage the ODBC driver instead of using this driver. I appreciate your diligence on this; at least it's nice to know someone else is experiencing this.

marcboeker commented 1 year ago

@michaelmdresser Thanks for the in-depth analysis of the memory leak and for finding the potential cause. I'm curious to see what the DuckDB developers think about this.

Thank you for the excellent Go library.

Thanks, but this is a community effort by a lot of contributors πŸ™‚

I'm closing this for now. Feel free to reopen if you think go-duckdb might be the cause.

goober commented 1 year ago

Thanks for the thorough investigation. We are seeing the same issue in our environment, especially with queries like COPY (SELECT * from read_csv('data/*.csv.zst')) TO 'data.parquet' (FORMAT parquet). We have also come to the conclusion that go-duckdb seems to do what it should with regard to releasing memory.

@michaelmdresser have you opened a specific issue towards the duckdb repo to track this issue further?

Mytherin commented 1 year ago

Thanks for the thorough investigation. I'm still not sure if the problem here is in DuckDB itself or in the Go driver. I tried to recreate the Go example you linked in the C API and found no memory leaks. The complete script is below - let me know if I am missing anything critical here.

One aspect to note is that DuckDB's internal objects reference each other, and the database is only closed/destroyed when all objects are closed/destroyed. That is to say, if you leave a connection, prepared statement or appender open, the connection/prepared statement/appender will keep a reference to the database object and the database will not be destroyed. I suspect that what is happening here is that one of these types of objects is not correctly destroyed in the Go API causing the database not to be destroyed and the memory to leak.

As an example: in the C script below, if I comment out duckdb_destroy_prepare, the memory usage looks like this, and there is a clear leak because the database objects are not cleaned up:

00s: 27908 bytes
05s: 33756 bytes
10s: 40248 bytes
15s: 46092 bytes
20s: 53152 bytes
25s: 58436 bytes
30s: 65620 bytes
35s: 72044 bytes
40s: 77512 bytes
45s: 84188 bytes
50s: 89216 bytes
55s: 96444 bytes
60s: 101720 bytes

If I'm missing anything or there is a leak that can be reproduced using the C API feel free to let me know and we will be happy to take a look.

C API Memory Usage Script

#include "duckdb.h"
#include "stdio.h"
#include "string.h"

void REQUIRE(bool input) {
    if (!input) {
        printf("FAILED\n");
        exit(1);
    }
}

const char letterBytes[] = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const idx_t letterCount = sizeof(letterBytes) - 1; /* exclude the trailing '\0' */

char *generate_random_string(char *target, int n) {
    for(idx_t i = 0; i < n; i++) {
        target[i] = letterBytes[rand() % letterCount];
    }
    target[n] = '\0';
    return target;
}

int main() {
    const char *file_path = "db_test.db";

    remove(file_path);
    while(true) {
        duckdb_database database;
        duckdb_connection connection;
        REQUIRE(duckdb_open(file_path, &database) == DuckDBSuccess);
        REQUIRE(duckdb_connect(database, &connection) == DuckDBSuccess);

        duckdb_result result;
        REQUIRE(duckdb_query(connection, "CREATE TABLE test_data (v1 VARCHAR, v2 VARCHAR)", &result) == DuckDBSuccess);
        duckdb_destroy_result(&result);

        REQUIRE(duckdb_query(connection, "CREATE SEQUENCE IF NOT EXISTS serial START 101", &result) == DuckDBSuccess);
        duckdb_destroy_result(&result);     

        duckdb_prepared_statement prepared;
        REQUIRE(duckdb_prepare(connection, "INSERT INTO test_data (v1, v2) VALUES (?, ?)", &prepared) == DuckDBSuccess);

        for(idx_t i = 0; i < 1000; i++) {
            char string_buffer[200];
            REQUIRE(duckdb_bind_varchar(prepared, 1, generate_random_string(string_buffer, 100)) == DuckDBSuccess);
            REQUIRE(duckdb_bind_varchar(prepared, 2, generate_random_string(string_buffer, 150)) == DuckDBSuccess);

            REQUIRE(duckdb_execute_prepared(prepared, &result) == DuckDBSuccess);
            duckdb_destroy_result(&result);
        }
        duckdb_destroy_prepare(&prepared);

        REQUIRE(duckdb_query(connection, "SELECT CONCAT(v1, v2), random(), nextval('serial') FROM test_data", &result) == DuckDBSuccess);
        duckdb_destroy_result(&result);

        duckdb_disconnect(&connection);
        duckdb_close(&database);

        remove(file_path);
    }
}
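
(For anyone reproducing: assuming the script is saved as test.c and libduckdb is installed where the compiler can find it, it can be built with something like cc test.c -o a.out -lduckdb and run as ./a.out; the Python monitor below looks the process up by the a.out name.)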

I then run the following Python script to measure the memory usage of this program:

import subprocess
import time

# Find the pid of the running C program (compiled as a.out).
proc = subprocess.Popen(['pgrep', 'a.out'], stdout=subprocess.PIPE)
res = proc.stdout.read().decode('utf8').strip()

process_id = int(res)

pid = str(process_id)

for i in range(100000):
    time.sleep(1)
    # `ps -o rss=` reports the resident set size in kilobytes, even though the
    # label printed below says bytes.
    proc = subprocess.Popen(f'ps -o rss= -p {pid}'.split(' '), stdout=subprocess.PIPE)
    res = proc.stdout.read().decode('utf8').strip()
    if i % 5 == 0:
        print('%03ds: %s bytes' % (i, res))

This gives me the following memory readings:

000s: 27976 bytes
005s: 28960 bytes
010s: 29344 bytes
015s: 28948 bytes
020s: 28676 bytes
025s: 28292 bytes
030s: 29076 bytes
035s: 29208 bytes
040s: 29252 bytes
045s: 28536 bytes
050s: 29388 bytes
055s: 28636 bytes
060s: 29156 bytes

jaredzhou commented 1 year ago

Is this problem solved? I have the same problem with go-duckdb: just querying some data from Parquet, the memory usage is huge and is never released. The Rust client does not have this problem, but I need to use Flight to communicate.

taniabogatsch commented 10 months ago

Hi - I found this issue, which might be related to a memory leak caused by not freeing C.duckdb_struct_type_child_name(ty, j). I filed a PR here.
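
The shape of the fix is roughly the following (a cgo sketch of the idea, assuming the DuckDB C API convention that strings it returns are released with duckdb_free, and that the headers/library are available to cgo; the actual change is in the linked PR):

package main

/*
#cgo LDFLAGS: -lduckdb
#include <duckdb.h>
*/
import "C"

import (
	"fmt"
	"unsafe"
)

// structChildName shows the pattern: the char* returned by
// duckdb_struct_type_child_name is allocated by DuckDB and must be released
// with duckdb_free, otherwise every STRUCT scan leaks one small C allocation.
func structChildName(ty C.duckdb_logical_type, j C.idx_t) string {
	name := C.duckdb_struct_type_child_name(ty, j)
	defer C.duckdb_free(unsafe.Pointer(name))
	return C.GoString(name)
}

func main() {
	fmt.Println("structChildName demonstrates the free pattern")
}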

EDIT: I don't see duckdb's STRUCT type in this issue, but I'll try to run the leak profiler on https://gist.github.com/michaelmdresser/bd00c8cb2856a6007fdab63280052d81 and have a look.

michaelmdresser commented 10 months ago

Amazing! @taniabogatsch, for what it's worth, I'm not convinced that the original gist I produced as a repro (the one you linked) actually reproduces the problem we were seeing.

EDIT: And at the time of creating this issue, the "in the wild" memory graph came from an environment that had STRUCT columns in the database. I'm not sure that we ever scanned those STRUCTs -- we had to abandon STRUCTs a while ago due to an unrelated "hang" issue.

michaelmdresser commented 10 months ago

One aspect to note is that DuckDB's internal objects reference each other, and the database is only closed/destroyed when all objects are closed/destroyed. That is to say, if you leave a connection, prepared statement or appender open, the connection/prepared statement/appender will keep a reference to the database object and the database will not be destroyed. I suspect that what is happening here is that one of these types of objects is not correctly destroyed in the Go API causing the database not to be destroyed and the memory to leak.

@Mytherin thank you, this description clarified a lot for me. We'll be pursuing a deeper investigation along those lines soon if Tania's PR doesn't resolve the leak for us. Thank you for your work on DuckDB and for being so responsive in a 3rd party library!

taniabogatsch commented 10 months ago

@michaelmdresser, if your original environment had STRUCTs, then it's possible that the fix is related. πŸ‘

If you still see a memory leak and manage to reproduce it with a Go program, then please ping me here in this issue or in another issue.

jimyag commented 4 months ago

If you still see a memory leak and manage to reproduce it with a Go program, then please ping me here in this issue or in another issue.

Hi @taniabogatsch, I found that executing any SQL leaves a portion of memory that cannot be reclaimed.

//go:build cgo
// +build cgo

package main

import (
    "context"
    "database/sql"
    "fmt"
    "log"
    "os"
    "time"

    _ "github.com/marcboeker/go-duckdb"
)

func main() {
    for i := 0; i < 1000; i++ {
        duckdb, err := sql.Open("duckdb", "")
        if err != nil {
            log.Panicln(err)
        }

        conn, err := duckdb.Conn(context.Background())
        if err != nil {
            log.Panicln(err)
        }

        raw, err := conn.QueryContext(context.Background(), "INSTALL parquet;LOAD parquet;INSTALL httpfs;LOAD httpfs;INSTALL json;LOAD json;")
        if err != nil {
            log.Panicln(err)
        }
        if err = raw.Close(); err != nil {
            log.Panicln(err)
        }

        // json to parquet
        raw, err = conn.QueryContext(context.Background(), fmt.Sprintf("COPY ( SELECT * FROM read_json_auto('%s',union_by_name = true)) TO '%s' (FORMAT 'parquet', CODEC 'zstd');", os.Args[1], fmt.Sprintf("%s_%d.parquet", os.Args[1], i)))
        if err != nil {
            log.Panicln(err)
        }
        if err = raw.Close(); err != nil {
            log.Panicln(err)
        }
        if err = conn.Close(); err != nil {
            log.Panicln(err)
        }
        if err = duckdb.Close(); err != nil {
            log.Panicln(err)
        }
        log.Printf("finish %d", i)
        time.Sleep(time.Second)
    }
    select {}
}

memleak.py log (excerpt)

              0x00000000008b9c77      main.main+0x2f7 [bson2parquet]
                0x00000000006b575d      runtime.main+0x29d [bson2parquet]
                0x00000000006e6741      runtime.goexit.abi0+0x1 [bson2parquet]
        8396800 bytes in 2 allocations from stack
                0x0000000001c0490d      duckdb_jemalloc::pages_map(void*, unsigned long, unsigned long, bool*)+0x5d [bson2parquet]
        10481664 bytes in 1 allocations from stack
                0x0000000001c04a01      duckdb_jemalloc::pages_map(void*, unsigned long, unsigned long, bool*)+0x151 [bson2parquet]
        50356224 bytes in 6 allocations from stack
                0x00007f2199e95707      pthread_create@GLIBC_2.2.5+0xac7 [libc.so.6]
        1744830464 bytes in 13 allocations from stack
                0x00007f2199ea1c82      alloc_new_heap+0xc2 [libc.so.6]
        2483027968 bytes in 37 allocations from stack
                0x00007f2199ea1c3d      alloc_new_heap+0x7d [libc.so.6]

taniabogatsch commented 4 months ago

Hi @jimyag, thanks for the reproduction. I'll look into it as soon as possible.