karlseguin / pg.zig

Native PostgreSQL driver / client for Zig
MIT License

Bus error on read - investigating #19

Closed by zigster64 4 weeks ago

zigster64 commented 1 month ago

Hit an interesting bug with a particular query... it generates a bus error, so the stack trace is pretty short and not very useful yet.

I need to dive in deeper to debug, so can't post much more than that yet, sorry.

Looks like it's in the stdlib reader, which then reaches for an allocation and blows up somewhere from there.

It's 100% reproducible for a particular query, so that helps. Seeing it on a Mac M2, will try other setups as well.

Could be pg.zig (because it's in the reader), or it could be http.zig (or even my app code).

Couple more days to track this one down, I think.

zigster64 commented 1 month ago

OK, it's in pg.zig, but the actual bug might be in stdlib.

I'm using 0.13.0 official, without any mods. macOS arm64.

The problem occurs in ReaderT.read(), reader.zig line 188:

    new_buf = try allocator.alloc(u8, message_length);

I'm using the GeneralPurposeAllocator, and the above looks perfectly reasonable to me. It generates a bus error (an alignment issue with the allocator??).

At the point where it's crashing, the following data is set:

- pos = 4096 (i.e. the size of the default read buffer)
- start = 623
- avail = 3473 (looks legit - that's 4096 - start)
- it does the readInt(u32) and gets a value of 9214 -> buffer too small to fit
- it gets down to line 185; is_static = true, so it doesn't try to realloc the buffer
- it calls lib.metrics.allocReader(9214) without issue

Line 188 then attempts to alloc 9214 bytes from self.allocator and never returns from that call - bus error at that point.

Tried turning on verbose logging in the GPA, and that doesn't report the allocation, so I suspect it's dying at the align cast in the alloc call, which is the first thing it tries to do.
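
For reference, that logging is just a config flag on the GPA (minimal sketch; verbose_log is the std GeneralPurposeAllocator config field):

    const std = @import("std");

    pub fn main() !void {
        // verbose_log makes the GPA log every alloc/resize/free it performs,
        // which is how I checked that the failing 9214-byte alloc never shows up.
        var gpa = std.heap.GeneralPurposeAllocator(.{ .verbose_log = true }){};
        defer _ = gpa.deinit();
        const allocator = gpa.allocator();

        const buf = try allocator.alloc(u8, 9214);
        defer allocator.free(buf);
    }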

The data in question is an iterator([]const u8) field with at least 7000 bytes of data in it, encoding a few dozen strings (i.e. 9214 bytes once you add the framing overhead of the pg protocol?).

Workaround for now - I just set the read buffer size to 64k in the pg connection options, up from 4k, and the problem goes away ... or rather gets swept under the carpet until a bigger dataset comes along :)
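
For reference, the workaround amounts to something like this when setting up the pool (rough sketch only - the exact name/placement of the read buffer option is an assumption here, so check pg.zig's connection options):

    const std = @import("std");
    const pg = @import("pg");

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();

        // Bump the connection read buffer from the default 4k to 64k so the
        // 9214-byte message fits in the static buffer.
        // NOTE: `read_buffer` and its placement under `connect` are assumptions,
        // not verified against pg.zig's actual options struct.
        const pool = try pg.Pool.init(gpa.allocator(), .{
            .size = 5,
            .connect = .{
                .host = "127.0.0.1",
                .port = 5432,
                .read_buffer = 64 * 1024, // assumed option name, up from the 4k default
            },
            .auth = .{ .username = "postgres", .database = "postgres" },
        });
        defer pool.deinit();
    }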

zigster64 commented 1 month ago

… just to complicate things, the same GPA allocator is shared by logz and http.zig in my app.

I'll work on a pg test case that reproduces the fault without the noise of everything else using that allocator.

karlseguin commented 1 month ago

There's definitely a double-free. Easy to reproduce. I guess there are no tests covering allocations beyond read_buffer, which is very sad. Not sure if that's the issue you're seeing though, since that one manifests itself very clearly when result or row deinit is called.

    var c = t.connect(.{});
    defer c.deinit();

    var row = (try c.row("select $1::text", .{"a" ** 5000})).?;
    defer row.deinit();

    try t.expectString("a" ** 5000, row.get([]u8, 0));

Not fixed yet, but should be soon.

karlseguin commented 1 month ago

There were 2 serious (and separate) issues. I couldn't get the same type of error as you, but I do believe fixing the arena will fix what you were getting.

zigster64 commented 1 month ago

Good detective work!

I think the problem is related to default_allocator getting out of whack with allocator in the reader object.

If I use default_allocator here, it doesn't crash anymore:

                            if (is_static or !allocator.resize(buf, message_length)) {
                                // Either we were using our static buffer or resizing failed
                                lib.metrics.allocReader(message_length);
                                std.debug.print("alloc {} using allocator {*} - {*}\n", .{ message_length, allocator.ptr, self.default_allocator.ptr });
                                new_buf = try self.default_allocator.alloc(u8, message_length);  // Default allocator
                                @memcpy(new_buf[0..current_length], buf[start..pos]);

Because is_static is true, and default_allocator owns the static buffer.

At this point in the code, the values of allocator and default_allocator are also not the same (which is unexpected, because I'm not using a custom allocator on the query??).

zigster64 commented 1 month ago

OK, I have run it using your latest patch (with the arena fix)

... exact same bus error :(

I applied the hacky change to use self.default_allocator on top of your patch, and ... no bus error.

Documentation/comments are good, but I'm still digesting the existence of 2 allocators in the reader and how they get swapped around. I'm struggling to understand why they have drifted apart into 2 different values by the time of the crash.

karlseguin commented 1 month ago

What's the main entry point? conn.query, conn.row, pool.query, pool.row?

These functions internally create a Stmt, which creates an arena and calls reader.startFlow. Assuming the stmt is properly prepared and values bound, the ownership/responsibility for the arena and calling endFlow gets passed to the result.
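
Very roughly, the ownership works like this (a simplified illustration of the flow just described, not the actual pg.zig code):

    const std = @import("std");

    // Simplified model: the statement starts a "flow", which creates an arena on
    // the reader; once the query is executing, the Result takes over and is
    // responsible for ending the flow (releasing the arena) in its deinit.
    const Reader = struct {
        allocator: std.mem.Allocator, // base allocator, owns the static buffer
        arena: ?*std.heap.ArenaAllocator = null,

        fn startFlow(self: *Reader) !void {
            const arena = try self.allocator.create(std.heap.ArenaAllocator);
            arena.* = std.heap.ArenaAllocator.init(self.allocator);
            self.arena = arena;
        }

        fn endFlow(self: *Reader) void {
            if (self.arena) |arena| {
                arena.deinit();
                self.allocator.destroy(arena);
                self.arena = null;
            }
        }
    };

    const Result = struct {
        reader: *Reader,

        fn deinit(self: *Result) void {
            // the result, not the statement, ends the flow
            self.reader.endFlow();
        }
    };

    pub fn main() !void {
        var gpa = std.heap.GeneralPurposeAllocator(.{}){};
        defer _ = gpa.deinit();

        var reader = Reader{ .allocator = gpa.allocator() };
        try reader.startFlow(); // done by the Stmt when the query begins

        var result = Result{ .reader = &reader };
        defer result.deinit(); // ownership of the flow passed to the result
    }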

You don't have a reproducible case I can use by any chance?

zigster64 commented 1 month ago

conn.query -> result.next

The general code structure is like this:

const DataList = struct {
    const sql_statement =
        \\ SELECT all the things ...
    ;

    conn: ?*pg.Conn = null,
    result: ?pg.Result = null,

    pub const Row = struct {
        ... a big struct of fields mapped to the query columns
    };

    pub fn query(db_pool: *pg.Pool, params: SomeOtherStruct) !DataList {
        var conn = try db_pool.acquire();

        const res = conn.query(sql_statement, .{params}) catch |err| {
            if (conn.err) |pge| {
                logz.err()
                    .src(@src())
                    .string("error", pge.message)
                    .log();
            }
            return err;
        };
        return .{
            .conn = conn,
            .result = res,
        };
    }

    pub fn deinit(self: *DataList) void {
        if (self.result) |result| result.deinit();
        if (self.conn) |conn| conn.release();
    }

    pub fn next(self: *DataList) !?DataList.Row {
        if (self.result) |*result| {
            if (try result.next()) |row| {
                return .{
                    ... each field = row.get(type, i)
                    .some_iterator = row.iterator([]const u8, i),   // the iterator that has too much data !
                   ....
                };
            }
        }
        return null;
    }
};
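
Called roughly like this (sketch based on the names above, with db_pool and params in scope):

    var list = try DataList.query(db_pool, params);
    defer list.deinit();

    while (try list.next()) |row| {
        _ = row; // ... consume the mapped fields / iterators here
    }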

gtg - I will put some time in this weekend to create a test case for it (which we will need anyway once a fix is in)

zigster64 commented 1 month ago

I can see the logic now - why reader.zig line 188 is the way it is.

self.allocator is the arena created by startFlow. That makes sense.

So the dynamic buffer lives on the arena, and the static buffer on the base allocator.

Note, though, that reader.deinit() frees the dynamic buffer with the base allocator, not the arena. Could this be corrupting the base allocator then?
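
i.e. the concern is the classic mismatched alloc/free (a sketch of the hazard, not the pg.zig code):

    const std = @import("std");

    // A buffer allocated from the arena must be released via the arena.
    // Freeing it through an allocator that didn't allocate it is undefined
    // behaviour and can corrupt that allocator's bookkeeping.
    fn demo(base_allocator: std.mem.Allocator) !void {
        var arena = std.heap.ArenaAllocator.init(base_allocator);
        defer arena.deinit(); // correct: the arena releases everything it handed out

        const dyn = try arena.allocator().alloc(u8, 9214);
        _ = dyn;

        // WRONG (the pattern being questioned above):
        // base_allocator.free(dyn);
    }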

This also explains why creating the new self.buf on the base allocator on line 188 removes the bus error. Not suggesting that's a correct solution - just that it explains the behaviour I'm seeing.

Also - I think line 188 should allocate message_length + current_length, as it needs to add message_length new bytes to the current buffer.

I think line 222 ends up reading short of the full message because there is not enough room, and then mistakenly reads the next bit as a message length. Not sure yet.

zigster64 commented 1 month ago

OK, I'm pretty confident that the latest PR fixes the problem.

It’s a pity we can’t name that alternate allocator “reader.arena” .. because it’s not always an arena :)

karlseguin commented 1 month ago

I don't think we need the message_length + current_length. At the point this is called, we don't care about any past data, we just care about the next message, so we only need message_length + 1 (+1 for the type byte, which isn't included in the length).
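
For context, a PostgreSQL backend message is framed as a 1-byte type followed by a 4-byte big-endian length, where the length counts itself and the payload but not the type byte - hence the + 1:

    const std = @import("std");

    // Required buffer size for one complete message: the 4-byte length field
    // already covers itself and the payload, so we only add the type byte.
    fn requiredBufferSize(header: *const [5]u8) u32 {
        const message_length = std.mem.readInt(u32, header[1..5], .big);
        return message_length + 1; // +1 for the type byte
    }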

zigster64 commented 1 month ago

Yep, I took that out before. Works fine now

zigster64 commented 4 weeks ago

Can close this now if you like - the memory handling has had a number of improvements since this was brought up. It's also easier to reason about now - thanks!

I have re-written my app code to use the following pattern, because all the pg queries are triggered by http handlers:

http handler -> acquire connection -> run query, using res.arena as the allocator -> deinit result -> release connection

No problems! It's probably a little quicker as well - anecdotally, some of my larger queries seem to be a few ms faster.
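
Roughly this shape (a sketch of that pattern - the queryOpts/.allocator way of handing res.arena to the query is an assumption here, so check pg.zig's query options for the exact spelling):

    const pg = @import("pg");
    const httpz = @import("httpz");

    // Sketch of the handler pattern described above: acquire -> query with the
    // response arena -> deinit result -> release connection (via defers).
    fn handler(pool: *pg.Pool, _: *httpz.Request, res: *httpz.Response) !void {
        const conn = try pool.acquire();
        defer conn.release();

        var result = try conn.queryOpts(
            "select id, name from things",
            .{},
            .{ .allocator = res.arena }, // assumed option: row data lives in the response arena
        );
        defer result.deinit();

        while (try result.next()) |row| {
            _ = row; // ... write each row to the response body
        }
    }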

karlseguin commented 4 weeks ago

If you're tracking httpz master, the thread_pool.buffer_size config is the single most important option if you're using req.arena or res.arena. It defaults to 8192, and the total memory usage will be thread_pool.count * thread_pool.buffer_size, which makes it super nice because count is typically quite small (i.e. 2-16), so you can have a very large buffer_size.
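
i.e. something like this (sketch; field names as described above - adjust to the current httpz master config):

    const httpz = @import("httpz");

    // Total arena-backed buffer memory is thread_pool.count * thread_pool.buffer_size,
    // so a small thread count allows a generous per-request buffer.
    const config = httpz.Config{
        .port = 5882,
        .thread_pool = .{
            .count = 8,
            .buffer_size = 256 * 1024, // 8 * 256KB = 2MB total backing req.arena / res.arena
        },
    };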