karlseguin / pg.zig

Native PostgreSQL driver / client for Zig
MIT License
214 stars 16 forks

Free dynamic buffer #22

Closed: zigster64 closed this 3 months ago

zigster64 commented 3 months ago

This little change makes more sense.

When there is a dynamic buffer, it's always created on the arena, not the base allocator.

So on deinit, it should also be freed from the arena instead of the base allocator.
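
The idea, as a minimal sketch (field names are taken from this thread; the actual pg.zig code may differ):

```zig
// Hypothetical sketch of the fix, not pg.zig's actual deinit.
// If buf was grown during a flow it lives on the arena, so it must
// be freed by the arena-backed allocator, not the base allocator.
pub fn deinit(self: *Reader) void {
    if (self.static.ptr != self.buf.ptr) {
        // dynamic buffer: owned by self.allocator (the arena)
        self.allocator.free(self.buf);
    }
    // static buffer: always owned by the base allocator
    self.default_allocator.free(self.static);
}
```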

Tested this on:

In all cases, it no longer generates a bus error. Without this patch, I get crashes in different places: a bus error on ARM, and crashes in ptrCasts inside the arena code on x86.

I'd say that's what the problem was: freeing from the wrong allocator was corrupting the allocator and causing an early exit. The debug statements I wasn't seeing were probably sitting unflushed in the stderr buffer or something?

karlseguin commented 3 months ago

It still isn't clear to me, and I'm still struggling with it.

Assuming a normal execution of startFlow() -> anything -> endFlow(), I feel like this PR doesn't change anything (which is fine), because endFlow() should have freed the dynamic buffer using the correct allocator. So self.static.ptr != self.buf.ptr should never be true, and that deallocation should never happen. Furthermore, endFlow() sets self.allocator = self.default_allocator, so whether you use self.allocator or self.default_allocator in deinit doesn't matter.

So I'm thinking that this is all happening in some error flow, where startFlow() is called but endFlow() isn't: startFlow() -> anything -> conn.deinit(), conn.deinit() being the only thing that calls reader.deinit(). In this context, I can see how your PR makes sense. But I don't understand how this manifested itself as a bus error when reading row data. I would expect it to crash on conn.release(), which would detect that the connection is in an invalid/error state and, rather than return it to the pool, would call deinit(). I'd also expect some other code (like conn.queries() or rows.next()) to return an error.

I'm going to try reproducing this case, which should be doable. My concern here is that self.allocator might no longer be valid by the time client.deinit() is called. Not sure.

karlseguin commented 3 months ago

FWIW, there are 2 cases where a dynamic buffer can be created with the default_allocator. But I'm still thinking that the normal flow (startFlow -> * -> endFlow) is OK with them, with or without this PR.

The first is simple. When exec is called with no parameters to bind, if an allocator isn't passed to execOpts, a Stmt isn't created and an arena isn't created, so the default_allocator is used. Since there's only ever one allocator in this case, I think it's fine.

The more complicated case is that when we have an arena allocator and allocate a dynamic buffer, we might over-read data into that buffer, and that data might not belong to this "startFlow" -> "endFlow" sequence. The bulk of endFlow deals with this case. In the simple case, we can copy that over-read back into static. But if our over-read is larger than static, we need to copy it into another dynamic buffer owned by default_allocator. I think your PR is fine here because this only ever happens when endFlow is called, and I still think that if endFlow is called, things are in an OK state.
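
Roughly, as a hedged sketch of that over-read handling (names like `extra` are illustrative, not the library's):

```zig
// Hypothetical sketch of endFlow's over-read handling described above.
// `extra` is data read past the current flow that belongs to the next one.
if (extra.len <= self.static.len) {
    // simple case: the over-read fits back into the static buffer
    @memcpy(self.static[0..extra.len], extra);
} else {
    // over-read is larger than static: move it to a new dynamic
    // buffer owned by default_allocator, since the arena is going away
    const dyn = try self.default_allocator.alloc(u8, extra.len);
    @memcpy(dyn, extra);
    self.buf = dyn;
}
```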

zigster64 commented 3 months ago

I share your puzzlement.

If there is an over-read, the new static is allocated on the arena (original line 188); reader deinit was then freeing the new static from the base allocator instead of the arena, if static had grown. I'm pretty sure that would corrupt the original allocator.

When my app starts up, it performs several connect -> read -> close cycles before it gets to the killer statement.

I also suspect that once an allocator is in a trashed state, any stack traces or buffered log output may be an unreliable gauge of what's going on.

Will keep probing, because it feels like there must be at least one other spot where it's causing the issue; hopefully it's in my app code, not the lib code. I need to sleep on it, I think.

karlseguin commented 3 months ago

I improved the fuzz tester (but still couldn't reproduce it, so I guess I didn't improve it enough).

And I simplified the dynamic allocation by using realloc instead of resize. realloc essentially does what the old code did: it handles both the success and failure cases of resize.
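
For illustration, a sketch of that difference with Zig's std.mem.Allocator interface (not the library's exact code):

```zig
// Before: resize only tries to grow in place and returns false on
// failure, so the caller must handle the fallback itself.
if (allocator.resize(buf, new_len)) {
    buf = buf.ptr[0..new_len];
} else {
    const bigger = try allocator.alloc(u8, new_len);
    @memcpy(bigger[0..buf.len], buf);
    allocator.free(buf);
    buf = bigger;
}

// After: realloc handles both cases internally (resize in place when
// possible, otherwise allocate, copy, and free the old buffer).
buf = try allocator.realloc(buf, new_len);
```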

I think I know what's going on...mostly because I think there's only 1 possibility.

result.deinit and stmt.deinit both ignore any errors returned by endFlow. Something like:

self._conn._reader.endFlow() catch {
  self._conn._state = .fail;
};

If you're still able to reproduce (by reverting your patch), can you remove that catch {} and see if you get a different error (hopefully in endFlow)? If it does fail, (a) we should see if we can fix that failure, and (b) I believe we can tighten endFlow to free the buffer even in error cases (by moving the free higher, with a defer).

karlseguin commented 3 months ago

I guess don't remove the catch, since that would require a lot of changes downstream (deinit would need to return an error). Maybe @panic in the catch, or find some other way to see if I'm right that endFlow is failing (and what that failure is).
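
One way to surface the error without changing deinit's signature, as a sketch of the @panic idea:

```zig
// Hypothetical diagnostic, per the suggestion above: panic on the
// endFlow error instead of silently swallowing it, so the failure
// (and its error value) shows up without changing deinit's signature.
self._conn._reader.endFlow() catch |err| {
    std.debug.panic("endFlow failed: {}", .{err});
};
```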