oven-sh / bun


Segmentation fault in `TCPSocketClass__finalize` #13512

Open bmteller opened 2 weeks ago

bmteller commented 2 weeks ago

How can we reproduce the crash?

I'm going to try to generate a minimal reproduction but don't have one at the moment.

Relevant log output

The address always seems to be low:

panic(main thread): Segmentation fault at address 0x6A62
panic(main thread): Segmentation fault at address 0x222B1
panic(main thread): Segmentation fault at address 0x5317
panic(main thread): Segmentation fault at address 0x342

Stack Trace (bun.report)

Bun v1.1.24 (85a3299) on linux aarch64 [RunCommand]

Segmentation fault at address 0x00005345

github-actions[bot] commented 2 weeks ago

@bmteller, the latest version of Bun is v1.1.25, but this crash was reported on Bun v1.1.24.

Are you able to reproduce this crash on the latest version of Bun?

bun upgrade
Jarred-Sumner commented 2 weeks ago

Pretty sure this crash was fixed in Bun v1.1.25

Please leave a comment if you still run into it

bmteller commented 2 weeks ago

had the same issue with 1.1.26:

Bun v1.1.26 (0a37423) on linux aarch64 [RunCommand]

Segmentation fault at address 0x00000A72

Features: jsc, Bun.stdin, fetch, http_server, spawn, tsconfig_paths, tsconfig

bmteller commented 2 weeks ago

The finalize code in src/bun.js/api/bun/socket.zig looks a bit suspicious.

If you trace the path where finalize is called while the socket is still open, it looks like there is a read-after-free. Maybe I'm reading the code incorrectly, though, because that would mean any kind of socket leak where the GC cleans up the socket instead of it being explicitly closed would cause a crash, and this crash doesn't look that common.

    // finalize():
    this.flags.finalizing = true;
    if (!this.socket.isClosed()) {
        this.closeAndDetach(.failure);

    // closeAndDetach():
    const socket = this.socket;
    this.socket.detach();
    socket.close(code);

    // onClose(), invoked from close():
    this.socket.detach();
    defer this.deref();
    defer this.markInactive();

    if (this.flags.finalizing) {
        return;
    }

    // markInactive(): ignored in this analysis.

    // deref():
    //   deinit() (if the refcount is zero?):
    //     this.destroy()
    //     (I assume memory should not be accessed after this point.)

    // back in finalize():
    this.deref();
    // accesses memory that is now undefined
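To make the suspected ordering concrete, here is a minimal TypeScript model of the sequence traced above, assuming onClose runs synchronously from close() and that deref() destroys the object once the refcount reaches zero. The class and field names are illustrative only; this is a model of the reasoning, not Bun's actual code.

    // Illustration only: models the suspected ordering, not Bun's real refcounting.
    class SocketModel {
      refCount = 1;
      finalizing = false;
      closed = false;
      destroyed = false;

      deref() {
        if (this.destroyed) throw new Error("deref() after destroy(): read-after-free");
        this.refCount -= 1;
        if (this.refCount === 0) this.destroyed = true; // destroy(): memory is freed here
      }

      onClose() {
        // onClose detaches, then derefs and marks inactive via defers.
        this.deref(); // refcount 1 -> 0, object destroyed
      }

      close() {
        this.closed = true;
        this.onClose(); // assumed to fire synchronously in this model
      }

      finalize() {
        this.finalizing = true;
        if (!this.closed) this.close(); // closeAndDetach(.failure)
        this.deref(); // second deref on an already-destroyed object
      }
    }

    new SocketModel().finalize(); // throws: deref() after destroy()

In this model, the second deref() in finalize() runs after destroy(), which is the read-after-free the trace points at.
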
bmteller commented 2 weeks ago

In the title I said it might be related to Bun.serve, but it could also be related to Bun.fetch. I think it's hitting the happy eyeballs code path, because it's trying to fetch() a localhost URL and in the container localhost resolves to both ::1 and 127.0.0.1.
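A rough sketch of that access pattern, hammering fetch() against a local server while forcing GC so TCPSocket finalizers run. This is not a confirmed reproduction; the happy-eyeballs path additionally assumes localhost resolves to both ::1 and 127.0.0.1, as it does in the container. The loop count is arbitrary.

    // Not a confirmed repro: repeatedly fetch a localhost URL and force GC so
    // TCPSocket finalizers run between requests.
    const server = Bun.serve({
      port: 0, // pick any free port
      fetch() {
        return new Response("ok");
      },
    });

    for (let i = 0; i < 1_000; i++) {
      const res = await fetch(`http://localhost:${server.port}/`);
      await res.arrayBuffer();
      if (i % 100 === 0) Bun.gc(true); // encourage finalizers to run
    }

    server.stop();
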

Jarred-Sumner commented 2 weeks ago

It has to be related to TCP sockets in JS, which usually means a database connection of some kind.

And yeah, happy eyeballs is plausible - that it's another edge case involving connection errors with TCP.
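One hedged way to poke at that hypothesis: open raw TCP connections that fail to connect, then force GC so finalizers run on the errored sockets. Port 1 is assumed to have no listener; nothing here is confirmed to trigger the crash.

    // Probe for the "connection error + GC finalizer" hypothesis. Assumes
    // nothing is listening on 127.0.0.1:1, so every connection attempt fails.
    for (let i = 0; i < 1_000; i++) {
      try {
        await Bun.connect({
          hostname: "127.0.0.1",
          port: 1,
          socket: {
            data() {},
            connectError() {}, // connection-failure path
          },
        });
      } catch {
        // Bun.connect rejects when the connection cannot be established
      }
      if (i % 100 === 0) Bun.gc(true); // let TCPSocket finalizers run periodically
    }
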

bmteller commented 2 weeks ago

So this was originally happening in production, and it was not easy to reproduce locally. I have now reproduced it locally, but I'm not sure I'll be able to create a minimal reproduction.

The Bun app receives 'real' HTTP requests, which involve making fetch requests to another app and interacting with a complicated node library that uses a pipe to communicate with another process. Similar to Node.js, I think the pipe is not really a pipe but a Unix domain socket. The Bun app also receives 'health check' HTTP requests from a load balancer. Locally, I was able to reproduce it by continually making the load balancer health requests every 5 seconds in the background while also making a number of 'real' requests that were spaced out; I made 4 requests in total, and the last one, which caused the segmentation fault, came about 70 minutes after the previous request. A sketch of this traffic pattern is below.
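A compressed sketch of that traffic pattern; APP_URL is a placeholder and the gaps are far shorter than the ~70 minutes seen in the real setup.

    // Traffic-pattern sketch only: APP_URL is a placeholder endpoint.
    const APP_URL = "http://localhost:3000";

    // Background health checks, like the load balancer, every 5 seconds.
    const healthCheck = setInterval(() => {
      fetch(`${APP_URL}/health`).catch(() => {});
    }, 5_000);

    // A handful of widely spaced "real" requests.
    for (let i = 0; i < 4; i++) {
      await Bun.sleep(60_000); // much longer gaps in the real setup
      try {
        const res = await fetch(`${APP_URL}/work`);
        await res.arrayBuffer();
      } catch {}
    }

    clearInterval(healthCheck);
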

Running locally, the hosts are IPv4-only, so there is no happy eyeballs. It also rules out some other weirdness: in production there is no internet access, so if something were crashing because it could not connect(), that would not happen locally.

We are also using brotliDecompress, so potentially that could be an issue as well if it's buggy, since it could be trashing the heap. But I'd guess most of the edge cases in a brotli decompression implementation would show up when streaming the decompression, and we just decompress a single Buffer.
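For reference, the usage is the one-shot form rather than streaming, roughly along these lines (a sketch of the pattern, not the actual app code):

    // Sketch of one-shot brotli decompression of a single Buffer (no streaming),
    // using node:zlib's callback API wrapped with promisify.
    import { brotliCompressSync, brotliDecompress } from "node:zlib";
    import { promisify } from "node:util";

    const brotliDecompressAsync = promisify(brotliDecompress);

    const compressed = brotliCompressSync(Buffer.from("hello brotli"));
    const decompressed = await brotliDecompressAsync(compressed);
    console.log(decompressed.toString()); // "hello brotli"
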

We are not able to share the Docker container that reproduces this because it contains a lot of proprietary code. But if you have any tips on creating a debug build of Bun in order to trace the issue, I can try to trace it myself.

cirospaciari commented 2 weeks ago

Can you share what libraries/packages you are using?
