oven-sh / bun

Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
https://bun.sh

bun-usockets behaves unpredictably: uninitialized memory used #9181

Closed argosphil closed 8 months ago

argosphil commented 8 months ago

What version of Bun is running?

No response

What platform is your computer?

No response

What steps can reproduce the bug?

Run bun-profile in a debugger and set a breakpoint on us_internal_timer_sweep. Open a socket.

What is the expected behavior?

us_internal_timer_sweep is called consistently.

What do you see instead?

us_internal_timer_sweep is called only sometimes.

Additional information

There are several places in usockets which assume malloc()ed memory is cleared (all pointers set to NULL). This assumption is incorrect.

In this specific case, the problem manifests itself in epoll_kqueue.c:us_loop_run_bun_tick, where us_loop_integrate is called only if timer_callback->cb is NULL. timer_callback->cb is in fact uninitialized.

A very similar bug leads to an app crashing after 16 minutes. (Discussion on Discord).

My proposal is to change us_malloc to call calloc until these issues are fixed upstream. This seems to fix the 16-minute crash.

argosphil commented 8 months ago

Here's the valgrind output:

==25862== Conditional jump or move depends on uninitialised value(s)
==25862==    at 0x936A047: us_loop_run_bun_tick (packages/bun-usockets/src/eventing/epoll_kqueue.c:186)
==25862==    by 0x6A558C4: src.deps.uws.PosixLoop.tick (uws.zig:1019)
==25862==    by 0x6E0F1D5: src.bun.js.event_loop.EventLoop.autoTick (event_loop.zig:1226)
==25862==    by 0x730E504: src.bun.js.javascript.VirtualMachine.loadEntryPoint (javascript.zig:2294)
==25862==    by 0x6DA94BF: src.bun_js.Run.start (bun_js.zig:307)
==25862==    by 0x68E7252: src.bun.js.javascript.OpaqueWrap__anon_48093__struct_86064.callback (javascript.zig:105)
==25862==    by 0x8CB88CB: JSC__VM__holdAPILock (src/bun.js/bindings/bindings.cpp:4614)
==25862==    by 0x64DAB05: cppFn (shimmer.zig:186)
==25862==    by 0x64DAB05: src.bun.js.bindings.bindings.VM.holdAPILock (bindings.zig:5400)
==25862==    by 0x683B2AB: src.bun_js.Run.boot (bun_js.zig:278)
==25862==    by 0x683EAB3: src.cli.run_command.RunCommand.exec__anon_69168 (run_command.zig:1230)
==25862==    by 0x685B175: src.cli.Command.start (cli.zig:1631)
==25862==    by 0x64130D2: src.cli.Cli.start__anon_5318 (cli.zig:57)

Obtained by running this program:

Bun.listen({
    fd: 0,
    socket: {
        data(socket, data) { console.log(data) },
    },
})

Bun.serve({
    async fetch() {
        await new Promise(r => r);
    }
})

with valgrind --error-limit=no ./build-valgrind/bun-debug run serve.js

argosphil commented 8 months ago

This seems fixed by #9191 .

uNetworkingAB commented 8 months ago

Please file a bug in https://github.com/uNetworking/uSockets

uNetworkingAB commented 8 months ago

I don't have any valgrind issues in my branch:

valgrind ./EchoServer
==4308== Memcheck, a memory error detector
==4308== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==4308== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==4308== Command: ./EchoServer
==4308==
Listening on port 9001

This is most likely a divergence; for instance, us_loop_run_bun_tick does not even exist upstream.

uNetworkingAB commented 8 months ago

several places in usockets which assume malloc()ed memory is cleared (all pointers set to NULL). This assumption is incorrect.

There is no such assumption. All fields that are relied upon to branch should be explicitly initialized. Setting everything to 0 does not guarantee proper behavior, so swapping malloc for calloc as a quick fix is not a good idea. It would be better to find the real issue, but as posted above, I have no such issue in my branch, so I can't help with that.

argosphil commented 8 months ago

Please file a bug in https://github.com/uNetworking/uSockets

Will do. At first glance, it looks like the bug is present there as well.

At second glance, the only potential buglet I still see is a harmless memory overallocation in us_internal_create_child_ssl_socket_context, which appears to account for the us_socket_context_t size twice. Do you want me to confirm and report that?

I said:

There are several places in usockets which assume malloc()ed memory is cleared (all pointers set to NULL).

What I should have written was:

There are several places in bun's usocket-related code which assume malloc()ed memory is cleared (all pointers set to NULL).

This mistake on my part led to @uNetworkingAB asking me to report a bug in usockets, but at the time, I was unaware of any such bug.

Please accept my apologies for not being clearer about who's at fault here.

As for whether partially-initialized structures with undocumented rules like "context->on_long_timeout is ignored when socket->long_timeout is 255 for all sockets with us_socket_context(socket) == context, which is ensured because the user can be relied upon never to call us_socket_long_timeout(socket) without a preceding us_socket_context_on_timeout(context)" are a good idea: they're not. Among other reasons, unconditional full-cacheline memory writes are cheaper than partial-cacheline writes, which may need to fetch the irrelevant uninitialized data from outer caches.

uNetworkingAB commented 8 months ago

At second glance, the only potential buglet I still see is a harmless memory overallocation in us_internal_create_child_ssl_socket_context, which appears to account for the us_socket_context_t size twice. Do you want me to confirm and report that?

Thanks for checking twice; you can report any such issues if you like.

Please accept my apologies for not being clearer about who's at fault here.

Ah, that was never meant as a blame game; I was just noting that upstream should not have any such issues, since we do a lot of testing for them and bun has diverged a bit.

because the user can be relied upon never to call us_socket_long_timeout(socket) without a preceding us_socket_context_on_timeout(context)" are a good idea, well, they're not.

I don't follow the point here. You must register a us_socket_context_on_timeout event handler before you set any timers.