Closed argosphil closed 8 months ago
Here's the valgrind output:
==25862== Conditional jump or move depends on uninitialised value(s)
==25862== at 0x936A047: us_loop_run_bun_tick (packages/bun-usockets/src/eventing/epoll_kqueue.c:186)
==25862== by 0x6A558C4: src.deps.uws.PosixLoop.tick (uws.zig:1019)
==25862== by 0x6E0F1D5: src.bun.js.event_loop.EventLoop.autoTick (event_loop.zig:1226)
==25862== by 0x730E504: src.bun.js.javascript.VirtualMachine.loadEntryPoint (javascript.zig:2294)
==25862== by 0x6DA94BF: src.bun_js.Run.start (bun_js.zig:307)
==25862== by 0x68E7252: src.bun.js.javascript.OpaqueWrap__anon_48093__struct_86064.callback (javascript.zig:105)
==25862== by 0x8CB88CB: JSC__VM__holdAPILock (src/bun.js/bindings/bindings.cpp:4614)
==25862== by 0x64DAB05: cppFn (shimmer.zig:186)
==25862== by 0x64DAB05: src.bun.js.bindings.bindings.VM.holdAPILock (bindings.zig:5400)
==25862== by 0x683B2AB: src.bun_js.Run.boot (bun_js.zig:278)
==25862== by 0x683EAB3: src.cli.run_command.RunCommand.exec__anon_69168 (run_command.zig:1230)
==25862== by 0x685B175: src.cli.Command.start (cli.zig:1631)
==25862== by 0x64130D2: src.cli.Cli.start__anon_5318 (cli.zig:57)
Obtained by running this program:
Bun.listen({
fd: 0,
socket: {
data(socket, data) { console.log(data) },
},
})
Bun.serve({
async fetch() {
await new Promise(r => r);
}
})
with valgrind --error-limit=no ./build-valgrind/bun-debug run serve.js
This seems fixed by #9191 .
Please file a bug in https://github.com/uNetworking/uSockets
I don't have any valgrind issues in by branch:
valgrind ./EchoServer ==4308== Memcheck, a memory error detector ==4308== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al. ==4308== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info ==4308== Command: ./EchoServer ==4308== Listening on port 9001
This is most likely a divergence, for instance us_loop_run_bun_tick does not even exist upstream
several places in usockets which assume malloc()ed memory is cleared (all pointers set to NULL). This assumption is incorrect.
There is no such assumption. All fields that are relied upon to branch should be explicitly initialized. Setting everything to 0 does not guarantee proper behavior, so swapping malloc for calloc as a quick fix is not a good idea. Better to find out the real issue, but like posted above, I have no such issue in my branch, so can't help with that
Please file a bug in https://github.com/uNetworking/uSockets
Will do. At first glance, it looks like the bug is present there as well.
Please file a bug in https://github.com/uNetworking/uSockets
Will do. At first glance, it looks like the bug is present there as well.
At second glance, the only potential buglet I still see is a harmless memory overallocation in us_internal_create_child_ssl_socket_context
, which appears to account for the us_socket_context_t
size twice. Do you want me to confirm and report that?
I said:
There are several places in usockets which assume malloc()ed memory is cleared (all pointers set to NULL).
What I should have written was:
There are several places in bun's usocket-related code which assume malloc()ed memory is cleared (all pointers set to NULL).
This mistake on my part led to @uNetworkingAB asking me to report a bug in usockets, but at the time, I was unaware of any such bug.
Please accept my apologies for not being clearer about who's at fault here.
As for whether partially-initialized structures with such undocumented rules as "context->on_long_timeout is ignored when socket->long_timeout is 255 for all sockets with us_socket_context(socket) == context, which is ensured because the user can be relied upon never to call us_socket_long_timeout(socket) without a preceding us_socket_context_on_timeout(context)" are a good idea, well, they're not. Among other reasons, it's worth keeping in mind that unconditional full-cacheline memory writes are cheaper than partial-cacheline writes, which may need to retrieve the irrelevant uninitialized data from outer caches.
At second glance, the only potential buglet I still see is a harmless memory overallocation in us_internal_create_child_ssl_socket_context, which appears to account for the us_socket_context_t size twice. Do you want me to confirm and report that?
Thanks for checking twice, you can report any such issues if you like to.
Please accept my apologies for not being clearer about who's at fault here.
Ah, that was never meant as a blame game, just informing that upstream should not have any such issues since we do a lot of testing for these issues and bun has diverged a bit since.
because the user can be relied upon never to call us_socket_long_timeout(socket) without a preceding us_socket_context_on_timeout(context)" are a good idea, well, they're not.
I don't follow the point here? You must have a registered us_socket_context_on_timeout event handler before you set any timers.
What version of Bun is running?
No response
What platform is your computer?
No response
What steps can reproduce the bug?
Run
bun-profile
in a debugger and set a breakpoint onus_internal_timer_sweep
. Open a socket.What is the expected behavior?
us_internal_timer_sweep
is called consistently.What do you see instead?
us_internal_timer_sweep
is called sometimesAdditional information
There are several places in usockets which assume malloc()ed memory is cleared (all pointers set to NULL). This assumption is incorrect.
In this specific case, the problem manifests itself in
epoll_kqueue.c:us_loop_run_bun_tick
, whereus_loop_integrate
is called only iftimer_callback->cb
is NULL.timer_callback->cb
is in fact uninitialized.A very similar bug leads to an app crashing after 16 minutes. (Discussion on Discord).
My proposal is to change
us_malloc
to callcalloc
until these issues are fixed upstream. This seems to fix the 16-minute crash.