The point is that GC malloc tries to allocate 140737304053504 bytes.
Looking at the stack trace:
#1 0x0000000000766515 in gc.cdgc.gc.calloc(ulong, uint, ulong*) (pm_bitmask=0x61, attrs=97, size=97) at /home/jenkins/docker/src/gc/cdgc/gc.d:1600
The file / line are wrong, but that function is here.
It passes size to malloc verbatim, so how come gdb shows:
gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73230, attrs=638, size=140737304053504) at /home/jenkins/docker/src/gc/cdgc/gc.d:1521
Either gdb is wrong and could make us look at the wrong place (wouldn't be surprised), memory is corrupted, or the generated code is wrong and we're hitting a special case.
Honestly, I wouldn't be surprised if that stack trace turned out to be wrong and the GC has nothing to do with it.
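For context, the forwarding being described looks schematically like this (a hypothetical stand-in, not the real gc.d code), which is why the two differing size arguments in the backtrace look suspicious:

```d
// Schematic stand-in for the calloc -> malloc forwarding described above.
// Hypothetical functions only; the real code is in gc/cdgc/gc.d at the
// lines referenced in the backtrace.
import tango.stdc.stdlib;   // malloc, as a stand-in for gc.cdgc.gc.malloc
import tango.stdc.string;   // memset

void* callocLike ( size_t size )
{
    // `size` is forwarded verbatim, so both frames of a backtrace through
    // here should show the same value -- seeing size=97 in the calloc frame
    // and size=140737304053504 in the malloc frame doesn't add up.
    auto p = malloc(size);
    if ( p !is null )
        memset(p, 0, size);
    return p;
}
```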
@gavin-norman-sociomantic what tango runtime version/build are you using when generating this stacktrace?
I ask because when running with a debug build of tango (thanks to @nemanja-boric-sociomantic), I get a different (but clearly similar) stacktrace:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73a30 <internal>, attrs=0, size=97) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:1520
1520 /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d: No such file or directory.
(gdb) bt
#0 gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73a30 <internal>, attrs=0, size=97) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:1520
#1 0x000000000076810f in gc.cdgc.gc.calloc(ulong, uint, ulong*) (pm_bitmask=0xa73a30 <internal>, attrs=0, size=97) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:1599
#2 0x0000000000763940 in gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).__dgliteral123()).locked().__dgliteral123() (this=0x7f8f798c5a10) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:2604
#3 0x00000000007638b4 in gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).__dgliteral123()).locked() (this=0x7f8f798c5a10) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:268
#4 0x0000000000763586 in gc_calloc (size=97, attrs=0, ptrmap=...) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:2599
#5 0x000000000075b3e6 in _d_arraysetlengthT (ti=0xa5a640 <TypeInfo_AS5swarm3neo8protocol6socket9uio_const11iovec_const.init$>, newlength=6, p=0x7f8f7eba42f0) at /home/nemanjaboric/work/tangort/src/rt/lifetime.d:756
#6 0x00000000006e1409 in swarm.neo.protocol.socket.MessageGenerator.IoVecMessage.setup(swarm.neo.protocol.Message.MessageType, void[][], void[][]...) (this=0x7f8f7eba42f0, static_fields=..., dynamic_fields=..., type=1 '\001')
at ./submodules/swarm/src/swarm/neo/protocol/socket/MessageGenerator.d:62
#7 0x0000000000692e5f in swarm.neo.protocol.socket.MessageSender.MessageSender.assign(swarm.neo.protocol.Message.MessageType, void[][], void[][]...) (this=0x7f8f7eba4200, static_fields=..., dynamic_fields=..., type=1 '\001')
at ./submodules/swarm/src/swarm/neo/protocol/socket/MessageSender.d:166
#8 0x000000000070e7e3 in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.sendRequestPayload(ulong).__dgliteral567(void[][]) (this=0x7f8f798c5dc0, payload=...) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:423
#9 0x00000000006dffb0 in swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.getPayloadForSending(void(void[][]) delegate) (this=0x7f8f7c534600, send=...) at ./submodules/swarm/src/swarm/neo/connection/RequestOnConnBase.d:1734
#10 0x0000000000751507 in swarm.neo.node.RequestSet.RequestSet.Request.getPayloadForSending(void(void[][]) delegate) (this=0x7f8f7c534600, send=...) at ./submodules/swarm/src/swarm/neo/node/RequestSet.d:148
#11 0x0000000000726fea in swarm.neo.node.Connection.Connection.getPayloadForSending(ulong, void(void[][]) delegate) (this=0x7f8f7c50bb00, send=..., id=11) at ./submodules/swarm/src/swarm/neo/node/Connection.d:216
#12 0x000000000070e6ee in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.sendRequestPayload(ulong) (this=0x7f8f7c50b600, id=11) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:419
#13 0x000000000070e6a2 in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.loop().__foreachbody1902(ref ulong) (this=0x7f8f798c5f08, __applyArg0=0x7f8f798c5e60) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:387
#14 0x00000000006e20d7 in swarm.neo.util.TreeQueue.TreeQueueCore.opApply(int(ref ulong) delegate) (this=0x7f8f7c50b660, dg=...) at ./submodules/swarm/src/swarm/neo/util/TreeQueue.d:462
#15 0x000000000070e5fb in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.loop() (this=0x7f8f7c50b600) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:386
#16 0x000000000070e55f in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.fiberMethod() (this=0x7f8f7c50b600) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:320
#17 0x000000000076cfe3 in core.thread.Fiber.run() (this=0x7f8f7c4f4280) at /home/nemanjaboric/work/tangort/src/core/thread.d:3200
#18 0x000000000076cf01 in fiber_entryPoint () at /home/nemanjaboric/work/tangort/src/core/thread.d:2489
#19 0x0000000000000000 in ?? ()
Starting and stopping a writer client while the Mirrors are running doesn't cause crashes.
> what tango runtime version/build are you using when generating this stacktrace?
I'm using the standard (release) runtime.
Testing this, I just got a different segfault:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000672e94 in dhtnode.request.neo.Mirror.MirrorImpl_v0.startIteration() (this=0x7fffe4803d78) at ./src/dhtnode/request/neo/Mirror.d:199
199 this.iterator.setStorage(this.channel);
(gdb) bt
#0 0x0000000000672e94 in dhtnode.request.neo.Mirror.MirrorImpl_v0.startIteration() (this=0x7fffe4803d78) at ./src/dhtnode/request/neo/Mirror.d:199
#1 0x0000000000689a36 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.refresh() (this=0x7fffe4803be0)
at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:996
#2 0x0000000000689856 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.fiberMethod() (this=0x7fffe4803be0)
at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:945
#3 0x000000000076b6ae in core.thread.Fiber.run() (this=0xa56540 <vtable for ocean.io.Stdout.TerminalOutput!(char).TerminalOutput>)
at /home/jenkins/docker/src/core/thread.d:3200
#4 0x000000000076b608 in fiber_entryPoint () at /home/jenkins/docker/src/core/thread.d:2489
#5 0x0000000000000000 in ?? ()
Not sure if this is related.
I managed to get this one as well (also in the GC, but not the same as the first one):
(gdb) bt
#0 0x00000000007696c9 in gc.cdgc.bits.GCBits.clear(ulong) (this=0x101,
i=140737264569472) at /home/jenkins/docker/src/gc/cdgc/bits.d:131
#1 0x00000000000005ed in ?? ()
#2 0x00007ffff50b0040 in ?? ()
#3 0x00000000002e345c in ?? ()
#4 0x0000000000765597 in gc.cdgc.gc.malloc(ulong, uint, ulong*) (
pm_bitmask=0xa71f90 <internal>, attrs=94626, size=28)
at /home/jenkins/docker/src/gc/cdgc/gc.d:1523
#5 0x0000000000761d0a in gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).__dgliteral121()).locked().__dgliteral121() (this=0x7ffff2a94d00) at /home/jenkins/docker/src/gc/cdgc/gc.d:2590
#6 0x0000000000761c5e in gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).__dgliteral121()).locked() (
this=0x7ffff2a94d00) at /home/jenkins/docker/src/gc/cdgc/gc.d:268
#7 0x0000000000761ae5 in gc_malloc (size=257, attrs=2, ptrmap=...)
at /home/jenkins/docker/src/gc/cdgc/gc.d:2585
#8 0x000000000075f382 in _d_arrayappendcTX (ti=0x20, x=0x266b752d74636572, n=1)
at /home/jenkins/docker/src/rt/lifetime.d:946
#9 0x00000000006898f9 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.Queue!(ulong).Queue.push(ulong) (this=0x7ffff2ac7da8, e=1495586554)
at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:98
#10 0x00000000006887a5 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.refreshed(ulong) (this=0x7ffff2ac7d78, key=1495586554)
at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:384
#11 0x00000000006894e2 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.refresh() (this=0x7ffff2ac7be0)
at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:998
#12 0x00000000006892c4 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.fiberMethod() (this=0x7ffff2ac7be0)
at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:930
#13 0x000000000076ac3e in core.thread.Fiber.run() (this=0x266b752d74636572)
at /home/jenkins/docker/src/core/thread.d:3200
#14 0x000000000076ab98 in fiber_entryPoint ()
at /home/jenkins/docker/src/core/thread.d:2489
#15 0x0000000000000000 in ?? ()
(gdb)
OK, this seems to get us a bit closer. I've put the scope (exit) Stderr.formatln("exiting refresh {}", this.outer.channel) here:
https://github.com/sociomantic-tsunami/dhtproto/blob/neo/src/dhtproto/node/neo/request/Mirror.d#L952
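For reference, the instrumentation is just a one-line scope (exit) statement at the top of the refresh fiber method. A minimal, self-contained sketch of the idea (hypothetical demo class and names; only the scope (exit) line corresponds to the real change, and the format string is reconstructed from the log output below):

```d
// Illustrative sketch only -- not the actual dhtproto source.
import ocean.io.Stdout;

class PeriodicRefreshDemo
{
    private char[] channel;

    this ( char[] channel ) { this.channel = channel; }

    // Stand-in for MirrorProtocol_v0.PeriodicRefresh.fiberMethod().
    private void fiberMethod ( )
    {
        // Runs whenever fiberMethod() returns or unwinds, so channels whose
        // refresh fiber never exits show up as missing lines in the log.
        scope ( exit )
            Stderr.formatln("Exiting Mirror.refresh fiber {}", this.channel);

        // ... the periodic refresh loop would run here ...
    }
}

void main ( )
{
    (new PeriodicRefreshDemo("test".dup)).fiberMethod();
}
```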
Now, start and stop the mirror requests repeatedly:
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
All fine. But after some time, I see this:
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
The segfault happens a few seconds later.
I've just noticed: all this time I had a heavy writer that wrote only to the test3 channel, and that's the channel that didn't get its refresh fiber killed.
I think it's definitely connected to the combination of regular updates + refresh updates. This time I had the client writing to the test2 and test3 channels on the node. Two restarts before the segfault, I got:
1503590083 Exiting Mirror.refresh fiber test
1503590083 Exiting Mirror.refresh fiber test1
1503590083 Exiting Mirror.refresh fiber test2
1503590083 Exiting Mirror.refresh fiber test3
1503590083 Exiting Mirror.refresh fiber test4
1503590083 Exiting Mirror.refresh fiber test5
1503590095 Exiting Mirror.refresh fiber test
1503590095 Exiting Mirror.refresh fiber test1
1503590095 Exiting Mirror.refresh fiber test4
1503590095 Exiting Mirror.refresh fiber test5
1503590097 Exiting Mirror.refresh fiber test
1503590097 Exiting Mirror.refresh fiber test1
1503590097 Exiting Mirror.refresh fiber test4
1503590097 Exiting Mirror.refresh fiber test5
1503590100 Exiting Mirror.refresh fiber test
1503590100 Exiting Mirror.refresh fiber test1
1503590100 Exiting Mirror.refresh fiber test4
1503590100 Exiting Mirror.refresh fiber test5
It's just these two channels for which the refresh fibers are not killed. (And it took just 5 seconds between closing the connection that left the lingering refresh fibers and the segfault, which is exactly the refresh period.)
> I think it's definitely connected to the combination of regular updates + refresh updates. This time I had the client writing to the test2 and test3 channels on the node. Two restarts before the segfault, I got:
This actually seems to be related not to whether the channels have heavy writers, but to whether the channels contain a lot of data, so that the actual refresh cycle takes longer.
> I think it's definitely connected to the combination of regular updates + refresh updates.
I just tried out an app build with the mirror refresh disabled: no sign of any segfaults so far. So I think you're on to something here.
Yes, I think I've managed to narrow it down to being connected only to a combination of a non-empty channel and periodic refresh. I no longer think regular updates are necessary for the segfault to be seen.
Running the test with a breakpoint on gc_collect, I got the segfault without a collection happening.
Printing out the addresses of the acquired timers, and logging whenever a new timer is newed, everything seems fine and unrelated to the segfault. (One timer is newed per mirrored channel; they are always acquired once each per restart.)
Not sure if it's relevant, but I just inspected the message being sent at the point of a segfault (at this line https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/protocol/socket/MessageGenerator.d#L62):

dynamic_fields is of length 4, with the following fields:
- [0x7] (this is the RecordRefreshed message code)
- [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]
- [0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0] (this is the length of the record value)
- "hello" (this is the record value)

static_fields is of length 1, with the following field:
- [0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]
Simply allocating the timer and not using it still produces the segfault.
Looks like it's a bug in AcquiredResources in swarm. Replacing https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/util/AcquiredResources.d#L83 with auto buffer = new void[capacity]; fixes the segfault.
Slightly narrowed down: changing https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/util/AcquiredResources.d#L93 to this.acquired = cast(T[][])(new void[(T[]).sizeof * 4]); prevents the segfaults.
We've made some progress. It seems that casting the acquired void[]*s to pointers to arrays of other types (here) is dodgy. Replacing the code of the Mirror request's queue to work directly with a void[]*, instead of casting it to the internal element type (i.e. E[]*), fixes the segfaults.
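To spell out why that cast pattern is suspect, here is a simplified, hypothetical sketch (not the actual swarm or dhtproto code): casting an array value in D rescales the length field to the new element size, but casting a pointer to a void[] into a pointer to E[] reinterprets the slice in place without any rescaling, so mutations made through the reinterpreted pointer can write back a length that the buffer's owner then misreads.

```d
// Hypothetical illustration of the risky pattern -- not the actual swarm code.
void main ( )
{
    // A pooled buffer is handed out as raw bytes; a void[]'s length counts bytes.
    void[] acquired = new void[64];

    // By-value cast: the length is rescaled (64 bytes -> 8 ulongs), so
    // converting to the element type only where the data is accessed is fine.
    ulong[] view = cast(ulong[]) acquired;

    // Pointer cast: the slice struct is reinterpreted in place, the length is
    // NOT rescaled, so through this pointer the buffer appears to hold
    // 64 ulongs (512 bytes) rather than 64 bytes.
    ulong[]* queue = cast(ulong[]*) &acquired;

    // Mutating through the reinterpreted pointer (e.g. appending, which is
    // what _d_arrayappendcTX is doing in the backtrace above) writes an
    // *element* count back into a length that the buffer's owner still treats
    // as a *byte* count, so the two views of the buffer now disagree.
    *queue ~= 42;
}
```

Working directly on the void[]*, as the fix does, avoids reinterpreting the slice in place at all.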
> Running the test with a breakpoint on gc_collect, I got the segfault without a collection happening.
gc_collect is just a handle to start a collection on demand. Uncommenting this line (or placing a breakpoint on fullcollect) shows collections happening:
https://github.com/sociomantic-tsunami/tangort/blob/v1.x.x/src/gc/cdgc/gc.d#L881
Calling GC.disable() at the beginning of the process makes the segfault go away.
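(For reference, that's a one-liner at program startup; a minimal sketch, assuming Tango's GC interface in tango.core.Memory:)

```d
// Minimal sketch: disable GC collections for the whole process, purely to
// check whether the crash depends on a collection running. Allocations still
// succeed; the heap simply grows and is never collected.
import tango.core.Memory;

void main ( )
{
    GC.disable();

    // ... start the node / run the test as usual ...
}
```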
This also indicates we're leaking memory somewhere, as there shouldn't be any collections. I guess it is in the MessageGenerator, since that's where we allocate memory according to these backtraces.
> This also indicates we're leaking memory somewhere, as there shouldn't be any collections. I guess it is in the MessageGenerator, since that's where we allocate memory according to these backtraces.
Applying https://github.com/sociomantic-tsunami/swarm/pull/118 makes collections very infrequent, which masks this issue.
Of course, eventually it will segfault. What I've noticed is that, before the segfault, AcquiredResources can't find free elements (they are probably corrupted), so it's always allocating more and more StorageEngineStepIterators, although there's a fixed number of simultaneous clients.
Fixed by submodule update in #40.
Steps to reproduce (dhtnode and dhtproto neo-alpha-3 + https://github.com/sociomantic-tsunami/swarm/pull/115 + https://github.com/sociomantic-tsunami/swarm/pull/117):
neotest multimirror test test2 test3 test4 test5