sociomantic-tsunami / dhtnode

Distributed hash-table node
Boost Software License 1.0

Segfault in GC with Mirror #15

Closed: gavin-norman-sociomantic closed this issue 7 years ago

gavin-norman-sociomantic commented 7 years ago

Steps to reproduce (dhtnode and dhtproto neo-alpha-3 + https://github.com/sociomantic-tsunami/swarm/pull/115 + https://github.com/sociomantic-tsunami/swarm/pull/117):

  1. Start a dhtnode with data in 5 channels.
  2. Start a client mirroring the channels (e.g. neotest multimirror test test2 test3 test4 test5).
  3. Stop and start the mirroring client.
  4. A segfault occurs here:
    Program received signal SIGSEGV, Segmentation fault.
    gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73230 <internal>, attrs=638, size=140737304053504) at /home/jenkins/docker/src/gc/cdgc/gc.d:1521
    1521    /home/jenkins/docker/src/gc/cdgc/gc.d: No such file or directory.
    (gdb) bt
    #0  gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73230 <internal>, attrs=638, size=140737304053504) at /home/jenkins/docker/src/gc/cdgc/gc.d:1521
    #1  0x0000000000766515 in gc.cdgc.gc.calloc(ulong, uint, ulong*) (pm_bitmask=0x61, attrs=97, size=97) at /home/jenkins/docker/src/gc/cdgc/gc.d:1600
    #2  0x00000000007629b2 in gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).__dgliteral122()).locked().__dgliteral122() (this=0x7fffee1a2a30) at /home/jenkins/docker/src/gc/cdgc/gc.d:2604
    #3  0x0000000000762906 in gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).__dgliteral122()).locked() (this=0x7fffee1a2a30) at /home/jenkins/docker/src/gc/cdgc/gc.d:268
    #4  0x0000000000762581 in gc_calloc (size=97, attrs=0, ptrmap=...) at /home/jenkins/docker/src/gc/cdgc/gc.d:2599
    #5  0x000000000075f939 in _d_arraysetlengthT (ti=0xa588c0 <TypeInfo_AS5swarm3neo8protocol6socket9uio_const11iovec_const.init$>, newlength=6, p=0x7ffff7eea8f0)
    at /home/jenkins/docker/src/rt/lifetime.d:756
    #6  0x00000000006e184d in swarm.neo.protocol.socket.MessageGenerator.IoVecMessage.setup(swarm.neo.protocol.Message.MessageType, void[][], void[][]...) (
    this=0x7ffff7eea8f0, static_fields=..., dynamic_fields=..., type=1 '\001') at ./submodules/swarm/src/swarm/neo/protocol/socket/MessageGenerator.d:62
    #7  0x000000000069325b in swarm.neo.protocol.socket.MessageSender.MessageSender.assign(swarm.neo.protocol.Message.MessageType, void[][], void[][]...) (
    this=0x7ffff7eea800, static_fields=..., dynamic_fields=..., type=1 '\001') at ./submodules/swarm/src/swarm/neo/protocol/socket/MessageSender.d:166
    #8  0x000000000070eebf in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.sendRequestPayload(ulong).__dgliteral567(void[][]) (
    this=0x7fffee1a2dd0, payload=...) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:423
    #9  0x00000000006e03f4 in swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.getPayloadForSending(void(void[][]) delegate) (this=0x7ffff4fdca00, 
    send=...) at ./submodules/swarm/src/swarm/neo/connection/RequestOnConnBase.d:1734
    #10 0x0000000000751eb3 in swarm.neo.node.RequestSet.RequestSet.Request.getPayloadForSending(void(void[][]) delegate) (this=0x7ffff4fdca00, send=...)
    at ./submodules/swarm/src/swarm/neo/node/RequestSet.d:148
    #11 0x0000000000727996 in swarm.neo.node.Connection.Connection.getPayloadForSending(ulong, void(void[][]) delegate) (this=0x7ffff4f84f00, send=..., id=2)
    at ./submodules/swarm/src/swarm/neo/node/Connection.d:216
    #12 0x000000000070edca in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.sendRequestPayload(ulong) (this=0x7ffff4f84a00, id=2)
    at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:419
    #13 0x000000000070ed7e in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.loop().__foreachbody1902(ref ulong) (this=0x7fffee1a2f18, 
    __applyArg0=0x7fffee1a2e70) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:387
    #14 0x00000000006e251b in swarm.neo.util.TreeQueue.TreeQueueCore.opApply(int(ref ulong) delegate) (this=0x7ffff4f84a60, dg=...)
    at ./submodules/swarm/src/swarm/neo/util/TreeQueue.d:462
    #15 0x000000000070ecd7 in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.loop() (this=0x7ffff4f84a00)
    at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:386
    #16 0x000000000070eaec in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.fiberMethod() (this=0x7ffff4f84a00)
    at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:320
    #17 0x000000000076b6ae in core.thread.Fiber.run() (this=0x5ed) at /home/jenkins/docker/src/core/thread.d:3200
    #18 0x000000000076b608 in fiber_entryPoint () at /home/jenkins/docker/src/core/thread.d:2489
    #19 0x0000000000000000 in ?? ()
gavin-norman-sociomantic commented 7 years ago

The point is that GC malloc tries to allocate 140737304053504 bytes.

mathias-lang-sociomantic commented 7 years ago

Looking at the stack trace:

#1  0x0000000000766515 in gc.cdgc.gc.calloc(ulong, uint, ulong*) (pm_bitmask=0x61, attrs=97, size=97) at /home/jenkins/docker/src/gc/cdgc/gc.d:1600

The file / line are wrong, but that function is here. It passes size to malloc verbatim, so how come gdb shows:

gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73230 <internal>, attrs=638, size=140737304053504) at /home/jenkins/docker/src/gc/cdgc/gc.d:1521

Either gdb is wrong and could make us look at the wrong place (wouldn't be surprised), memory is corrupted, or the generated code is wrong and we're hitting a special case.

Honestly, I wouldn't be surprised if that stack trace turned out to be wrong and the GC has nothing to do with it.

joseph-wakeling-sociomantic commented 7 years ago

@gavin-norman-sociomantic what tango runtime version/build are you using when generating this stacktrace?

I ask because when running with a debug build of tango (thanks to @nemanja-boric-sociomantic), I get a different (but clearly similar) stacktrace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73a30 <internal>, attrs=0, size=97) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:1520
1520    /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d: No such file or directory.
(gdb) bt
#0  gc.cdgc.gc.malloc(ulong, uint, ulong*) (pm_bitmask=0xa73a30 <internal>, attrs=0, size=97) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:1520
#1  0x000000000076810f in gc.cdgc.gc.calloc(ulong, uint, ulong*) (pm_bitmask=0xa73a30 <internal>, attrs=0, size=97) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:1599
#2  0x0000000000763940 in gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).__dgliteral123()).locked().__dgliteral123() (this=0x7f8f798c5a10) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:2604
#3  0x00000000007638b4 in gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_calloc(ulong, uint, object.PointerMap).__dgliteral123()).locked() (this=0x7f8f798c5a10) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:268
#4  0x0000000000763586 in gc_calloc (size=97, attrs=0, ptrmap=...) at /home/nemanjaboric/work/tangort/src/gc/cdgc/gc.d:2599
#5  0x000000000075b3e6 in _d_arraysetlengthT (ti=0xa5a640 <TypeInfo_AS5swarm3neo8protocol6socket9uio_const11iovec_const.init$>, newlength=6, p=0x7f8f7eba42f0) at /home/nemanjaboric/work/tangort/src/rt/lifetime.d:756
#6  0x00000000006e1409 in swarm.neo.protocol.socket.MessageGenerator.IoVecMessage.setup(swarm.neo.protocol.Message.MessageType, void[][], void[][]...) (this=0x7f8f7eba42f0, static_fields=..., dynamic_fields=..., type=1 '\001')
    at ./submodules/swarm/src/swarm/neo/protocol/socket/MessageGenerator.d:62
#7  0x0000000000692e5f in swarm.neo.protocol.socket.MessageSender.MessageSender.assign(swarm.neo.protocol.Message.MessageType, void[][], void[][]...) (this=0x7f8f7eba4200, static_fields=..., dynamic_fields=..., type=1 '\001')
    at ./submodules/swarm/src/swarm/neo/protocol/socket/MessageSender.d:166
#8  0x000000000070e7e3 in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.sendRequestPayload(ulong).__dgliteral567(void[][]) (this=0x7f8f798c5dc0, payload=...) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:423
#9  0x00000000006dffb0 in swarm.neo.connection.RequestOnConnBase.RequestOnConnBase.getPayloadForSending(void(void[][]) delegate) (this=0x7f8f7c534600, send=...) at ./submodules/swarm/src/swarm/neo/connection/RequestOnConnBase.d:1734
#10 0x0000000000751507 in swarm.neo.node.RequestSet.RequestSet.Request.getPayloadForSending(void(void[][]) delegate) (this=0x7f8f7c534600, send=...) at ./submodules/swarm/src/swarm/neo/node/RequestSet.d:148
#11 0x0000000000726fea in swarm.neo.node.Connection.Connection.getPayloadForSending(ulong, void(void[][]) delegate) (this=0x7f8f7c50bb00, send=..., id=11) at ./submodules/swarm/src/swarm/neo/node/Connection.d:216
#12 0x000000000070e6ee in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.sendRequestPayload(ulong) (this=0x7f8f7c50b600, id=11) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:419
#13 0x000000000070e6a2 in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.loop().__foreachbody1902(ref ulong) (this=0x7f8f798c5f08, __applyArg0=0x7f8f798c5e60) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:387
#14 0x00000000006e20d7 in swarm.neo.util.TreeQueue.TreeQueueCore.opApply(int(ref ulong) delegate) (this=0x7f8f7c50b660, dg=...) at ./submodules/swarm/src/swarm/neo/util/TreeQueue.d:462
#15 0x000000000070e5fb in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.loop() (this=0x7f8f7c50b600) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:386
#16 0x000000000070e55f in swarm.neo.connection.ConnectionBase.ConnectionBase.SendLoop.fiberMethod() (this=0x7f8f7c50b600) at ./submodules/swarm/src/swarm/neo/connection/ConnectionBase.d:320
#17 0x000000000076cfe3 in core.thread.Fiber.run() (this=0x7f8f7c4f4280) at /home/nemanjaboric/work/tangort/src/core/thread.d:3200
#18 0x000000000076cf01 in fiber_entryPoint () at /home/nemanjaboric/work/tangort/src/core/thread.d:2489
#19 0x0000000000000000 in ?? ()
gavin-norman-sociomantic commented 7 years ago

Starting and stopping a writer client while the Mirrors are running doesn't cause crashes.

gavin-norman-sociomantic commented 7 years ago

what tango runtime version/build are you using when generating this stacktrace?

I'm using the standard (release) runtime.

gavin-norman-sociomantic commented 7 years ago

Testing this, I just got a different segfault:

Program received signal SIGSEGV, Segmentation fault.
0x0000000000672e94 in dhtnode.request.neo.Mirror.MirrorImpl_v0.startIteration() (this=0x7fffe4803d78) at ./src/dhtnode/request/neo/Mirror.d:199
199         this.iterator.setStorage(this.channel);
(gdb) bt
#0  0x0000000000672e94 in dhtnode.request.neo.Mirror.MirrorImpl_v0.startIteration() (this=0x7fffe4803d78) at ./src/dhtnode/request/neo/Mirror.d:199
#1  0x0000000000689a36 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.refresh() (this=0x7fffe4803be0)
    at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:996
#2  0x0000000000689856 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.fiberMethod() (this=0x7fffe4803be0)
    at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:945
#3  0x000000000076b6ae in core.thread.Fiber.run() (this=0xa56540 <vtable for ocean.io.Stdout.TerminalOutput!(char).TerminalOutput>)
    at /home/jenkins/docker/src/core/thread.d:3200
#4  0x000000000076b608 in fiber_entryPoint () at /home/jenkins/docker/src/core/thread.d:2489
#5  0x0000000000000000 in ?? ()

Not sure if this is related.

nemanja-boric-sociomantic commented 7 years ago

https://github.com/sociomantic-tsunami/dhtnode/issues/12

nemanja-boric-sociomantic commented 7 years ago

I managed to get this one as well (also in GC, but not the same as our first):

(gdb) bt
#0  0x00000000007696c9 in gc.cdgc.bits.GCBits.clear(ulong) (this=0x101, 
    i=140737264569472) at /home/jenkins/docker/src/gc/cdgc/bits.d:131
#1  0x00000000000005ed in ?? ()
#2  0x00007ffff50b0040 in ?? ()
#3  0x00000000002e345c in ?? ()
#4  0x0000000000765597 in gc.cdgc.gc.malloc(ulong, uint, ulong*) (
    pm_bitmask=0xa71f90 <internal>, attrs=94626, size=28)
    at /home/jenkins/docker/src/gc/cdgc/gc.d:1523
#5  0x0000000000761d0a in gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).__dgliteral121()).locked().__dgliteral121() (this=0x7ffff2a94d00) at /home/jenkins/docker/src/gc/cdgc/gc.d:2590
#6  0x0000000000761c5e in gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).locked!(void*, gc.cdgc.gc.gc_malloc(ulong, uint, object.PointerMap).__dgliteral121()).locked() (
    this=0x7ffff2a94d00) at /home/jenkins/docker/src/gc/cdgc/gc.d:268
#7  0x0000000000761ae5 in gc_malloc (size=257, attrs=2, ptrmap=...)
    at /home/jenkins/docker/src/gc/cdgc/gc.d:2585
#8  0x000000000075f382 in _d_arrayappendcTX (ti=0x20, x=0x266b752d74636572, n=1)
    at /home/jenkins/docker/src/rt/lifetime.d:946
#9  0x00000000006898f9 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.Queue!(ulong).Queue.push(ulong) (this=0x7ffff2ac7da8, e=1495586554)
    at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:98
#10 0x00000000006887a5 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.refreshed(ulong) (this=0x7ffff2ac7d78, key=1495586554)
    at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:384
#11 0x00000000006894e2 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.refresh() (this=0x7ffff2ac7be0)
    at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:998
#12 0x00000000006892c4 in dhtproto.node.neo.request.Mirror.MirrorProtocol_v0.PeriodicRefresh.fiberMethod() (this=0x7ffff2ac7be0)
    at ./submodules/dhtproto/src/dhtproto/node/neo/request/Mirror.d:930
#13 0x000000000076ac3e in core.thread.Fiber.run() (this=0x266b752d74636572)
    at /home/jenkins/docker/src/core/thread.d:3200
#14 0x000000000076ab98 in fiber_entryPoint ()
    at /home/jenkins/docker/src/core/thread.d:2489
#15 0x0000000000000000 in ?? ()
(gdb) 
nemanja-boric-sociomantic commented 7 years ago

OK, this seems to get us a bit closer. I've put a scope (exit) Stderr.formatln("exiting refresh {}", this.outer.channel) statement here:

https://github.com/sociomantic-tsunami/dhtproto/blob/neo/src/dhtproto/node/neo/request/Mirror.d#L952
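
As an illustration of what that instrumentation does, here is a self-contained sketch using druntime's Fiber and Phobos I/O instead of ocean's Stderr (the fiber body and channel names are placeholders, not the Mirror code): whenever a refresh fiber terminates, its scope (exit) line is printed, so a channel whose fiber never terminates stands out by its missing "exiting" line.

    import core.thread : Fiber;
    import std.stdio : writefln;

    void main ( )
    {
        foreach ( channel; ["test", "test1", "test2"] )
        {
            auto refresh = new Fiber(
            {
                // Fires on any exit from the fiber body, including when an
                // exception is thrown into the fiber to kill it.
                scope ( exit )
                    writefln("exiting refresh %s", channel);

                // ... the periodic refresh work would go here ...
            });

            refresh.call(); // runs to completion here, printing the exit line
        }
    }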

Now, repeatedly start and stop the mirror requests:


Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test3
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5

All fine. But after some time, I see this:

Exiting Mirror.refresh fiber test
Exiting Mirror.refresh fiber test1
Exiting Mirror.refresh fiber test2
Exiting Mirror.refresh fiber test4
Exiting Mirror.refresh fiber test5

The segfault happens a few seconds later.

nemanja-boric-sociomantic commented 7 years ago

I've just noticed: all this time I had a heavy writer that wrote only to the test3 channel, and that's the channel that didn't get its refresh fiber killed.

nemanja-boric-sociomantic commented 7 years ago

I think it's definitely connected to the combination of regular updates + refresh updates. This time I had the client writing to the test2 and test3 channels on the node. Two restarts before the segfault, I got:

1503590083 Exiting Mirror.refresh fiber test
1503590083 Exiting Mirror.refresh fiber test1
1503590083 Exiting Mirror.refresh fiber test2
1503590083 Exiting Mirror.refresh fiber test3
1503590083 Exiting Mirror.refresh fiber test4
1503590083 Exiting Mirror.refresh fiber test5
1503590095 Exiting Mirror.refresh fiber test
1503590095 Exiting Mirror.refresh fiber test1
1503590095 Exiting Mirror.refresh fiber test4
1503590095 Exiting Mirror.refresh fiber test5
1503590097 Exiting Mirror.refresh fiber test
1503590097 Exiting Mirror.refresh fiber test1
1503590097 Exiting Mirror.refresh fiber test4
1503590097 Exiting Mirror.refresh fiber test5
1503590100 Exiting Mirror.refresh fiber test
1503590100 Exiting Mirror.refresh fiber test1
1503590100 Exiting Mirror.refresh fiber test4
1503590100 Exiting Mirror.refresh fiber test5

It's just these two channels for which the refresh fibers are not killed.

nemanja-boric-sociomantic commented 7 years ago

(And it took just 5 seconds between closing the connection that left the lingering refresh fibers and the segfault, which is exactly the refresh period.)

nemanja-boric-sociomantic commented 7 years ago

I think it's definitely connected to the combination of regular updates + refresh updates. This time I had the client writing to the test2 and test3 channels on the node. Two restarts before the segfault, I got:

Actually, this seems to be related not to whether the channels have heavy writers, but to whether the channels contain a lot of data, so that the refresh cycle takes longer.

joseph-wakeling-sociomantic commented 7 years ago

I think it's definitely connected to the combination of regular updates + refresh updates.

I just tried out an app built with the mirror refresh disabled: no sign of any segfaults so far. So I think you're on to something here.

nemanja-boric-sociomantic commented 7 years ago

Yes, I think I've managed to narrow it down to the combination of a non-empty channel and the periodic refresh. I no longer think regular updates are necessary for the segfault to occur.

gavin-norman-sociomantic commented 7 years ago

Running the test with a breakpoint on gc_collect, I got the segfault without a collection happening.

gavin-norman-sociomantic commented 7 years ago

Printing out the addresses of the acquired timers, and logging when a new timer is newed, everything seems fine and unrelated to the segfault. (One timer is newed per mirrored channel; they are always acquired once each per restart.)

gavin-norman-sociomantic commented 7 years ago

Not sure if it's relevant, but I just inspected the message being sent at the point of a segfault (at this line: https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/protocol/socket/MessageGenerator.d#L62):

dynamic_fields is of length 4 with the following fields:

static_fields is of length 1 with the following fields:

gavin-norman-sociomantic commented 7 years ago

Simply allocating the timer and not using it still produces the segfault.

gavin-norman-sociomantic commented 7 years ago

Looks like it's a bug in AcquiredResources in swarm. Replacing https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/util/AcquiredResources.d#L83 with auto buffer = new void[capacity]; fixes the segfault issue.

gavin-norman-sociomantic commented 7 years ago

Slightly narrowed down: changing https://github.com/sociomantic-tsunami/swarm/blob/v4.x.x/src/swarm/neo/util/AcquiredResources.d#L93 to this.acquired = cast(T[][])(new void[(T[]).sizeof * 4]); prevents the segfaults.
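
For context, a rough sketch of the kind of change these two workarounds make. The function below is illustrative only: the names and the non-workaround branch are assumptions, not the actual swarm AcquiredResources code.

    // Hypothetical sketch of a pooled-buffer acquire method.
    void[] acquire ( ref void[][] free_list, size_t capacity )
    {
        version ( Workaround )
        {
            // The change described above: always allocate a fresh buffer,
            // never hand back pooled memory.
            return new void[capacity];
        }
        else
        {
            // Assumed original behaviour: reuse a buffer from the free list
            // and only allocate when the list is empty.
            if ( free_list.length > 0 )
            {
                auto buffer = free_list[$ - 1];
                free_list.length = free_list.length - 1;
                return buffer;
            }
            return new void[capacity];
        }
    }

The fact that always allocating fresh memory avoids the crash points at the buffer-reuse path (and, as the following comments suggest, at the casts applied to the reused buffers) rather than at the GC itself.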

gavin-norman-sociomantic commented 7 years ago

We've made some progress. It seems that casting acquired void[]*s to pointers to arrays of other types (here) is dodgy.

Replacing the code of the Mirror request's queue so that it works directly with a void[]*, instead of casting it to the internal element type (i.e. E[]*), fixes the segfaults.
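
A minimal, self-contained illustration of the two access patterns described above (Phobos I/O for brevity; the queue layout and names are assumptions, not the dhtproto code):

    import std.stdio : writeln;

    // Stand-in for an acquired buffer: a pointer to a void[] slot in a pool.
    void[]* acquireVoidArray ( ref void[][] pool )
    {
        pool ~= (void[]).init;
        return &pool[$ - 1];
    }

    struct Queue ( E )
    {
        void[]* buffer; // acquired slot, owned by the pool

        // The pattern described above as dodgy: reinterpret the pooled
        // void[]* as an E[]* and append through the cast, so the append
        // operates on memory whose bookkeeping lives in a void[] pool slot.
        void pushViaCast ( E e )
        {
            auto typed = cast(E[]*) this.buffer;
            *typed ~= e;
        }

        // The pattern used in the fix: keep operating on the void[]* and
        // only view the bytes as E[] when reading.
        void pushViaVoid ( E e )
        {
            *this.buffer ~= (cast(void*) &e)[0 .. E.sizeof];
        }

        E[] elements ( )
        {
            return cast(E[]) *this.buffer;
        }
    }

    void main ( )
    {
        void[][] pool;
        auto q = Queue!(ulong)(acquireVoidArray(pool));
        q.pushViaVoid(42);     // pushViaCast() is shown only for contrast
        writeln(q.elements()); // [42]
    }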

nemanja-boric-sociomantic commented 7 years ago

Running the test with a breakpoint on gc_collect, I got the segfault without a collection happening.

gc_collect is just a handle to start a collection on demand. Uncommenting this line (or placing a breakpoint on fullcollect) shows that collections are happening:

https://github.com/sociomantic-tsunami/tangort/blob/v1.x.x/src/gc/cdgc/gc.d#L881

Doing GC.disable() at the beginning of the process makes the segfault go away.

This also indicates that we're leaking memory somewhere, as there shouldn't be any collections. I guess it's in the MessageGenerator, since that's where we allocate memory according to these backtraces.
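
The first two backtraces end in _d_arraysetlengthT called from IoVecMessage.setup, i.e. an iovec array being grown inside setup on each send. Below is a minimal sketch of a buffer-reuse pattern that bounds that kind of allocation; the struct and field names are placeholders, not the swarm code, and whether PR 118 does exactly this is not confirmed here.

    import std.stdio : writeln;

    // Stand-in for swarm's iovec_const, only here to make the sketch compile.
    struct iovec_const { const(void)* iov_base; size_t iov_len; }

    struct IoVecMessageSketch
    {
        // Reused across calls, so setup() only allocates when a message has
        // more fields than any previous one, instead of on every send.
        private iovec_const[] iov;

        iovec_const[] setup ( const(void)[][] fields ... )
        {
            if ( this.iov.length < fields.length )
                this.iov.length = fields.length; // grows (allocates) rarely

            foreach ( i, field; fields )
            {
                this.iov[i].iov_base = field.ptr;
                this.iov[i].iov_len = field.length;
            }

            return this.iov[0 .. fields.length];
        }
    }

    void main ( )
    {
        IoVecMessageSketch msg;
        auto vecs = msg.setup(cast(const(void)[]) "header",
            cast(const(void)[]) "payload");
        writeln(vecs.length); // 2
    }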

nemanja-boric-sociomantic commented 7 years ago

This also indicates that we're leaking memory somewhere, as there shouldn't be any collections. I guess it's in the MessageGenerator, since that's where we allocate memory according to these backtraces.

https://github.com/sociomantic-tsunami/swarm/pull/118

nemanja-boric-sociomantic commented 7 years ago

Applying https://github.com/sociomantic-tsunami/swarm/pull/118 makes collections very infrequent, which masks this issue.

nemanja-boric-sociomantic commented 7 years ago

Of course, eventually it will still segfault. What I've noticed is that, before the segfault, AcquiredResources can't find free elements (they are probably corrupted), so it keeps allocating more and more StorageEngineStepIterators, although there's a fixed number of simultaneous clients.

gavin-norman-sociomantic commented 7 years ago

Fixed by https://github.com/sociomantic-tsunami/dhtproto/pull/26.

gavin-norman-sociomantic commented 7 years ago

Fixed by submodule update in #40.