sociomantic-tsunami / swarm

Asynchronous client/node framework library
Boost Software License 1.0

Fix crash when a request is finished during a throttledTaskPool resume #388

Closed bogdan-szabo-sociomantic closed 5 years ago

bogdan-szabo-sociomantic commented 5 years ago

It looks like the map is cleared immediately after a request finishes. If this happens during ISuspendableThrottler.throttledResume(), which iterates through the map, the app crashes.

gavin-norman-sociomantic commented 5 years ago

Could you explain how you tracked this down to this method? It'd be helpful to know more details about what happens when you call throttledResume, and what the call stack is that ends up in this method.

bogdan-szabo-sociomantic commented 5 years ago

I tracked this down over a couple of days.

I inspected this.map, checking its size, and discovered that all of its nodes are removed during the loop.

This is the stack:

```
core.exception.AssertError@./submodules/ocean/src/ocean/util/container/ebtree/c/eb64tree.d(39): THE LEAF_P CAN NOT BE NULL!!!!! <<<<< THE CRASH IS HERE!!!
----------------
src/core/exception.d:438 _d_assert_msg [0x8fdcc1]
./submodules/ocean/src/ocean/util/container/ebtree/c/eb64tree.d:39 ocean.util.container.ebtree.c.eb64tree.eb64_node* ocean.util.container.ebtree.c.eb64tree.eb64_node.next() [0x805218]
./submodules/swarm/src/swarm/neo/util/TreeMap.d:303 int swarm.neo.util.TreeMap.TreeMap!(swarm.neo.client.RequestOnConn.RequestOnConn.TreeMapElement).TreeMap.opApply(scope int delegate(ref swarm.neo.client.RequestOnConn.RequestOnConn)) [0x83f599]
./submodules/swarm/src/swarm/neo/client/RequestOnConnSet.d:199 int swarm.neo.client.RequestOnConnSet.RequestOnConnSet.opApply(scope int delegate(ref swarm.neo.client.RequestOnConn.RequestOnConn)) [0x8101c8]
./submodules/swarm/src/swarm/neo/client/RequestSet.d:475 void swarm.neo.client.RequestSet.RequestSet.Request.resumeSuspendedHandlers(int) [0x8bbf15]
./submodules/swarm/src/swarm/neo/client/mixins/BatchRequestCore.d:180 bool dlsproto.client.request.internal.GetRange.GetRange.__mixin9.Controller.resume() [0x832c4c]
./submodules/swarm/src/swarm/neo/client/mixins/Controllers.d:266 _D8dlsproto6client9DlsClient9DlsClient9__mixin383Neo9__mixin1665__T11SuspendableTC8dlsproto6client7request8GetRange11IControllerZ11Suspendable6resumeMFZ9__lambda1MFC8dlsproto6client7request8GetRange11IControllerZv [0x7dacd2]
./submodules/swarm/src/swarm/neo/client/mixins/ClientCore.d:867 bool dlsproto.client.DlsClient.DlsClient.__mixin38.Neo.__mixin15.controlImpl!(dlsproto.client.request.internal.GetRange.GetRange, dlsproto.client.request.GetRange.IController).controlImpl(ulong, scope void delegate(dlsproto.client.request.GetRange.IController)) [0x7db56a]
./submodules/dlsproto/src/dlsproto/client/mixins/NeoSupport.d:329 bool dlsproto.client.DlsClient.DlsClient.__mixin38.Neo.control!(dlsproto.client.request.GetRange.IController).control(ulong, scope void delegate(dlsproto.client.request.GetRange.IController)) [0x7db4be]
./submodules/swarm/src/swarm/neo/client/mixins/Controllers.d:134 void dlsproto.client.DlsClient.DlsClient.__mixin38.Neo.__mixin16.Controller!(dlsproto.client.request.GetRange.IController).Controller.control(scope void delegate(dlsproto.client.request.GetRange.IController)) [0x7db080]
./submodules/swarm/src/swarm/neo/client/mixins/Controllers.d:263 void dlsproto.client.DlsClient.DlsClient.__mixin38.Neo.__mixin16.Suspendable!(dlsproto.client.request.GetRange.IController).Suspendable.resume() [0x7dacb0]
./submodules/ocean/src/ocean/io/model/ISuspendableThrottler.d:230 void ocean.io.model.ISuspendableThrottler.ISuspendableThrottler.resumeAll() [0x8bfa20]
./submodules/ocean/src/ocean/io/model/ISuspendableThrottler.d:187 void ocean.io.model.ISuspendableThrottler.ISuspendableThrottler.throttledResume() [0x8bf9ae]
./submodules/ocean/src/ocean/task/ThrottledTaskPool.d:212 void ocean.task.ThrottledTaskPool.ThrottledTaskPool!(*****.map.MapTask.MapTask).ThrottledTaskPool.throttlingHook() [0x7dc81d]
./submodules/ocean/src/ocean/task/Task.d:473 bool ocean.task.Task.Task.entryPoint() [0x8d4beb]
./submodules/ocean/src/ocean/task/internal/FiberPoolWithQueue.d:175 _D5ocean4task8internal18FiberPoolWithQueue18FiberPoolWithQueue17workerFiberMethodMFZ7runTaskMFZv [0x8aeb08]
./submodules/ocean/src/ocean/task/internal/FiberPoolWithQueue.d:200 void ocean.task.internal.FiberPoolWithQueue.FiberPoolWithQueue.workerFiberMethod() [0x8aea94]
src/core/thread.d:4277 void core.thread.Fiber.run() [0x9093b1]
src/core/thread.d:3523 fiber_entryPoint [0x909293]
??:? [0xffffffff]
```
matthias-wende-sociomantic commented 5 years ago

> It looks like the map is cleared immediately after a request finishes. If this happens during ISuspendableThrottler.throttledResume(), which iterates through the map, the app crashes.

I think it's not possible to finish a request at the same time as the request fibers are being resumed. Keep in mind that we are using cooperative multitasking. It seems to me that this bug needs further investigation.

matthias-wende-sociomantic commented 5 years ago

> I think it's not possible to finish a request at the same time as the request fibers are being resumed. Keep in mind that we are using cooperative multitasking. It seems to me that this bug needs further investigation.

Resuming a suspendable neo all-node request via the corresponding ISuspendable instance iterates over a set of RoCs, calling resumeFiber for each RoC. I've therefore reconsidered my earlier assumption and think it might be possible, maybe depending on the individual request implementation, for a request to finish during the iteration.
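A minimal sketch of this control-flow property, using core.thread.Fiber directly rather than swarm's RequestOnConn wrapper: Fiber.call() jumps into the fiber body immediately and only returns to the caller once the fiber yields or terminates, so anything the fiber does, including finishing its request, happens before the iteration can continue.

```d
import core.thread : Fiber;
import std.stdio;

void main()
{
    auto fiber = new Fiber({
        writeln("  fiber: first run (a request could finish here)");
        Fiber.yield();
        writeln("  fiber: second run");
    });

    writeln("caller: resuming fiber");
    fiber.call(); // jumps into the fiber body *now*
    writeln("caller: fiber yielded, control is back");
    fiber.call(); // runs the fiber to completion
    writeln("caller: fiber finished");
}
```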

bogdan-szabo-sociomantic commented 5 years ago

@matthias-wende-sociomantic I updated the PR with your proposal. I won't add the assert here, since it looks more like a hack...

matthias-wende-sociomantic commented 5 years ago

Given an all-node request: iterating over the RoCs map (src/swarm/neo/client/RequestOnConnSet.d) via the opApply call operates on each RoC individually, using a caller-provided delegate. It might therefore happen that a RoC fiber is resumed – i.e. execution immediately jumps into the fiber method – before the iteration is finished.
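To illustrate the hazard with a toy container (not swarm's actual TreeMap): when the caller-provided delegate unlinks every element mid-iteration, the traversal's notion of "next" is destroyed. In this sketch the loop merely stops early; the intrusive ebtree instead trips the LEAF_P assert in eb64_node.next().

```d
import std.stdio;

struct Node
{
    int value;
    Node* next;
}

struct List
{
    Node* head;

    // Hands each node to the caller's delegate; reads n.next *after*
    // the delegate returns, so a delegate that unlinks nodes derails
    // the traversal.
    int opApply(scope int delegate(ref Node) dg)
    {
        for (auto n = this.head; n !is null; n = n.next)
        {
            if (auto r = dg(*n))
                return r;
        }
        return 0;
    }

    // Unlinks every node, as RequestOnConnSet.reset() removes every
    // RoC from the map.
    void clear()
    {
        for (auto n = this.head; n !is null;)
        {
            auto next = n.next;
            n.next = null;
            n = next;
        }
        this.head = null;
    }
}

void main()
{
    auto nodes = new Node[3];
    List list;
    foreach_reverse (i, ref n; nodes)
    {
        n.value = cast(int) i;
        n.next = list.head;
        list.head = &n;
    }

    foreach (ref n; list)
    {
        writeln("visiting node ", n.value);
        list.clear(); // mutating the container mid-iteration
    }
    // Prints only "visiting node 0": nodes 1 and 2 are never visited.
}
```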

Whenever a RoC finishes, the handlerFinished method (src/swarm/neo/client/RequestSet.d) is called, and if this was the last unfinished RoC, RequestOnConnSet.reset() is also invoked, which removes every RoC from the RoCs map.

It can now happen that, during an iteration, all but one of the RoCs are finished – i.e. RequestOnConnSet.finished() is called – and that the operation on the last remaining RoC causes it to finish as well.

Returning to the iteration shouldn't cause an error, since the TreeMap in use is expected to cope with its elements changing during an iteration.

Somehow this last assumption doesn't seem to hold, and the iteration segfaults.

To avoid this segfault, the fix stops the iteration once all RoCs are finished.
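A sketch of the resulting loop shape; the names Request, rocs and resumeFiber below are illustrative stand-ins, not swarm's real API. The iteration bails out as soon as the request reports itself finished, instead of relying on the emptied map surviving one more step.

```d
import std.stdio;

// Illustrative stand-ins for swarm's request types. A D array foreach
// iterates over a slice captured at loop entry, so unlike the
// intrusive ebtree this toy cannot segfault; the point here is the
// early `break`.
class Request
{
    int[] rocs;     // stand-ins for the RequestOnConn instances
    bool finished;

    this()
    {
        this.rocs = [1, 2, 3];
    }

    void resumeFiber(int roc)
    {
        writeln("resumed RoC ", roc);
        // Simulate the race: resuming this RoC runs its fiber, which
        // finishes the whole request and empties the set.
        this.finished = true;
        this.rocs = null;
    }

    void resumeSuspendedHandlers()
    {
        foreach (roc; this.rocs)
        {
            this.resumeFiber(roc);
            if (this.finished)
                break; // the fix: stop iterating once the request is done
        }
    }
}

void main()
{
    auto req = new Request;
    req.resumeSuspendedHandlers();
    assert(req.finished);
}
```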