servalproject / serval-dna

The Serval Project's core daemon that implements Distributed Numbering Architecture (DNA), MDP, VoMP, Rhizome, MeshMS, etc.
http://servalproject.org

servald scan crashing #98

Closed: gh0st42 closed this issue 8 years ago.

gh0st42 commented 8 years ago

While doing large-scale tests with serval-dna, we encountered a strange bug. In one setup with an 8x8 (later 10x10) grid of nodes, some nodes keep crashing. When testing in core-network, we still need to run "servald scan" because, for whatever reason, servald doesn't detect the network setup on its own. So far so good: with the scan option it still works. But of the 8x8 = 64 nodes, about 9 always crash. It seems to be related to the scan option. We see similar behavior in our QEMU-based miniworld testbed, where we don't use the scan option; I will report that separately after looking into it some more.

A complete log is attached: serval-20160314130000.log.txt

lakeman commented 8 years ago

First thought: what log output do you get if you turn debug.overlaybuffer on?

Looking at the stack trace: since ob_overrun returns true if position > sizelimit, I don't think we should assert the opposite here: https://github.com/servalproject/serval-dna/blob/d0da910b19e05bfbfe87a29bc72b4c7f4505e212/overlay_buffer.c#L170

That assert seems fishy and may be a false positive. The whole point of having an ob_overrun method is to allow the position to run past the limit, so it makes no sense to assert that the position is never past the limit.

I think we can simply remove that assert.
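To illustrate the argument, here is a minimal C sketch of the pattern, assuming (this is not taken from serval-dna) that writes past the limit are dropped but still advance the position, so the caller checks for an overrun once at the end. All names (`struct sketch_buffer`, `sb_init`, `sb_append_byte`, `sb_overrun`, `sb_position_within_limit`) are hypothetical; only the "overrun means position > sizelimit" rule comes from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified buffer; NOT serval-dna's struct overlay_buffer. */
struct sketch_buffer {
    uint8_t *bytes;    /* backing storage, at least sizelimit bytes */
    size_t position;   /* next write offset; may run past sizelimit */
    size_t sizelimit;  /* usable capacity */
};

static void sb_init(struct sketch_buffer *b, uint8_t *storage, size_t limit)
{
    assert(b != NULL && storage != NULL);
    b->bytes = storage;
    b->position = 0;
    b->sizelimit = limit;
}

/* Assumed write semantics: a byte that does not fit is dropped, but the
   position still advances, so a whole sequence of writes can be checked
   for overrun once instead of after every call. */
static void sb_append_byte(struct sketch_buffer *b, uint8_t value)
{
    if (b->position < b->sizelimit)
        b->bytes[b->position] = value;
    b->position++;
}

/* Same rule as described above: overrun means position > sizelimit. */
static bool sb_overrun(const struct sketch_buffer *b)
{
    return b->position > b->sizelimit;
}

/* The condition an assert(position <= sizelimit) would enforce. It is the
   exact negation of sb_overrun(), so it is false in precisely the state
   that sb_overrun() exists to report. */
static bool sb_position_within_limit(const struct sketch_buffer *b)
{
    return b->position <= b->sizelimit;
}
```

For example, with a 4-byte limit, four appends leave `sb_overrun()` false. A fifth append drops its byte and moves the position to 5, so `sb_overrun()` becomes true and `sb_position_within_limit()` becomes false. Under these assumptions, that is a legitimate, detectable state, and an assert on the within-limit condition would abort the daemon in that state instead of letting the caller handle the overrun.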

Andrew, could you chime in here?


gh0st42 commented 8 years ago

Okay, debug.overlaybuffer is a real killer for my machine with 8x8 nodes :D, but here is the 12 MB log file. Hope it helps.

serval-20160314140000.log.txt