servalproject / serval-dna

The Serval Project's core daemon that implements Distributed Numbering Architecture (DNA), MDP, VoMP, Rhizome, MeshMS, etc.
http://servalproject.org
Other
170 stars 81 forks source link

Serval clients have trouble finding each other #101

Open Rubyfi opened 8 years ago

Rubyfi commented 8 years ago

We're trying to run performance tests for serval using different network topologies. One of them, for example, is a chained topology where 64 nodes are set up in a way that they have two network interfaces, each of them connecting the node to one of its neighbours.

This means we have 63 independent subnets, containing two nodes each. Sometimes when we boot up the nodes, the chain separates into two parts, where the nodes contained in one part don't find the ones contained in the other part. Let's assume the chain breaks between node 40 and 41.

If this happens you have to manually execute

mdp ping <ID of 40>

on node 41 and vice versa for node 40. Usually the ping will fail from one side, but will succeed from the other and the chain will complete automatically from that point.

Even scanning every single link of the chain using

servald scan <Broadcast IP>

doesn't help.

Any ideas what could cause this issue?

We're basically logging everything that serval does. So if you need any further details we're happy to send them to you.

lakeman commented 8 years ago

Birthday paradox?

When we send SID's in packets between network peers we abbreviate them to save bandwidth. We make sure to send more than enough bits to be disambiguated, based on the prefixes of every other SID we know, rounded up to the next whole byte. We add an extra byte when encoding our own SID in the packet header. https://github.com/servalproject/serval-dna/blob/development/overlay_address.c#L281

The client API requires a full SID to be entered from the command line and sent to the local node. Where it is added to the set of known SID's, which would force more bits to be sent for any other similar SID, resolving the ambiguity.

So we need a better method to detect collisions... ?

gardners commented 8 years ago

I've made a temporary work-around for this on the development branch, that allows a keypair to be generated with a specific 4 - 16 bit prefix. The idea is that this will allow a temporary work-around and also the means to write tests for the SID prefix-collision problem.

14ed84dabc69eeca9c934578db50da5608f5612a

lakeman commented 8 years ago

How about we don't push that to the development branch, I've added a branch called "asserts" for temporary changes for issues #101 & #102

On Fri, Apr 29, 2016 at 7:21 PM, gardners notifications@github.com wrote:

I've made a temporary work-around for this on the development branch, that allows a keypair to be generated with a specific 4 - 16 bit prefix. The idea is that this will allow a temporary work-around and also the means to write tests for the SID prefix-collision problem.

14ed84d https://github.com/servalproject/serval-dna/commit/14ed84dabc69eeca9c934578db50da5608f5612a

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/servalproject/serval-dna/issues/101#issuecomment-215672967

gardners commented 8 years ago

The above commit adds a new test to the tests/routing script that tries to reproduce this problem. However, so far, the test still passes.

Lars: You can now search for multiple_ids_same_prefix in tests/routing to see what I have done, and adjust the topology to suit the one you are having trouble with, possibly just by extending the chain from 4 nodes to more, or having more nodes being neighbours of each other, until we can reproduce the problem. Post a comment if you succeed in getting it to fail.

Paul.

jonashoechst commented 8 years ago

We tried to build your commit 382694464bd4f6d3722475ae9a78a480d5c407ed with openwrt, which failed with this message:

SERVALD CC overlay_buffer.c
In file included from strbuf.h:87:0,
                 from instance.h:25,
                 from serval.h:124,
                 from overlay_buffer.c:21:
overlay_buffer.c: In function '_ob_append_byte':
overlay_buffer.c:242:13: error: comparison of unsigned expression >= 0 is always true [-Werror=type-limits]
     assert(b->position >= 0);
             ^
cc1: all warnings being treated as errors
gardners commented 8 years ago

I've just committed the deletion of that line. Please try with that.

lakeman commented 8 years ago

Triggering the birthday paradox with immediate neighbours would require at least 2 bytes of matching prefix. Waiting for keyring add to generate a one in 64k probability value four times every time the test runs would make for a very annoying test. Though we could pay the price to generate keys once and import them during the test. A configuration that is easier to reproduce would involve at least 4 devices in a line, with the ends having only a one byte prefix match. Then the middle two nodes might get confused about which peer is which. We'd probably also need to ensure that the two edge links were enabled first, then the link between the middle nodes.

As I see it there are two issues here that we can improve.

Firstly if two neighbours have a prefix collision, we would currently just silently drop the packet. This is how we deal with broadcast packets that are echoed back by the network layer. We would need to check the source address of the packet against our interfaces to detect the collision and swap full SID's.

Secondly, if there is a collision between distant network nodes, the right fix for this would involve providing a signature for each link, for validation and to make forgery more difficult. To save bandwidth, I'd suggest that the receiver of a link signs only the transmitter -> receiver pair with some long lived nonce value. When you receive a signature you must verify it before using that link in your routing table or forwarding the signature onwards. Preferring links with signatures over paths with nodes running older versions of the software. If you can't verify it, assume that you and your neighbour are trying to talk about 2 SID's with the same prefix. So send the full SID of the transmitter and receiver you decoded back to the neighbour that sent you this signature. (We already have a message specifically for this). That should be enough to fix the problem.