servalproject / serval-dna

The Serval Project's core daemon that implements Distributed Numbering Architecture (DNA), MDP, VoMP, Rhizome, MeshMS, etc.
http://servalproject.org

RFC: Fix rhizome synchronisation #120

Closed b0661 closed 6 years ago

b0661 commented 7 years ago

Request for comment

Rhizome sync does not properly synchronize bundles that are added to a running daemon. It also does not resynchronize broken bundles while the daemon is running (see issue #117). These changes make that work. The changes pass all rhizomeprotocol tests.

Please review and give comments on how to make this acceptable to upstream.

The changes to rhizome_sync.c, rhizome_sync_bars.c and rhizome_sync_keys.c are best viewed in the final files. The changes to sqlite may simply be ignored.

Solution

Both rhizome sync protocol versions (version 0 == rhizome sync bars, version 1 == rhizome sync keys) are extended to handle broken bundles during daemon runtime. The version 0 protocol is kept the same. For the version 1 protocol, only the synchronisation key announcement was extended to also carry the new synchronisation tree sequence number; current daemons will simply ignore it.
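
As an illustration only (this is not the patch's actual encoding; the names and sizes below are assumptions), appending the sequence number after the existing announcement fields keeps the message backward compatible, because older daemons stop parsing after the fields they know:

#include <stdint.h>
#include <string.h>

#define SYNC_KEY_BYTES 8   /* assumed key size for this sketch */

struct sync_announcement {
  uint8_t key[SYNC_KEY_BYTES]; /* existing announcement content */
  uint32_t tree_sequence;      /* new: bumped whenever the tree is rebuilt */
};

/* Encode the announcement; the sequence number is simply appended, so an
 * old daemon that stops reading after the key ignores it. */
static size_t announcement_encode(uint8_t *buf, size_t buf_len,
                                  const struct sync_announcement *a)
{
  if (buf_len < SYNC_KEY_BYTES + 4)
    return 0;
  memcpy(buf, a->key, SYNC_KEY_BYTES);
  buf[SYNC_KEY_BYTES + 0] = (uint8_t)(a->tree_sequence >> 24);
  buf[SYNC_KEY_BYTES + 1] = (uint8_t)(a->tree_sequence >> 16);
  buf[SYNC_KEY_BYTES + 2] = (uint8_t)(a->tree_sequence >> 8);
  buf[SYNC_KEY_BYTES + 3] = (uint8_t)(a->tree_sequence);
  return SYNC_KEY_BYTES + 4;
}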

Changes

rhizome_sync.c was split into rhizome_sync.c and rhizome_sync_bars.c. rhizome_sync.c holds the general synchronisation mechanism that is independent of the protocol versions. rhizome_sync_bars.c provides the synchronisation functionality related to the version 0 protocol. rhizome_sync_keys.c is still responsible for the version 1 protocol.

The protocol version can either be set by configuration or is selected automatically (the highest available version wins).
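
A minimal sketch of that selection rule (the identifiers are illustrative, not the patch's):

enum sync_protocol { SYNC_PROTOCOL_BARS = 0, SYNC_PROTOCOL_KEYS = 1 };

/* Pick the configured protocol if one is forced, otherwise the highest
 * version supported by both sides. */
static enum sync_protocol select_sync_protocol(int configured_version, /* -1 = automatic */
                                               int peer_highest_version)
{
  if (configured_version >= 0)
    return (enum sync_protocol)configured_version;
  return peer_highest_version >= SYNC_PROTOCOL_KEYS
       ? SYNC_PROTOCOL_KEYS : SYNC_PROTOCOL_BARS;
}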

There are a great number of renames in the code, intended to make the code's functionality easier to understand. In general, a "bars" / "_BARS_" component was added to the name of any function, variable or define that relates to protocol version 0.

rhizome_sync.c is now the central control for synchronisation and calls into rhizome_sync_bars.c/rhizome_sync_keys.c for the protocol-specific functions. The following functionality is implemented:

rhizome_sync_bars.c was extended with state info for announcements and broken payloads. This way newly added bundles now trigger an announcement and broken bundles can trigger a refetch.

rhizome_sync_keys.c was extended with a sequence number for the synchronisation tree. Detection of broken bundles now forces a rebuild of the whole synchronisation tree and an increment of the sequence number. All peers can detect that and rebuild their synchronisation tree with respect to the peer with the broken bundle(s).

lakeman commented 7 years ago

This monolithic patch makes it quite difficult to separate where you have simply renamed or moved code, attempted to add documentation, vs where you have added functionality.

Some high level comments;

The "sync bars" protocol should be considered deprecated. We should stop sending these packets, unless we receive one from a peer first. Or only if explicitly configured for testing purposes. Cutting out this background file I/O should help with battery life etc. Ideally we would test running the old version of the code to really verify compatibility, but we don't have to automate that now.

Having a hard coded transfer limit is not a good idea. One of the core design criteria for rhizome, is that important content travels fast. When we discover that a peer has something more important that we need, we must pause everything else and transfer that first. If we discover transfer_max_init videos of cats before we discover an SOS message, we must still transfer that emergency message first.

The problem I think you are trying to solve here, is restarting failed transfers. You need to bear in mind that some transfers are rejected because the content is not interesting, and these rejected transfers should not be retried. IMHO it should be the receiver of the transfer that keeps track of other failures and attempts to restart them.

Yes I think we do need some kind of sync token that can be used to wipe the peer's sync state. Mostly to handle software restarts. The key sync protocol will only drill down the tree to identify differences that both peers cannot yet explain. Once a node understands the difference, it will stop responding to a peer's attempts to query that part of the tree. However I think this is best achieved by sending each peer a copy of their sync_peer_state->root->message. Both parties should agree on the contents of this node, if they don't they should throw the state away and start again. This should only be done when we aren't making progress, and have nothing more important to send.

With that in place, the receiver side can have confidence that enumerating the keys in this tree will correctly reflect the manifests they don't have, and remember which ones should be transferred or were found to be uninteresting, so that transfers can be resumed after any failures.

b0661 commented 7 years ago

Thank you for your comment.

This monolithic patch makes it quite difficult to separate where you have simply renamed or moved code, attempted to add documentation, vs where you have added functionality.

You are right, I failed to create a patch sequence that helps understanding - and I still do not know how to do it. My intention was to get feedback on the changed implementation as such.

Shall I separate the big patch by rename, move and documentation or shall I separate it by functions? Both ways will create patches that will not compile.

The problem I think you are trying to solve here, is restarting failed transfers.

In my limited test scenario (10 peers in direct sight, each scheduling a single big bundle transfer with a time gap of 1 second) I had two problems: 1) adding a bundle during serval daemon runtime did not lead to synchronization; 2) after resolving that, congestion (?) prevented some of the peers from getting all other 9 bundles, even after waiting a very long time.

At least problem 1 cannot be attributed to failed transfers, I think.

The "sync bars" protocol should be considered deprecated. We should stop sending these packets, unless we receive one from a peer first. Or only if explicitly configured for testing purposes.

OK. This can be changed to only send sync bars announcements if there is at least one peer using this protocol, or if it is forced by configuration. Counting of peers per protocol is already implemented.
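
A sketch of that gating, assuming a hypothetical per-protocol peer counter (these are not the actual field names):

#include <stdbool.h>

struct sync_peer_counts {
  unsigned bars_peers;  /* peers seen speaking the version 0 (bars) protocol */
  unsigned keys_peers;  /* peers seen speaking the version 1 (keys) protocol */
};

/* Only announce via the deprecated bars protocol when a peer actually uses
 * it, or when configuration forces it on (e.g. for testing). */
static bool should_send_bars_announcements(const struct sync_peer_counts *counts,
                                           bool forced_by_config)
{
  return forced_by_config || counts->bars_peers > 0;
}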

Having a hard coded transfer limit is not a good idea. One of the core design criteria for rhizome, is that important content travels fast. When we discover that a peer has something more important that we need, we must pause everything else and transfer that first. If we discover transfer_max_init videos of cats before we discover an SOS message, we must still transfer that emergency message first.

I used the transfer limit to reduce the congestion (?) in the test scenario (see above). But I'm still unsure why it helps. It seems that if too many (>6?) bundle syncs are ongoing, there is a high risk of one of the syncs becoming stale.

The transfer limit can be calculated dynamically based on the load (number of transfers or any other metric) of a peer. I already tested this and it works. But this does not solve the priority problem.

I was also thinking of using a cross-layer metric like the "getting behind" indication. But that would require changes to even more parts of the daemon that I do not fully understand.

You need to bear in mind that some transfers are rejected because the content is not interesting, and these rejected transfers should not be retried.

If a tree re-sync is done due to a broken bundle, I thought only interesting bundles would be resynchronized.

/* Do a synchronisation tree re-sync if appropriate.

  Re-synchronization is done by deleting the current sync tree and incrementing the tree sequence number.

  Re-synchronisation is done if:
    There is no transfer ongoing
    _AND_ There was no re-sync done for at least an hour
          _OR_ We have broken bundles
               _AND_ the last broken bundle was detected after we (re-)started synchronization
               _AND_ there was no new broken bundle within the last 10 seconds.
 */
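
A sketch of that condition in code (the field names and timer representation are assumptions, not the patch's actual identifiers):

#include <stdbool.h>
#include <stdint.h>

typedef int64_t time_ms_t;  /* milliseconds, for this sketch */

struct resync_state {
  unsigned transfers_in_progress;
  time_ms_t last_resync;          /* when the sync tree was last rebuilt */
  time_ms_t sync_started;         /* when synchronisation was (re-)started */
  time_ms_t last_broken_detected; /* 0 if no broken bundle was detected */
};

static bool should_resync(const struct resync_state *s, time_ms_t now)
{
  if (s->transfers_in_progress)
    return false;                                      /* a transfer is ongoing */
  bool stale = now - s->last_resync >= 60 * 60 * 1000; /* no re-sync for an hour */
  bool broken = s->last_broken_detected > s->sync_started
             && now - s->last_broken_detected >= 10 * 1000; /* quiet for 10 s */
  return stale || broken;
}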

A bundle that was broken before the last re-sync should not trigger a re-sync again.

What scenario do you have in mind?

IMHO it should be the receiver of the transfer that keeps track of other failures and attempts to restart them.

To do that we have to know from whom to request the bundle. In case of broken bundles we don't have that info any more. The key is already deleted from the sync tree.

I already implemented a simple mechanism for that - just ask a next hop peer randomly. That is why

/** Request peer to transfer bundle to us by sending manifest request to peer.

 @param peer Peer to send to.
 @param key Hash of manifest file
 @param bar bundle advertisement record or NULL.
 @return transfer_count 1 if transfer was scheduled, 0 otherwise.
*/
static int sync_keys_transfer_request(struct subscriber *peer, const sync_key_t *key, const rhizome_bar_t *bar)

exists and can additionally work with bundle manifests and not only with bars as before.

In the end I decided to remove that and replace it by the re-sync tree mechanism. The receiver triggers the tree re-sync and indirectly restarts the transfer.

What would be a proper solution to find out whom to ask for the bundle to restart the transfer directly?

Yes I think we do need some kind of sync token that can be used to wipe the peer's sync state. Mostly to handle software restarts. The key sync protocol will only drill down the tree to identify differences that both peers cannot yet explain. Once a node understands the difference, it will stop responding to a peer's attempts to query that part of the tree. However I think this is best achieved by sending each peer a copy of their sync_peer_state->root->message.

Sorry, I did not dive into sync keys very deep. Due to the lack of documentation I tried to use it as a black box. Is there any documentation besides reading the source code?

Both parties should agree on the contents of this node, if they don't they should throw the state away and start again. This should only be done when we aren't making progress, and have nothing more important to send.

This is what is done with the tree re-sync using the tree sequence number. The tree sequence number is added to the announcement.

How can that be done by the sync_peer_state->root->message?

With that in place, the receiver side can have confidence that enumerating the keys in this tree will correctly reflect the manifests they don't have, and remember which ones should be transferred or were found to be uninteresting, so that transfers can be resumed after any failures.

I'm a little bit puzzled, as you commented before:

"You need to bear in mind that some transfers are rejected because the content is not interesting, and these rejected transfers should not be retried."

You propose a tree re-sync that is initiated by the sync_peer_state->root->message. So the method is the same, just the means to initiate the tree re-sync are different to my solution.

Does a tree re-sync trigger these unwanted retries or does it not trigger the unwanted retries?

lakeman commented 7 years ago

Yes, this needs better documentation. From the top down;

The hash of every manifest that we have is added to the sync tree; https://github.com/servalproject/serval-dna/blob/development/rhizome_sync_keys.c#L513

The root of that tree is stored in sync_state->root. Each node in the tree has a message; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L42

Where the message key starts with prefix_len bits that match all leaf nodes under that node. And the remainder of the message key is the XOR of all leaf nodes. (prefix_len can also be thought of as the depth of this node within the tree).
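
For illustration only (this is not the sync_keys.c code), a parent node's key can be maintained as the running XOR of every leaf beneath it, with prefix_len recording the shared leading bits:

#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 8  /* assumed key size for this sketch */

struct tree_node {
  uint8_t key[KEY_LEN]; /* XOR of all leaf keys under this node */
  uint8_t prefix_len;   /* number of leading bits shared by those leaves */
};

/* Adding or removing a leaf is the same operation: XOR it into the node. */
static void node_xor_leaf(struct tree_node *node, const uint8_t leaf_key[KEY_LEN])
{
  for (size_t i = 0; i < KEY_LEN; i++)
    node->key[i] ^= leaf_key[i];
}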

While processing messages from peers, we build another tree in sync_peer_state->root. Where each leaf node describes a difference between the keys we have, and the keys this peer has. Either because we have something they do not, or they have something that we do not.

This sync process was originally prototyped and tested over here; https://github.com/lakeman/sync

I have a high degree of confidence that any group of nodes should reach consensus about the differences between these sync trees. So when you enumerate those differences, you should be able to trust them. If you turned on debug.rhizome_sync_keys you should see the same differences logged by any pair of nodes.

If you wipe your peer state and start again, without communicating this to other peers, I know that this confidence is lost. So as I said, I agree that we need some kind of token to demonstrate to peers that we have re-started and/or wiped our peer state. And if for any reason we aren't making any progress towards consensus, we need a way to fix that.

The generic way that sync_keys.c is written may need to change. So that, as I said before, when we have nothing else to send; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L488

We can send each of our neighbours (reachable & REACHABLE_DIRECT) a copy of the peer->root message that relates to them. Which means that we need to know the struct subscriber * this relates to so we can encode their id.

Now the transfer process built on top of this sync tree obviously needs some (re)work. The fetching side of the transfer should keep some state about each transfer. Instead of hoping that nothing will go wrong during any step of the process. It should be able to fetch segments of the same payload from multiple neighbours, with stronger resume / retry semantics.

Note that a reason your test might fail is due to sqlite locking. We don't deal with that well everywhere. If you are adding bundles through the CLI interface, then we will have a situation where two processes are competing for exclusive file access to the DB. You can avoid that by using the restful interface instead. Now that the restful interface exists, we can re-write the cli interface to use the restful interface under the hood. Which would be another way of avoiding the DB locking problem.

Longer term we want to remove sqlite and use some other kind of storage / indexing layer.


b0661 commented 7 years ago

I now understand that sync_state->root->message is a unique token for the current state of the sync tree. The sync tree describes - probably after some time for alignment - the current state of the local bundle store.

When synchronizing between peers there are at least two questions to be answered for the sync tree:

  1. Are we working on the peer's current sync tree, i.e. has the tree changed?
  2. Is the sync tree fully synchronized between the peers?

If we always wait to the end of the sync tree synchronisation (however we detect that), sync_state->root->message answers both questions. The sync tree sequence number answers question (1) already during synchronisation but does not answer question (2).

In my view it is acceptable to init bundle transfers (sending bars) using an only partially synchronized tree, as long as we work on the right tree. If we postpone the transfer initialisation until we really know the answer to question (2), we may become stuck due to transmission failures.

If we use sync_state->root->message to answer question (1) during sync tree synchronisation I assume we may get false tree changed information because the sync tree may be currently updated from (or aligned to) the bundle store. For a big bundle store it may take some time until the sync tree really reflects the bundle store content.

I also think there are different actions needed if the answer is NO to one of the questions:

My conclusion:

If you agree, I would re-design my solution accordingly.

lakeman commented 7 years ago

I just pushed 9ec46f22, which should retry most operations that might fail due to database file locking. That should make the sync keys process much more reliable. Though some of the other suggestions above should still be done eventually.

lakeman commented 7 years ago

The sync tree will always include all bundles currently in our store. We add elements to this tree whenever a bundle is added to the store within the daemon process; https://github.com/servalproject/serval-dna/blob/development/rhizome_database.c#L1412

If the bundle store is written to by another process, it will be detected here when the next sync bar's alarm fires; https://github.com/servalproject/serval-dna/blob/development/rhizome_sync.c#L277

Since we plan to deprecate bar syncing, we will need a replacement for this trigger. Perhaps by sending a message to a running serval daemon with an implementation similar to "servald config sync".

Every time the sync process detects a difference between our tree and a peer's tree, we immediately queue the transfer with a SEND_BAR. But we don't need to wait for the transfer to complete before we can detect that the key sync process can explain all differences between trees.

Whenever we process a key sync message, we XOR away any differences between our tree and the peer's tree that we have already detected; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L710

If there are no further differences in this [sub]tree, then we stop processing this message; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L740 All other cases should provoke some kind of reply.

If we find that we can explain all differences in our root nodes (min_prefix == 0) we could mark the sync_peer_state to indicate that the key sync process has stabilised (for now). Clearing this mark any time we process a message with a difference we can't explain. This should at least be useful for debugging.
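
A sketch of such a flag (hypothetical struct and names, purely illustrative):

#include <stdbool.h>

struct peer_sync_flags {
  bool stabilised; /* key sync with this peer currently appears complete */
};

/* Set the flag on a root-level message (min_prefix_len == 0) with no
 * unexplained difference; clear it whenever a new difference shows up. */
static void update_stabilised(struct peer_sync_flags *flags,
                              unsigned min_prefix_len, bool unexplained_difference)
{
  if (unexplained_difference)
    flags->stabilised = false;
  else if (min_prefix_len == 0)
    flags->stabilised = true;
}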

Storing a new bundle can evict older or larger bundles, but we don't currently remove these bundles from the tree in memory. If we've already completed the key sync process with a peer, changing our tree would force some network traffic to be generated. So if we do decide to update our tree, we'll need to consider the network congestion impact. Perhaps marking nodes for future removal, but only really removing them when a peer doesn't know about that key and it would cause even more traffic to send it.

b0661 commented 7 years ago

I think we are targeting different failure scenarios. My use cases are:

Your commit 9ec46f2 targets sqlite file locking. I assume my issues are not related to sqlite file locking. That's why my tests are still failing.

I amended the tests with these use cases - see patch 0001-Added-tests-for-rhizome-sync-behaviour-to-be-fixed.patch.txt

Could you please test in your environment? Maybe my environment is somehow 'wrong'.

lakeman commented 7 years ago

I'm fairly certain that we're talking about the same issues. I resolved some non-deterministic failures I was seeing while running tests/meshms test_MessageThreading(). Which also adds bundles using the cli while the daemon is running, then periodically reads bundles out until the sync is complete. This was getting in the way of testing the broadcast messaging feature I'm working on. While I've made running this existing test more reliable, I'm not claiming that I've fixed every case where a sqlite lock causes the bar / manifest / payload transfer steps to fail.

I'm less concerned about trying to recover from lost / corrupted content without a restart. I believe the most likely cause of such corruption would involve sudden power loss or failing hardware. I'm not saying we shouldn't have a fix for this case, and I will accept patches. Provided the changes don't compromise our other requirements.

So to restate what I think needs to be done;

Detect when two peers have failed to reach consensus on their tree differences, triggering a restart of the key sync process. But not a restart of any in progress transfers.

Remove keys from the tree intelligently. To communicate corrupt / missing payloads and trigger a re-download. Without wasting bandwidth communicating bundle evictions that are not going to be re-downloaded anyway.

Make sure any sqlite locking errors trickle up the call stack anywhere that might impact the transfer process. (and everywhere else...)

Ensure the transfer process can pause and resume after any sqlite busy errors.

Add a fall back restart of transfers, that doesn't cause excessive chatter if the transfer is legitimately rejected.

Eventually, re-write the transfer process to work in a similar manner to bittorrent; Transfer pieces from multiple peers, journal partial transfer state and reload after restart, ....

b0661 commented 7 years ago

Some design ideas:

Detect when two peers have failed to reach consensus on their tree differences, triggering a restart of the key sync process. But not a restart of any in progress transfers.

For consensus evaluation I propose a simple mechanism that gives 'credit points' to sync messages that provide new sync content.
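
A sketch of the bookkeeping this would need (the 60-second figure and all names are illustrative):

#include <stdbool.h>
#include <stdint.h>

typedef int64_t time_ms_t;

struct consensus_state {
  unsigned credits;        /* messages that taught us something new */
  time_ms_t last_progress; /* when the last credit point was earned */
};

/* Call whenever a received sync message carries new sync content. */
static void consensus_credit(struct consensus_state *c, time_ms_t now)
{
  c->credits++;
  c->last_progress = now;
}

/* Consensus is considered stalled when differences remain but no credit
 * has been earned for a while (e.g. 60 seconds). */
static bool consensus_stalled(const struct consensus_state *c, time_ms_t now,
                              bool differences_remain)
{
  return differences_remain && now - c->last_progress > 60 * 1000;
}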

A key sync process restart shall ensure that both peers wipe out the other peer's sync tree info. Use the MSP connection and a classical 3-way handshake with a sync token to identify the re-sync run.
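
A sketch of the handshake messages (hypothetical types; the real MSP framing is not shown):

#include <stdint.h>

enum resync_msg_type {
  RESYNC_REQUEST = 1, /* initiator proposes a token for this re-sync run */
  RESYNC_ACK     = 2, /* responder accepts and wipes its state for the peer */
  RESYNC_CONFIRM = 3  /* initiator wipes its state and sync starts over */
};

struct resync_msg {
  uint8_t  type;  /* one of enum resync_msg_type */
  uint32_t token; /* identifies the re-sync run on both sides */
};

Only after the confirm step is seen would both sides treat the old per-peer state as gone, so neither side is left holding stale state for the other.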

Remove keys from the tree intelligently. To communicate corrupt / missing payloads and trigger a re-download. Without wasting bandwidth communicating bundle evictions that are not going to be re-downloaded anyway

As this should usually be a rare situation, a kind of out-of-band signalling might be sufficient, without restarting the sync process.

Instead of removing the key, append the key to the key messages sent, to indicate a search for the bundle for re-download. The keys may only be appended for bundles according to a chosen policy. On reception of such a key the peer re-inserts the key's node into the peer sync tree associated with the sender (the sender misses the bundle) but does not update the tree otherwise. This will trigger the normal transfer init. The sender will receive a bar and, as the bundle is now interesting (because it is broken), will request the download.
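
A sketch of the sender side of that signalling (the limit and all names are assumptions):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 8              /* assumed key size for this sketch */
#define MAX_REDOWNLOAD_KEYS 4  /* keep the out-of-band part of a packet small */

struct redownload_list {
  uint8_t keys[MAX_REDOWNLOAD_KEYS][KEY_LEN];
  size_t count;
};

/* Remember a broken bundle's key so it can be appended to the next outgoing
 * key message; the receiver re-inserts it into its per-peer tree, which
 * triggers the normal transfer initiation. */
static bool redownload_request_add(struct redownload_list *list,
                                   const uint8_t key[KEY_LEN])
{
  if (list->count >= MAX_REDOWNLOAD_KEYS)
    return false;
  memcpy(list->keys[list->count++], key, KEY_LEN);
  return true;
}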

Add a fall back restart of transfers, that doesn't cause excessive chatter if the transfer is legitimately rejected.

Maybe the above-mentioned out-of-band signalling can also be used for this scenario.

Eventually, re-write the transfer process to work in a similar manner to bittorrent; Transfer pieces from multiple peers, journal partial transfer state and reload after restart, ...

The structural changes that I made should allow for the easy addition of a version 2 protocol with automatic switch-over. Dividing bundles into chunks and synchronizing the chunks via invertible Bloom lookup tables may be a candidate.

lakeman commented 7 years ago

So, I've just pushed 585e573e to return BUSY codes for the remaining cases I've seen occur in rhizome sync transfers.

Plus we weren't dealing with adding the first bundle to an empty store, so I've re-worked how we call our bundle_add trigger in 2540c9e6. It should now be easier to stop the old sync bars process, but I haven't done that, and have no plans to.

I believe that your test cases should work now. The other items from my earlier work list should still be done eventually to improve the transfer process. Now that I believe that these obvious bugs have been resolved, back to your suggestion above.

Failing to reach consensus in the tree sync should be rare and easily detectable. We need to bear in mind that the key sync process occurs over broadcast messages.

Consider the case of a collection of devices that successfully complete their sync process, followed by a new node entering the same airspace. This will cause each existing node to send lots of messages that are only useful to the new node. We don't want to destroy any state in this case.

b0661 commented 7 years ago

Thank you, now it works for me.

As expected, missing/broken payload re-fetch without a restart still has to be resolved.

b0661 commented 7 years ago

I plan to rework my patches according to your recommendations. I wonder whether the following code is good to detect whether two peers are in sync.

// decode a sync key message
int sync_message_decode(const uint8_t *message, size_t message_len, sync_key_t *key, int *stored, uint8_t *prefix_min_len, uint8_t *prefix_len)
{
  if (message_len < MESSAGE_BYTES) {
    return -1;
  }
  memcpy(&key->key[0], &message[2], KEY_LEN);
  *stored = (message[0] & 0x80) ? 1 : 0;
  *prefix_min_len = message[0] & 0x07;
  *prefix_len = message[1];
  return 0;
}

/** Process all incoming key messages from this packet buffer.
 *
 * @param state
 * @param peer_context
 * @param buff
 * @param len
 * @return interesting Number of interesting key messages in the packet buffer,
 *                     -1 on error
 *                     INT_MAX if we are in sync.
 */
int sync_recv_message(struct sync_state *state, void *peer_context, const uint8_t *buff, size_t len)
{
  struct sync_peer_state *peer_state = sync_alloc_peer_state(state, peer_context);

  if (len%MESSAGE_BYTES)
    return -1;

  int stored;
  uint8_t prefix_len_min;
  key_message_t message;

  // check for root message
  // must be the one and only message
  if (len == MESSAGE_BYTES && peer_state->root) {
    sync_message_decode(buff, len, &message.key, &stored,
                        &prefix_len_min, &message.prefix_len);
    if ((memcmp(&message.key, &peer_state->root->message.key, sizeof(sync_key_t)) == 0)
        && (prefix_len_min == peer_state->root->message.min_prefix_len)
        && (message.prefix_len == peer_state->root->message.prefix_len)) {
      // We got a message that is equal to our root message.
      // We are in sync.
      return INT_MAX;
    }
  }

  int recv_count_before = peer_state->recv_count;
  while(sync_message_decode(buff, len, &message.key, &stored,
                            &prefix_len_min, &message.prefix_len) == 0) {
    message.stored = stored;
    message.min_prefix_len = prefix_len_min;

    if (recv_key(state, peer_state, &message) == -1)
      return -1;

    buff += MESSAGE_BYTES;
    len -= MESSAGE_BYTES;
  }
  return peer_state->recv_count - recv_count_before;
}
lakeman commented 7 years ago

On Mon, Mar 13, 2017 at 8:40 PM, b0661 notifications@github.com wrote:

I plan to rework my patches according to your recommendations. I wonder whether the following code is good to detect whether two peers are in sync.

  • Must the min_prefix_len and prefix_len be of a special value or just equal to the one of the root message?
  • Is it always assured that the root message will be the only message?

// decode a sync key message
int sync_message_decode(const uint8_t *message, size_t message_len, sync_key_t *key, int *stored, uint8_t *prefix_min_len, uint8_t *prefix_len)
{
  if (message_len < MESSAGE_BYTES) {
    return -1;
  }
  memcpy(&key->key[0], &message[2], KEY_LEN);
  *stored = (message[0] & 0x80) ? 1 : 0;
  *prefix_min_len = message[0] & 0x07;
  *prefix_len = message[1];
  return 0;
}

So outlining this? https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L882 Probably better to keep the argument as a single key_message_t *.

/** Process all incoming key messages from this packet buffer.
 *
 * @param state
 * @param peer_context
 * @param buff
 * @param len
 * @return interesting Number of interesting key messages in the packet buffer,
 *                     -1 on error
 *                     INT_MAX if we are in sync.
 */
int sync_recv_message(struct sync_state *state, void *peer_context, const uint8_t *buff, size_t len)
{
  struct sync_peer_state *peer_state = sync_alloc_peer_state(state, peer_context);

  if (len%MESSAGE_BYTES)
    return -1;

  int stored;
  uint8_t prefix_len_min;
  key_message_t message;

  // check for root message
  // must be the one and only message
  if (len == MESSAGE_BYTES && peer_state->root) {
    sync_message_decode(buff, len, &message.key, &stored,
                        &prefix_len_min, &message.prefix_len);
    if ((memcmp(&message.key, &peer_state->root->message.key, sizeof(sync_key_t)) == 0)
        && (prefix_len_min == peer_state->root->message.min_prefix_len)
        && (message.prefix_len == peer_state->root->message.prefix_len)) {
      // We got a message that is equal to our root message.
      // We are in sync.
      return INT_MAX;

Well, yes and no....

We don't want peers to think that they can transfer something from us that we don't have yet, so we don't add a key to our own tree until we have completed the transfer of the data. If syncing the trees was our only objective, we could simply add every key we learn about to our own tree. This would simplify the code significantly. In the meantime, we want peers to continue the sync process until we understand everything that might need to be transferred.

peer_state->root->message.key is the XOR of just the keys that are different between our tree and this peer's tree, i.e. the keys they have that we do not, and the keys we have that they do not. Picture a Venn diagram: things we know in the circle on the left, things they know in the circle on the right. We don't want or need to talk about the overlapping area of the circles. The peer_state->root tree contains all the other keys. When we have understood the differences, their root message XOR the peer_state->root->message.key should be equal to our root key.
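
In other words (an illustrative check, not code from sync_keys.c): once every difference is understood, XOR-ing the peer's root key with the accumulated difference key must reproduce our own root key.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 8  /* assumed key size for this sketch */

static bool differences_explained(const uint8_t their_root_key[KEY_LEN],
                                  const uint8_t peer_diff_key[KEY_LEN],
                                  const uint8_t our_root_key[KEY_LEN])
{
  for (size_t i = 0; i < KEY_LEN; i++)
    if ((uint8_t)(their_root_key[i] ^ peer_diff_key[i]) != our_root_key[i])
      return false;
  return true;
}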

remove_differences() XORs these keys against any incoming message to subtract that information, so we can see if there is anything new to learn from this message.

So, as I think I said some time ago, if we are looking at their root message (min_prefix_len==0), and there is no difference to ours, which we already compare here; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L740 Then we think we are synced to this peer (they might disagree).

If any message gets passed that cmp_message()==0 test, then we know that we are not synced (they might disagree).

So, I think we need to track if we are only seeing uninteresting messages (with min_prefix_len>0) or messages where we know there is a difference we can't explain yet.

If we keep making this decision; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L840 Due to receiving the same message(s) from the same peer then we need both peers to forget any nodes in this sub-tree.

What's a good way to trigger that? Well, that's complicated.



b0661 commented 7 years ago

if we are looking at their root message (min_prefix_len==0), and there is no difference to ours, which we already compare here; https://github.com/servalproject/serval-dna/blob/development/sync_keys.c#L740 Then we think we are synced to this peer (they might disagree). Probably better to keep the argument as a single key_message_t *.

Is this what you mean?

/** Process one incoming tree record.
 *
 * @param state
 * @param peer_state
 * @param message
 * @return status INT_MAX if we are synced to this peer, 0 otherwise.
 */
static int recv_key(struct sync_state *state, struct sync_peer_state *peer_state, const key_message_t *message)
{
  // sanity check on two header bytes.
...
    // Nothing to do if we understand the rest of the differences
    if (cmp_message(&peer_message, &node->message)==0){
      state->received_uninteresting++;
      if (node == state->root && peer_message.min_prefix_len == 0)
        // We are synced to this peer (they might disagree).
        return INT_MAX;
      return 0;
    }
...

/** Process all incoming key messages from this packet buffer.
 *
 * @param state
 * @param peer_context
 * @param buff
 * @param len
 * @return interesting Number of interesting key messages in the packet buffer,
 *                     -1 on error
 *                     INT_MAX if we are synced to the peer.
 */
int sync_recv_message(struct sync_state *state, void *peer_context, const uint8_t *buff, size_t len)
{
  struct sync_peer_state *peer_state = sync_alloc_peer_state(state, peer_context);

  if (len%MESSAGE_BYTES)
    return -1;

  int status = 0;
  int recv_count_before = peer_state->recv_count;
  size_t offset=0;
  while(offset + MESSAGE_BYTES<=len){
    const uint8_t *p = &buff[offset];
    key_message_t message;
    bzero(&message, sizeof message);

    message.stored = (p[0]&0x80)?1:0;
    message.min_prefix_len = p[0]&0x7F;
    message.prefix_len = p[1];
    memcpy(&message.key.key[0], &p[2], KEY_LEN);

    status = recv_key(state, peer_state, &message);
    if (status < 0)
      return -1;

    offset+=MESSAGE_BYTES;
  }

  if (status == INT_MAX)
    // We are synced to this peer (they might disagree).
    return INT_MAX;
  return peer_state->recv_count - recv_count_before;
}

So, I think we need to track if we are only seeing uninteresting messages (with min_prefix_len>0) or messages where we know there is a difference we can't explain yet.

Can this also be done with the simpler detection mechanism described below? It may be improved later on.

We expect the sync consensus to improve over time. Consensus improves if we receive an interesting message or we get the synced status (indicated by INT_MAX above). If the consensus does not improve for a certain time (e.g. 60 seconds) we trigger a peer to peer resync.

we need both peers to forget any nodes in this sub-tree.

Can this also be simplified by wiping out the whole peer state, as is done in the following code? It may be improved later on by exchanging the sub-trees to forget.

static void sync_keys_peer_state_resync(struct rhizome_sync_keys_peer_state *sync_keys_peer_state)
{
  DEBUGF(rhizome_sync_keys,"Resync (drop) peer synchronisation state of %s.", alloca_tohex_sid_t(sync_keys_peer_state->peer->sid));
  // Drop sync state of this peer.
  sync_free_peer_state(sync_keys_control.tree, sync_keys_peer_state->peer);
  sync_alloc_peer_state(sync_keys_control.tree, sync_keys_peer_state->peer);
  sync_keys_peer_state->sync_consensus_time = gettime_ms();
...

To ensure that both peers wipe out the state, I do a 3-way handshake on the MSP connection.

I tested the mechanism by setting the consensus timeout to 20 seconds and running my 10-peer big-bundle transfer test. I could see two peer-to-peer re-syncs. From this very rudimentary test I can at least tell that the peer-to-peer re-sync does not disturb any transfers or the overall sync operation.

All other rhizomeprotocol tests also pass. I still have to find a test that creates a situation where the sync operation stalls.

b0661 commented 7 years ago

@lakeman, do you have any comments on my comment from the 14th of March?

lakeman commented 7 years ago

I'm loath to wipe peer state unless we can somehow prove that the state is actually corrupt. If we have thousands of bundles, rebuilding that state could involve a lot of traffic.

One of your test cases must be;

So the protocol might be;