project-iris / iris

Decentralized cloud messaging
iris.karalabe.com

Overlay message loss during churn (graceful shutdown) #12

Open karalabe opened 10 years ago

karalabe commented 10 years ago

Currently the pastry overlay has no graceful shutdown mechanism to avoid message loss. Graceful termination has already been implemented at the lower session level, but a leave operation should be added to pastry to stop peers from sending further messages to a departing node.

The test in proto/pastry/routing_test.go has been extended to simulate messaging during churn, but a powerful enough machine is needed to actually catch the bug (i.e. enough cores for one of them to linger at exactly the "wrong" place during shutdown).
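To make the proposed leave operation concrete, here is a minimal sketch under stated assumptions. It is not the Iris implementation: Node, Deliver, Leave and the notify callback are hypothetical names invented for illustration. The idea is that a leaving node first refuses new messages, then tells its peers, and only then waits for in-flight deliveries to finish.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errLeaving tells a sender to re-route instead of dropping the message.
var errLeaving = errors.New("node is leaving the overlay")

// Node is a hypothetical overlay node, not a type from the Iris code base.
type Node struct {
	mu        sync.Mutex
	leaving   bool
	inflight  sync.WaitGroup
	delivered []string
}

// Deliver accepts a message unless the node has already started to leave.
func (n *Node) Deliver(msg string) error {
	n.mu.Lock()
	if n.leaving {
		n.mu.Unlock()
		return errLeaving
	}
	// Registered under the lock, so Leave cannot miss this delivery.
	n.inflight.Add(1)
	n.mu.Unlock()
	defer n.inflight.Done()

	n.mu.Lock()
	n.delivered = append(n.delivered, msg)
	n.mu.Unlock()
	return nil
}

// Leave stops accepting messages, notifies the peers through the
// (assumed) notify callback and waits for in-flight deliveries to drain.
func (n *Node) Leave(notify func()) {
	n.mu.Lock()
	n.leaving = true
	n.mu.Unlock()

	notify()
	n.inflight.Wait()
}

func main() {
	n := &Node{}
	fmt.Println(n.Deliver("before leave")) // <nil>
	n.Leave(func() { fmt.Println("leave notice sent to peers") })
	fmt.Println(n.Deliver("after leave")) // node is leaving the overlay
}
```

Returning an explicit error instead of silently discarding the message lets the sender pick another route, which is the property a graceful leave is meant to provide.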

karalabe commented 10 years ago

Hmm, actually the test itself was racy. So we still need a proper test that manages to lose a message during churn.

karalabe commented 10 years ago

Tear-down seems to be fairly stable now. There are probably still places where messages can disappear, but without some sophisticated large-scale tests it will be hard to catch them.

On the other hand, overlay joins are still potential message sinks, as the joining node is immediately inserted into active routing tables even though it is still initializing itself. There is a glaring race condition when other nodes start sending the joiner messages to forward while the joiner's routing table is still empty.

A two-phase join should theoretically solve this: the node issues a pastry join as before, but that alone should not put it into active routing tables. Instead, once the node has converged, it would send an activation message to all connected peers, and only then would they start routing through it.
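The two-phase join described above could look roughly like the following on the receiving side. This is only a sketch under stated assumptions: RoutingTable, Join, Activate and Routable are hypothetical names, and the table is reduced to a set of peer states rather than a real pastry routing table.

```go
package main

import (
	"fmt"
	"sync"
)

// peerState is a hypothetical marker for a peer's progress through the join.
type peerState int

const (
	statePending peerState = iota // joined, but still converging
	stateActive                   // sent its activation message
)

// RoutingTable is a simplified stand-in that only tracks which peers
// may be used as next hops.
type RoutingTable struct {
	mu    sync.RWMutex
	peers map[string]peerState
}

func NewRoutingTable() *RoutingTable {
	return &RoutingTable{peers: make(map[string]peerState)}
}

// Join is phase one: the peer is recorded but not yet routed through.
func (t *RoutingTable) Join(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.peers[id]; !ok {
		t.peers[id] = statePending
	}
}

// Activate is phase two, run when the peer's activation message arrives.
// It reports false for a peer that never joined.
func (t *RoutingTable) Activate(id string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.peers[id]; !ok {
		return false
	}
	t.peers[id] = stateActive
	return true
}

// Routable reports whether messages may be forwarded through the peer.
func (t *RoutingTable) Routable(id string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	return t.peers[id] == stateActive
}

func main() {
	table := NewRoutingTable()
	table.Join("joiner")
	fmt.Println(table.Routable("joiner")) // false
	table.Activate("joiner")
	fmt.Println(table.Routable("joiner")) // true
}
```

Keeping the pending peer in the table, rather than ignoring the join entirely, means the existing nodes can still exchange state with the joiner while it converges without ever picking it as a next hop.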