renproject / aw

A flexible P2P networking library for upgradable distributed systems.
MIT License

Incorrect usage of timeouts in gossip #70

Open rahulghangas opened 3 years ago

rahulghangas commented 3 years ago

While gossiping, two different timeouts are used: the one on the context the caller passes to Gossip, and the internal g.opts.Timeout.

The first one defines the total time allowed for a round of gossiping, while the second one defines the timeout for sending each individual message. When a sync message is received, the gossiper tries to propagate the new message by re-gossiping it. However, at that point the gossiper has no notion of the timeout supplied by the user (and that particular context has probably gone out of scope!). Currently we pass the second timeout to the call to Gossip, but the whole round of gossiping might fail if the first attempt to message a peer fails, since that single attempt uses the same timeout.

jazg commented 3 years ago

@rahulghangas Not sure I fully understand the issue/implication of the dual timeouts. Do you mind listing out a concrete example?

rahulghangas commented 3 years ago

The call to gossip has the following signature

func (g *Gossiper) Gossip(ctx context.Context, contentID []byte, subnet *id.Hash)

where the context provided usually has a timeout of, say, 5-10 seconds. On the other hand, if you look at didReceiveSync:

func (g *Gossiper) didReceiveSync(from id.Signatory, msg wire.Msg) {
	...
	...
	...
	ctx, cancel := context.WithTimeout(context.Background(), g.opts.Timeout)
	defer cancel()

	g.Gossip(ctx, msg.Data, &subnet)

After receiving the sync message, we re-gossip it, but using the internal timeout. The internal timeout is supposed to bound the time taken to send a single message, not a whole round of gossiping.
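
To make the failure mode concrete, here is a minimal standalone sketch. It does not use the aw API: `send`, `gossipSequential`, and `demoSequential` are hypothetical stand-ins, the latencies are made up, and it assumes peers are messaged one after another under a single shared context. With a deadline sized for one message, an unresponsive first peer consumes the whole budget and the remaining peers are never reached.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// send is a hypothetical stand-in for delivering one message to one peer.
// It blocks for latency, or returns early with the context error if the
// context is done first.
func send(ctx context.Context, latency time.Duration) error {
	if err := ctx.Err(); err != nil {
		return err // the shared deadline has already passed
	}
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// gossipSequential messages the peers one after another under one shared
// context and returns how many sends succeeded. Sequential sending is an
// assumption made for illustration.
func gossipSequential(ctx context.Context, latencies []time.Duration) int {
	delivered := 0
	for _, latency := range latencies {
		if err := send(ctx, latency); err == nil {
			delivered++
		}
	}
	return delivered
}

// demoSequential runs a round in which the first peer is unresponsive and
// the round deadline equals the per-message timeout.
func demoSequential() int {
	perMessage := 50 * time.Millisecond // plays the role of the internal timeout
	latencies := []time.Duration{time.Second, time.Millisecond, time.Millisecond}

	ctx, cancel := context.WithTimeout(context.Background(), perMessage)
	defer cancel()
	return gossipSequential(ctx, latencies)
}

func main() {
	fmt.Println("delivered:", demoSequential())
}
```

Under these assumed latencies no message is delivered, even though two of the three peers would have answered within a millisecond.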

jazg commented 3 years ago

@rahulghangas Got it. Could we send to the recipients in parallel inside the Gossip function to resolve this? I will check with @loongy to gauge the intended usage of the timeout.

rahulghangas commented 3 years ago

That could potentially solve the issue. The call to Gossip would then no longer need a context as input, since we would use the internal timeout to define a context and apply it to all messages, which are sent in parallel.
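
A rough sketch of that idea, again as a standalone program and not the aw implementation: `send`, `gossipParallel`, and `demoParallel` are hypothetical names, and the latencies are invented. One context is derived from the per-message timeout and shared by all sends, which run concurrently, so a slow peer no longer eats into the time available to the others.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// send is a hypothetical stand-in for delivering one message to one peer.
func send(ctx context.Context, latency time.Duration) error {
	select {
	case <-time.After(latency):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// gossipParallel derives a single context from the per-message timeout and
// uses it for every send. Because the sends run concurrently, each one
// effectively gets the full timeout. It returns the number of successes.
func gossipParallel(timeout time.Duration, latencies []time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		delivered int
	)
	for _, latency := range latencies {
		wg.Add(1)
		go func(latency time.Duration) {
			defer wg.Done()
			if err := send(ctx, latency); err == nil {
				mu.Lock()
				delivered++
				mu.Unlock()
			}
		}(latency)
	}
	wg.Wait() // returns once every send has succeeded or timed out
	return delivered
}

// demoParallel uses one unresponsive peer and two fast ones.
func demoParallel() int {
	latencies := []time.Duration{2 * time.Second, time.Millisecond, time.Millisecond}
	return gossipParallel(200*time.Millisecond, latencies)
}

func main() {
	fmt.Println("delivered:", demoParallel())
}
```

The trade-off in this sketch is that the caller gives up control over the round's deadline: the whole round now takes at most one per-message timeout, regardless of how many peers are contacted.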