libp2p / go-libp2p-kad-dht

A Kademlia DHT implementation on go-libp2p
https://github.com/libp2p/specs/tree/master/kad-dht
MIT License

FindProvidersAsync misses a provider #750

Open nikosft opened 2 years ago

nikosft commented 2 years ago

TL;DR There is a race condition that causes FindProvidersAsync to miss a provider

I was experimenting with testground using a lossless network and TCP. In my experiment I use 200 nodes, each providing a record, and then each node executes FindProvidersAsync for all 200 records. Across runs, 3-4 out of the 40,000 requests were unsuccessful, which shouldn't happen.

Through trial and error I figured out that the problem is caused here:

https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L533

ps.TryAdd(prov.ID) increases the size of ps, and this is a sufficient condition for this function to return true:

https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L559

However, this causes followUpCtx to be cancelled, hence the code jumps to

https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L537

If this cancellation happens quickly enough, the provider may never be returned to the caller, since the send to the output channel happens in the previous line:

https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L536
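
To make the interaction concrete, here is a minimal, self-contained sketch of the pattern, not the library's actual code: a producer adds a provider to the peer set and then tries to send it on the output channel, while a stop-condition watcher cancels the context as soon as the set is large enough. Because TryAdd runs before the send, the cancellation can win the select and the already-counted provider is never delivered. Like the real bug, the race only manifests occasionally, so the demo has to be run repeatedly to observe it.

package main

import (
    "context"
    "fmt"
    "sync"
    "time"
)

// peerSet is a stand-in for the peer set (ps) used by the lookup.
type peerSet struct {
    mu    sync.Mutex
    peers map[string]struct{}
}

func (ps *peerSet) TryAdd(id string) bool {
    ps.mu.Lock()
    defer ps.mu.Unlock()
    if _, ok := ps.peers[id]; ok {
        return false
    }
    ps.peers[id] = struct{}{}
    return true
}

func (ps *peerSet) Size() int {
    ps.mu.Lock()
    defer ps.mu.Unlock()
    return len(ps.peers)
}

func main() {
    const count = 1
    ps := &peerSet{peers: map[string]struct{}{}}
    peerOut := make(chan string)
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Stop-condition watcher: cancels the lookup as soon as enough providers
    // are in the set, mirroring the stop function at routing.go#L559.
    go func() {
        for ps.Size() < count {
            time.Sleep(time.Microsecond)
        }
        cancel()
    }()

    // Producer: mirrors the shape of routing.go#L533-L541. TryAdd makes the
    // stop condition true *before* the provider is sent, so the ctx.Done()
    // case can win the select even though the consumer below is ready.
    go func() {
        if ps.TryAdd("provider-1") {
            select {
            case peerOut <- "provider-1":
            case <-ctx.Done():
                fmt.Println("provider counted in ps but never delivered")
            }
        }
        close(peerOut)
    }()

    for p := range peerOut {
        fmt.Println("got provider:", p)
    }
}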

I (probably) solved this problem by using a lock. So lines 533-541 in my code look like this

peersetMutex.Lock()
if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
    case <-ctx.Done():
        logger.Debug("context timed out sending more providers")
        peersetMutex.Unlock() // release the lock before the early return
        return nil, ctx.Err()
    }
}
peersetMutex.Unlock()

And then I modified the function in line 559 as follows:

func() bool {
    peersetMutex.Lock()
    output := !findAll && ps.Size() >= count
    peersetMutex.Unlock()
    return output
}
vyzo commented 2 years ago

The scope of the lock is too long unfortunately, and it may be held for an arbitrary time period.

So we need a better solution.


nikosft commented 2 years ago

To determine whether the bug is there, I used a variable that counts the records sent through peerOut. The validation code looked similar to the following:

if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
        returnedItems++
    case <-ctx.Done():
        logger.Debug("context timed out sending more providers")
        if returnedItems < ps.Size() {
            logger.Debug("!!!Something strange happened!!!")
        }
        return nil, ctx.Err()
    }
}

Maybe try something like that, i.e. add a condition somewhere that checks how many providers we have sent back before closing the channel.

vyzo commented 2 years ago

@aschmahmann thoughts on the matter?

aschmahmann commented 2 years ago

@vyzo without looking too deeply at this I'd say I believe a context cancellation race could occur here. There's a tradeoff around whether the channel is allowed to back up and hold open a goroutine due to backpressure, or whether there's a maximum amount of waiting time. For example, with locking, if FindProvidersAsync returned a channel that didn't close until the caller had read everything off it, the routine could be waiting for a while. That's not necessarily a bad thing, but it's a tradeoff that doesn't seem very worthwhile here.

This doesn't seem like a high-priority item, but a fix that doesn't involve an excessive amount of locking seems reasonable. One fix that comes to mind is adding a condition in the cancel case that checks why the context was cancelled and, if it was due to a Stop condition, then deciding when to abort. We'd have to decide what the abort semantics are, though.
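
A very rough sketch of that direction, purely hypothetical and glossing over how the two contexts would actually be threaded through (here followUpCtx stands for the context cancelled by the stop condition and ctx for the caller's context):

if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
    case <-followUpCtx.Done():
        // The lookup was stopped; if the caller's context is still live, the
        // cancellation came from the stop condition (enough providers found),
        // so make one more attempt to deliver the provider whose TryAdd
        // triggered the stop. This picks one possible abort semantic: wait
        // until the caller reads it or the caller's own context ends, which
        // re-introduces the backpressure tradeoff mentioned above.
        if ctx.Err() == nil {
            select {
            case peerOut <- *prov:
            case <-ctx.Done():
            }
        }
        return nil, followUpCtx.Err()
    }
}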

WDYT?

nikosft commented 2 years ago

I ran go test . -run TestProvidesMany -count=100 and three tests failed (with the error reported in issue #726). I modified https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L533 as follows:

if ps.TryAdd(prov.ID) {
    peerOut <- *prov
}

This removes the race condition. I then ran go test . -run TestProvidesMany -count=100 three more times and got no errors. I did not test it extensively, but there is a chance that issue #726 is related to this one.

@aschmahmann WRT

One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled

The context used in the select statement is created here https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/query.go#L108 and it is always cancelled when the requested number of providers has been found, so this is not an exceptional case. That context is cancelled on every FindProvidersAsync call and, at least in our experiments, the race condition appears approximately once every 1000 requests.

vyzo commented 2 years ago

I don't think we should hang the goroutine -- immediately sending down the channel seems like the sanest option.
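
For concreteness, that is essentially the simplification tested above; whether the unconditional send can block the lookup goroutine depends on how peerOut is sized, so the sketch below assumes (rather than verifies) that the channel is buffered for the providers we intend to return:

// Assumption: peerOut has enough buffer capacity for the providers we intend
// to return (e.g. count entries), so this send cannot block the lookup
// goroutine even if the caller stops reading early.
if ps.TryAdd(prov.ID) {
    peerOut <- *prov
}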