Open nikosft opened 2 years ago
The scope of the lock is too long, unfortunately, and it may be held for an arbitrarily long time.
So we need a better solution.
In order to determine whether the bug is there, I used a variable that counted the records sent through peerOut. So the validation code looked similar to the following:
if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
        // count every provider actually delivered to the caller
        returnedItems++
    case <-ctx.Done():
        logger.Debug("context timed out sending more providers")
        // a provider was added to the peer set but never sent back
        if returnedItems < ps.Size() {
            logger.Debug("!!!Something strange happened!!!")
        }
        return nil, ctx.Err()
    }
}
Maybe try something like that, i.e., add somewhere a condition that checks how many providers we have sent back before closing the channel.
@aschmahmann thoughts on the matter?
@vyzo without looking too deeply at this, I'd believe that a context cancellation race could occur here. There's a tradeoff around whether the channel is allowed to back up and hold a goroutine open due to backpressure, or whether there's a maximal amount of waiting time. For example, if findProvidersAsync used locking and returned a channel that didn't close until the caller had read everything off it, it could be waiting for a while. That's not necessarily a bad thing, but it's a tradeoff that doesn't seem very worthwhile here.
This doesn't seem like a high-priority item, but a fix that doesn't involve an excessive amount of locking seems reasonable. One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled, and if it was due to a Stop condition then decide when to abort. We'd have to decide what the abort semantics are, though: for example, whether to still deliver the provider that was just added before closing the channel, or to drop it.
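A rough sketch of that idea, reusing the identifiers from the snippets in this thread (ps, prov, peerOut) plus two hypothetical ones, lookupStopped (a channel signalling that the stop condition fired) and callerCtx (the caller's own context). This is only an illustration of the abort-semantics question, not the actual routing.go structure:

if ps.TryAdd(prov.ID) {
    select {
    case peerOut <- *prov:
        // delivered normally
    case <-lookupStopped: // hypothetical: cancelled because ps reached the requested count
        // The caller still wants this provider, so make one last attempt
        // to deliver it before shutting the query down.
        select {
        case peerOut <- *prov:
        case <-callerCtx.Done():
            return nil, callerCtx.Err()
        }
    case <-callerCtx.Done(): // the caller itself has gone away
        return nil, callerCtx.Err()
    }
}

The open question is exactly the inner branch: block until the caller reads the value, try once and then give up, or drop it outright.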
WDYT?
I ran go test . -run TestProvidesMany -count=100 and three tests failed (with the error reported in issue #726). I modified
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L533
as follows:
if ps.TryAdd(prov.ID) {
    // always deliver the provider, even if the stop condition has already fired
    peerOut <- *prov
}
This removes the race condition. Then I ran go test . -run TestProvidesMany -count=100 three times and got no errors. I have not tested it extensively, but there is a chance that issue #726 is related to this one.
@aschmahmann WRT
One fix that comes to mind is adding a condition in the cancel case that checks why it was cancelled
The context included in the select statement is created here: https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/query.go#L108 and it is always stopped when the requested number of providers is found, so it is not an exceptional case. This context is stopped with every FindProvidersAsync call and, at least in our experiments, the race condition appears approximately once every 1000 requests.
I don't think we should hang the goroutine -- immediately sending down the channel seems like the sanest option.
TL;DR There is a race condition that causes FindProvidersAsync to miss a provider
I was experimenting with testground using a lossless network and TCP. In my experiment I am using 200 nodes, each providing a record; then each node executes FindProvidersAsync for all 200 records. In many runs, 3-4 out of the 40000 requests were not successful, but this shouldn't happen.
Through trial and error I figured out that this was caused here
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L533
ps.TryAdd(prov.ID) increases the size of ps, and this is a sufficient condition for this function to return true:
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L559
However, this causes followUpCtx to be cancelled, hence the code jumps to
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L537
If this happens fast enough, the provider may never be returned, since the provider is sent back in the previous line:
https://github.com/libp2p/go-libp2p-kad-dht/blob/5a8437e5aece37e5b4912f8e00dceb1f68f60cba/routing.go#L536
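To make the race concrete, here is a minimal, self-contained sketch of the pattern (illustrative only, not the actual routing.go code): the cancellation triggered by the very insertion that produced a value races with the send of that value, and the select occasionally takes the Done branch and drops it.

package main

import (
    "context"
    "fmt"
    "sync"
)

func main() {
    dropped := 0
    for i := 0; i < 10000; i++ {
        ctx, cancel := context.WithCancel(context.Background())
        out := make(chan int) // plays the role of peerOut

        got := 0
        var wg sync.WaitGroup
        wg.Add(1)
        go func() {
            defer wg.Done()
            // The caller drains the channel until it is closed.
            for range out {
                got++
            }
        }()

        // The "stop condition" cancels the context concurrently with the send below,
        // just as adding the final provider to ps cancels followUpCtx.
        go cancel()

        select {
        case out <- 1: // provider delivered
        case <-ctx.Done(): // provider silently dropped
        }
        close(out)
        wg.Wait()

        if got == 0 {
            dropped++
        }
    }
    fmt.Printf("provider dropped in %d of 10000 iterations\n", dropped)
}

Run enough iterations and a handful of them lose the value, which is consistent with the 3-4 failures out of 40000 requests reported above.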
I (probably) solved this problem by using a lock, so lines 533-541 in my code look like this:
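peersetMutex.Lock()
if ps.TryAdd(prov.ID) {
    logger.Debugf("using provider: %s", prov)
    select {
    case peerOut <- *prov:
        sentprov = 1
    case <-ctx.Done():
        logger.Debug("context timed out sending more providers")
        return nil, ctx.Err()
    }
}
peersetMutex.Unlock()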
And then I modified the function in line 559 as follows:
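func() bool {
    peersetMutex.Lock()
    output := !findAll && ps.Size() >= count
    peersetMutex.Unlock()
    return output
}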